Interpretability Framework for LLMs in Undergraduate Calculus

2510.17910v1 cs.CY, cs.AI, cs.CL 2025-10-23

Authors:

Sagnik Dakshit, Sushmita Sinha Roy

Abstract

Large Language Models (LLMs) are increasingly used in education, yet correctness alone does not capture the quality, reliability, or pedagogical validity of their problem-solving behavior. This is especially true in mathematics, where multistep logic, symbolic reasoning, and conceptual clarity are critical. Conventional evaluation methods largely focus on final-answer accuracy and overlook the reasoning process. To address this gap, we introduce a novel interpretability framework for analyzing LLM-generated solutions, using undergraduate calculus problems as a representative domain. Our approach combines reasoning flow extraction and the decomposition of solutions into semantically labeled operations and concepts with prompt ablation analysis, which assesses input salience and output stability. Using structured metrics such as reasoning complexity, phrase sensitivity, and robustness, we evaluate model behavior on real Calculus I to III university exams. Our findings reveal that LLMs often produce syntactically fluent yet conceptually flawed solutions, with reasoning patterns that are sensitive to prompt phrasing and input variation. The framework enables fine-grained diagnosis of reasoning failures, supports curriculum alignment, and informs the design of interpretable AI-assisted feedback tools. This is the first study to offer a structured, quantitative, and pedagogically grounded framework for interpreting LLM reasoning in mathematics education, laying the foundation for the transparent and responsible deployment of AI in STEM learning environments.
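As a rough illustration of the prompt ablation idea mentioned in the abstract, the sketch below removes one phrase at a time from a prompt and measures how much the output changes. The abstract does not define the metrics, so everything here is an assumption made for illustration: the function names `ablate_phrases` and `phrase_sensitivity`, the use of string similarity as the change measure, and the `toy_model` stand-in for a real LLM call.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List


def ablate_phrases(prompt: str, phrases: List[str]) -> Dict[str, str]:
    """Return one prompt variant per phrase, with that phrase removed."""
    variants = {}
    for phrase in phrases:
        if phrase in prompt:
            # Remove the phrase and collapse the leftover whitespace.
            variants[phrase] = " ".join(prompt.replace(phrase, "").split())
    return variants


def phrase_sensitivity(
    prompt: str, phrases: List[str], model: Callable[[str], str]
) -> Dict[str, float]:
    """Score each phrase by how much its removal changes the model output.

    Assumed definition (not taken from the paper): 1 minus the string
    similarity between the baseline output and the ablated output.
    0.0 means the output is unchanged; values near 1.0 mean it changed a lot.
    """
    baseline = model(prompt)
    scores = {}
    for phrase, variant in ablate_phrases(prompt, phrases).items():
        similarity = SequenceMatcher(None, baseline, model(variant)).ratio()
        scores[phrase] = 1.0 - similarity
    return scores


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call, used only to make this runnable."""
    if "derivative" in prompt:
        return "Apply the power rule: 2x"
    return "Unable to determine the task"


if __name__ == "__main__":
    prompt = "Find the derivative of x^2 with respect to x"
    phrases = ["derivative", "with respect to x"]
    for phrase, score in phrase_sensitivity(prompt, phrases, toy_model).items():
        print(f"{phrase!r}: sensitivity {score:.2f}")
```

In this toy setup, removing "derivative" changes the output while removing "with respect to x" does not, so the first phrase receives the higher score. A real analysis would replace `toy_model` with an actual model call and would likely compare reasoning steps rather than raw strings.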

Related articles

Can machines perform a qualitative data analysis? Reading the debate with Alan T...

Large Language Models' Complicit Responses to Illicit Instructions across Socio-...

A Cross-Cultural Assessment of Human Ability to Detect LLM-Generated Fake News a...

Place Matters: Comparing LLM Hallucination Rates for Place-Based Legal Queries

AI-generated podcasts: Synthetic Intimacy and Cultural Translation in NotebookLM...
