CogMath: Assessing LLMs’ Authentic Mathematical Ability from a Human Cognitive Perspective

Authors: Jiayu Liu, Zhenya Huang, Wei Dai, Cheng Cheng, Jinze Wu, Jing Sha, Song Li, Qi Liu, Shijin Wang, Enhong Chen

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | By applying CogMath on three benchmarks, we reveal that the mathematical capabilities of 7 mainstream LLMs are overestimated by 30%-40%. Moreover, we locate their strengths and weaknesses across specific stages/dimensions, offering in-depth insights to further enhance their reasoning abilities.
Researcher Affiliation | Collaboration | (1) School of Artificial Intelligence and Data Science, University of Science and Technology of China; (2) State Key Laboratory of Cognitive Intelligence; (3) School of Computer Science and Technology, University of Science and Technology of China; (4) iFLYTEK AI Research. Correspondence to: Enhong Chen <EMAIL>.
Pseudocode | No | The paper describes the Inquiry-Judge-Reference multi-agent system in Section 3 and provides detailed prompts for the agents in Appendix B, but it does not contain explicitly labeled pseudocode or algorithm blocks for the system's overall operation or the agents' internal logic.
Open Source Code | Yes | Our code and data are available at https://github.com/Ljyustc/CogMath.
Open Datasets | Yes | We apply CogMath on two of the most representative mathematical benchmarks, GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021), along with our constructed MExam dataset.
Dataset Splits | Yes | For GSM8K and MATH, since their training sets may have already been used in the training process of current LLMs, we apply CogMath on their public test sets, which contain 1,319 and 5,000 questions, respectively.
Hardware Specification | No | The paper mentions using LLMs such as GPT-4, GPT-3.5-Turbo, Gemini-1.5-Flash, DeepSeek-V2.5, Llama3-8B, Llama2-13B, and Mixtral-8x7B-Instruct, and states that "All the Inquiry agents, Reference agents, and Judge agents are implemented with GPT-4". However, it does not specify any hardware details (e.g., GPU/CPU models, memory) used for running the experiments or for implementing the agents beyond the LLM models themselves.
Software Dependencies | No | The paper states that "All the Inquiry agents, Reference agents, and Judge agents are implemented with GPT-4," but does not provide specific software dependencies with version numbers for the overall experimental setup.
Experiment Setup | Yes | The maximum number of iterations for the Inquiry agent is set to δ = 10. If after 10 iterations we still fail to obtain a satisfactory inquiry, we consider the problem unsuitable for evaluation from that dimension, and we omit that dimension during the evaluation. For ICL, we adopt a one-shot setting where, for each dimension i, we randomly sample a problem P_i from the training set and use CogMath to construct an (inquiry q_P^i, answer a_P^i) pair as the demonstration.
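The Experiment Setup row above describes a retry loop: the Inquiry agent proposes a candidate inquiry for a dimension, a judge decides whether it is satisfactory, and after δ = 10 failed attempts the dimension is omitted for that problem. A minimal sketch of that control flow is below; the agent callables (`inquiry_agent`, `judge_agent`, `make_toy_agents`) are hypothetical stand-ins for the GPT-4-backed prompts the paper uses, not the authors' actual implementation.

```python
MAX_ITERATIONS = 10  # delta = 10, per the experiment setup above

def generate_inquiry(problem, dimension, inquiry_agent, judge_agent):
    """Return an accepted inquiry, or None if the dimension is unsuitable.

    inquiry_agent(problem, dimension, feedback) -> candidate inquiry string
    judge_agent(problem, dimension, candidate) -> (accepted: bool, feedback: str)
    Both are hypothetical interfaces assumed for this sketch.
    """
    feedback = None
    for _ in range(MAX_ITERATIONS):
        candidate = inquiry_agent(problem, dimension, feedback)
        accepted, feedback = judge_agent(problem, dimension, candidate)
        if accepted:
            return candidate
    # After delta failures, this dimension is omitted during evaluation.
    return None

def make_toy_agents():
    """Toy agents for illustration: the judge accepts the third attempt."""
    attempts = {"n": 0}
    def inquiry(problem, dimension, feedback):
        attempts["n"] += 1
        return f"inquiry #{attempts['n']} for {dimension}"
    def judge(problem, dimension, candidate):
        return ("#3" in candidate), "rephrase the question"
    return inquiry, judge
```

For example, `generate_inquiry("P", "decomposition", *make_toy_agents())` succeeds on the third attempt, while a judge that never accepts yields `None` after 10 tries, mirroring the paper's rule of dropping that dimension for that problem.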