CogMath: Assessing LLMs’ Authentic Mathematical Ability from a Human Cognitive Perspective

Authors: Jiayu Liu, Zhenya Huang, Wei Dai, Cheng Cheng, Jinze Wu, Jing Sha, Song Li, Qi Liu, Shijin Wang, Enhong Chen

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | By applying CogMath on three benchmarks, we reveal that the mathematical capabilities of 7 mainstream LLMs are overestimated by 30%-40%. Moreover, we locate their strengths and weaknesses across specific stages/dimensions, offering in-depth insights to further enhance their reasoning abilities.
Researcher Affiliation | Collaboration | (1) School of Artificial Intelligence and Data Science, University of Science and Technology of China; (2) State Key Laboratory of Cognitive Intelligence; (3) School of Computer Science and Technology, University of Science and Technology of China; (4) iFLYTEK AI Research. Correspondence to: Enhong Chen <EMAIL>.
Pseudocode | No | The paper describes the Inquiry-Judge-Reference multi-agent system in Section 3 and provides detailed prompts for the agents in Appendix B, but it does not contain explicitly labeled pseudocode or algorithm blocks for the system's overall operation or the agents' internal logic.
Open Source Code | Yes | Our code and data are available at https://github.com/Ljyustc/CogMath.
Open Datasets | Yes | We apply CogMath on two of the most representative mathematical benchmarks, GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021), along with our constructed MExam dataset.
Dataset Splits | Yes | For GSM8K and MATH, since their training sets may have already been used in the training process of current LLMs, we apply CogMath on their public test sets, which contain 1,319 and 5,000 questions, respectively.
Hardware Specification | No | The paper mentions using LLMs such as GPT-4, GPT-3.5-Turbo, Gemini-1.5-Flash, DeepSeek-V2.5, Llama3-8B, Llama2-13B, and Mixtral-8x7B-Instruct, and states that "All the Inquiry agents, Reference agents, and Judge agents are implemented with GPT-4". However, it does not specify any hardware details (e.g., GPU/CPU models, memory) used for running the experiments or for implementing the agents beyond the LLM models themselves.
Software Dependencies | No | The paper states that "All the Inquiry agents, Reference agents, and Judge agents are implemented with GPT-4," but does not provide specific software dependencies with version numbers for the overall experimental setup.
Experiment Setup | Yes | The maximum number of iterations for the Inquiry agent is set to δ = 10. If after 10 iterations we still fail to obtain a satisfactory inquiry, we consider the problem unsuitable for evaluation from that dimension, and we omit that dimension during the evaluation. For ICL, we adopt a one-shot setting where, for each dimension i, we randomly sample a problem P_i from the training set and use CogMath to construct an (inquiry q_P^i, answer a_P^i) pair as the demonstration.
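The Experiment Setup row above describes a retry loop: the Inquiry agent proposes a candidate inquiry for a dimension, a judge decides whether it is satisfactory, and after δ = 10 failed attempts the dimension is omitted for that problem. A minimal sketch of that control flow is below; the agent callables (`inquiry_agent`, `judge_agent`, `make_toy_agents`) are hypothetical stand-ins for the GPT-4-backed prompts the paper uses, not the authors' actual implementation.

```python
MAX_ITERATIONS = 10  # delta = 10, per the experiment setup above

def generate_inquiry(problem, dimension, inquiry_agent, judge_agent):
    """Return an accepted inquiry, or None if the dimension is unsuitable.

    inquiry_agent(problem, dimension, feedback) -> candidate inquiry string
    judge_agent(problem, dimension, candidate) -> (accepted: bool, feedback: str)
    Both are hypothetical interfaces assumed for this sketch.
    """
    feedback = None
    for _ in range(MAX_ITERATIONS):
        candidate = inquiry_agent(problem, dimension, feedback)
        accepted, feedback = judge_agent(problem, dimension, candidate)
        if accepted:
            return candidate
    # After delta failures, this dimension is omitted during evaluation.
    return None

def make_toy_agents():
    """Toy agents for illustration: the judge accepts the third attempt."""
    attempts = {"n": 0}
    def inquiry(problem, dimension, feedback):
        attempts["n"] += 1
        return f"inquiry #{attempts['n']} for {dimension}"
    def judge(problem, dimension, candidate):
        return ("#3" in candidate), "rephrase the question"
    return inquiry, judge
```

For example, `generate_inquiry("P", "decomposition", *make_toy_agents())` succeeds on the third attempt, while a judge that never accepts yields `None` after 10 tries, mirroring the paper's rule of dropping that dimension for that problem.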