GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Authors: Iman Mirzadeh, Keivan Alizadeh-Vahid, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, Mehrdad Farajtabar

ICLR 2025

Reproducibility variables, results, and supporting excerpts:
Research Type: Experimental. Evidence: "Our large-scale study on 25 state-of-the-art open and closed models provides significant insights into LLMs' behavior in mathematical reasoning tasks. We question the reliability of currently reported results on GSM8K and demonstrate that the performance of LLMs can be viewed as a distribution with unwarranted variance across different instantiations of the same question."
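The "distribution with unwarranted variance" claim can be made concrete with a small sketch: given correctness results for several instantiations of the same question set, compute each instantiation's accuracy and summarize the spread. The data and helper below are illustrative, not the paper's code.

```python
# Illustrative sketch: accuracy across template instantiations behaves
# like a distribution; summarize it with its mean and standard deviation.
import statistics


def accuracy_distribution(per_instantiation_correct: list[list[bool]]) -> tuple[float, float]:
    """per_instantiation_correct[i][j] records whether the model answered
    question j correctly on template instantiation i."""
    accuracies = [sum(run) / len(run) for run in per_instantiation_correct]
    return statistics.mean(accuracies), statistics.pstdev(accuracies)


# Hypothetical results for 4 instantiations of a 5-question set:
runs = [
    [True, True, False, True, True],
    [True, False, False, True, True],
    [True, True, True, True, False],
    [False, True, False, True, True],
]
mean_acc, std_acc = accuracy_distribution(runs)
```

Reporting the standard deviation alongside the mean is what distinguishes this view from the single-number GSM8K accuracies the paper questions.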
Researcher Affiliation: Collaboration. Evidence: "¹Apple ²Washington State University"
Pseudocode: No. Evidence: The paper describes its methods in narrative text and figures (e.g., Figure 1, which shows the template creation process), but it contains no clearly labeled pseudocode block or formal algorithm.
Open Source Code: Yes. Evidence: "GSM-Symbolic templates and generated data are available at: https://github.com/apple/ml-gsm-symbolic"
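The released artifacts are symbolic templates from which concrete question variants are generated. A minimal sketch of that idea follows; the template text, placeholder names, and value ranges are made up for illustration and are not the repository's actual schema.

```python
# Sketch of a symbolic question template: placeholders for a name and two
# numbers are filled with sampled values, and the ground-truth answer is
# recomputed for each instantiation.
import random

TEMPLATE = ("{name} has {x} apples and buys {y} more. "
            "How many apples does {name} have now?")


def instantiate(seed: int) -> tuple[str, int]:
    """Produce one concrete (question, answer) variant of the template."""
    rng = random.Random(seed)  # seeded for reproducible variants
    name = rng.choice(["Sophia", "Liam", "Ava"])
    x, y = rng.randint(2, 20), rng.randint(2, 20)
    question = TEMPLATE.format(name=name, x=x, y=y)
    answer = x + y  # ground truth follows from the sampled values
    return question, answer


q, a = instantiate(seed=0)
```

Because every instantiation shares the same underlying structure, accuracy differences across variants isolate sensitivity to surface changes rather than to problem difficulty.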
Open Datasets: Yes. Evidence: "The GSM8K (Grade School Math 8K) dataset (Cobbe et al., 2021) has emerged as a popular benchmark for evaluating the mathematical reasoning capabilities of LLMs."
Dataset Splits: Yes. Evidence: "The GSM8K dataset (Cobbe et al., 2021) includes over 8000 grade school math questions and answers, divided into 7473 training and 1319 test examples."
Hardware Specification: No. Evidence: The paper evaluates "25 state-of-the-art open and closed models" "on various setups", but it gives no details about the hardware used to run the evaluations or train the models.
Software Dependencies: No. Evidence: The paper mentions "Chain-of-Thought (CoT) prompting" and various language models (e.g., GPT-4o, Llama3), but it specifies no software libraries, frameworks, or version numbers needed to replicate the experimental environment.
Experiment Setup: Yes. Evidence: "Unless stated otherwise, we follow a common evaluation setup on GSM8K and other math benchmarks that includes Chain-of-Thought (CoT) prompting with 8 shots with greedy decoding. However, we note that in our preliminary experiments, the number of shots did not significantly change the performance and conclusions. We provide our prompt template in Fig. 9."
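The described setup can be sketched as follows. The exemplar questions and the `build_cot_prompt` helper are hypothetical stand-ins for the paper's actual Fig. 9 template, and greedy decoding (temperature 0 / no sampling) would be configured in the model call rather than in the prompt itself.

```python
# Sketch of assembling an 8-shot Chain-of-Thought prompt: each shot pairs a
# question with a worked solution that spells out the reasoning chain, and
# the target question is appended last with an open "A:" for the model.

def build_cot_prompt(shots: list[tuple[str, str]], question: str) -> str:
    """Join (question, worked_solution) exemplars, then the target question."""
    parts = [f"Q: {q}\nA: {sol}" for q, sol in shots]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Hypothetical exemplars standing in for the paper's 8 shots:
shots = [(f"Example question {i}?",
          f"Step-by-step reasoning... The answer is {i}.") for i in range(8)]
prompt = build_cot_prompt(
    shots, "Mia has 3 pens and buys 4 more. How many pens does she have?")
```

The paper's note that shot count barely changed results suggests the conclusions are not an artifact of this particular prompt format.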