GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Authors: Iman Mirzadeh, Keivan Alizadeh-Vahid, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, Mehrdad Farajtabar

ICLR 2025

Reproducibility variables, results, and supporting excerpts:
Research Type: Experimental. Evidence: "Our large-scale study on 25 state-of-the-art open and closed models provides significant insights into LLMs' behavior in mathematical reasoning tasks. We question the reliability of currently reported results on GSM8K and demonstrate that the performance of LLMs can be viewed as a distribution with unwarranted variance across different instantiations of the same question."
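The "distribution with unwarranted variance" claim can be made concrete with a small sketch: given correctness results for several instantiations of the same question set, compute each instantiation's accuracy and summarize the spread. The data and helper below are illustrative, not the paper's code.

```python
# Illustrative sketch: accuracy across template instantiations behaves
# like a distribution; summarize it with its mean and standard deviation.
import statistics


def accuracy_distribution(per_instantiation_correct: list[list[bool]]) -> tuple[float, float]:
    """per_instantiation_correct[i][j] records whether the model answered
    question j correctly on template instantiation i."""
    accuracies = [sum(run) / len(run) for run in per_instantiation_correct]
    return statistics.mean(accuracies), statistics.pstdev(accuracies)


# Hypothetical results for 4 instantiations of a 5-question set:
runs = [
    [True, True, False, True, True],
    [True, False, False, True, True],
    [True, True, True, True, False],
    [False, True, False, True, True],
]
mean_acc, std_acc = accuracy_distribution(runs)
```

Reporting the standard deviation alongside the mean is what distinguishes this view from the single-number GSM8K accuracies the paper questions.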
Researcher Affiliation: Collaboration. Evidence: "¹Apple ²Washington State University"
Pseudocode: No. Evidence: The paper describes its methods in narrative text and figures (e.g., Figure 1, which shows the template creation process), but it contains no clearly labeled pseudocode block or formal algorithm.
Open Source Code: Yes. Evidence: "GSM-Symbolic templates and generated data are available at: https://github.com/apple/ml-gsm-symbolic"
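The released artifacts are symbolic templates from which concrete question variants are generated. A minimal sketch of that idea follows; the template text, placeholder names, and value ranges are made up for illustration and are not the repository's actual schema.

```python
# Sketch of a symbolic question template: placeholders for a name and two
# numbers are filled with sampled values, and the ground-truth answer is
# recomputed for each instantiation.
import random

TEMPLATE = ("{name} has {x} apples and buys {y} more. "
            "How many apples does {name} have now?")


def instantiate(seed: int) -> tuple[str, int]:
    """Produce one concrete (question, answer) variant of the template."""
    rng = random.Random(seed)  # seeded for reproducible variants
    name = rng.choice(["Sophia", "Liam", "Ava"])
    x, y = rng.randint(2, 20), rng.randint(2, 20)
    question = TEMPLATE.format(name=name, x=x, y=y)
    answer = x + y  # ground truth follows from the sampled values
    return question, answer


q, a = instantiate(seed=0)
```

Because every instantiation shares the same underlying structure, accuracy differences across variants isolate sensitivity to surface changes rather than to problem difficulty.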
Open Datasets: Yes. Evidence: "The GSM8K (Grade School Math 8K) dataset (Cobbe et al., 2021) has emerged as a popular benchmark for evaluating the mathematical reasoning capabilities of LLMs."
Dataset Splits: Yes. Evidence: "The GSM8K dataset (Cobbe et al., 2021) includes over 8000 grade school math questions and answers, divided into 7473 training and 1319 test examples."
Hardware Specification: No. Evidence: The paper evaluates "25 state-of-the-art open and closed models" "on various setups", but it gives no details about the hardware used to run the evaluations or train the models.
Software Dependencies: No. Evidence: The paper mentions "Chain-of-Thought (CoT) prompting" and various language models (e.g., GPT-4o, Llama3), but it specifies no software libraries, frameworks, or version numbers needed to replicate the experimental environment.
Experiment Setup: Yes. Evidence: "Unless stated otherwise, we follow a common evaluation setup on GSM8K and other math benchmarks that includes Chain-of-Thought (CoT) prompting with 8 shots with greedy decoding. However, we note that in our preliminary experiments, the number of shots did not significantly change the performance and conclusions. We provide our prompt template in Fig. 9."
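The described setup can be sketched as follows. The exemplar questions and the `build_cot_prompt` helper are hypothetical stand-ins for the paper's actual Fig. 9 template, and greedy decoding (temperature 0 / no sampling) would be configured in the model call rather than in the prompt itself.

```python
# Sketch of assembling an 8-shot Chain-of-Thought prompt: each shot pairs a
# question with a worked solution that spells out the reasoning chain, and
# the target question is appended last with an open "A:" for the model.

def build_cot_prompt(shots: list[tuple[str, str]], question: str) -> str:
    """Join (question, worked_solution) exemplars, then the target question."""
    parts = [f"Q: {q}\nA: {sol}" for q, sol in shots]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Hypothetical exemplars standing in for the paper's 8 shots:
shots = [(f"Example question {i}?",
          f"Step-by-step reasoning... The answer is {i}.") for i in range(8)]
prompt = build_cot_prompt(
    shots, "Mia has 3 pens and buys 4 more. How many pens does she have?")
```

The paper's note that shot count barely changed results suggests the conclusions are not an artifact of this particular prompt format.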