GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
Authors: Iman Mirzadeh, Keivan Alizadeh-Vahid, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, Mehrdad Farajtabar
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our large-scale study on 25 state-of-the-art open and closed models provides significant insights into LLMs' behavior in mathematical reasoning tasks. We question the reliability of currently reported results on GSM8K and demonstrate that the performance of LLMs can be viewed as a distribution with unwarranted variance across different instantiations of the same question. |
| Researcher Affiliation | Collaboration | 1Apple 2Washington State University |
| Pseudocode | No | The paper describes methods and processes in narrative text and figures (like Figure 1 showing a template creation process), but it does not contain a clearly labeled section, figure, or block of pseudocode or a formal algorithm. |
| Open Source Code | Yes | GSM-Symbolic templates and generated data are available at: https://github.com/apple/ml-gsm-symbolic |
| Open Datasets | Yes | The GSM8K (Grade School Math 8K) dataset (Cobbe et al., 2021) has emerged as a popular benchmark for evaluating the mathematical reasoning capabilities of LLMs. |
| Dataset Splits | Yes | The GSM8K dataset (Cobbe et al., 2021) includes over 8000 grade school math questions and answers, divided into 7473 training and 1319 test examples. |
| Hardware Specification | No | The paper discusses evaluating '25 state-of-the-art open and closed models' and conducting evaluations 'on various setups', but it does not provide any specific details about the hardware used to run these evaluations or train the models. |
| Software Dependencies | No | The paper mentions using 'Chain-of-Thought (CoT) prompting' and various language models (e.g., GPT-4o, Llama3), but it does not specify any software libraries, frameworks, or their version numbers that would be necessary to replicate the experimental environment. |
| Experiment Setup | Yes | Unless stated otherwise, we follow a common evaluation setup on GSM8K and other math benchmarks that includes Chain-of-Thought (CoT) prompting with 8-shots with greedy decoding. However, we note that in our preliminary experiments, the number of shots did not significantly change the performance and conclusions. We provide our prompt template in Fig. 9. |
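The paper's core mechanism — generating many instantiations of the "same" GSM8K question from a symbolic template (its Figure 1) — can be sketched as follows. This is a minimal illustration, not the released templates: the question text, name list, value ranges, and `instantiate` helper are all assumptions.

```python
import random

# Illustrative GSM-Symbolic-style template (assumed, not from apple/ml-gsm-symbolic):
# placeholders for a name and two numeric values, with the ground-truth answer
# recomputed from the sampled values so every variant stays self-consistent.
TEMPLATE = (
    "{name} picks {x} apples in the morning and {y} apples in the afternoon. "
    "How many apples does {name} have in total?"
)
NAMES = ["Sophie", "Liam", "Ava"]  # assumed sample values

def instantiate(seed):
    """Generate one question/answer variant from the template."""
    rng = random.Random(seed)
    name = rng.choice(NAMES)
    x = rng.randint(2, 20)
    y = rng.randint(2, 20)
    question = TEMPLATE.format(name=name, x=x, y=y)
    answer = x + y  # answer follows from the sampled values
    return question, answer

# Each seed yields a distinct instantiation of the same underlying question,
# which is what lets the study measure performance variance across variants.
variants = [instantiate(s) for s in range(3)]
```

Scoring a model on many such variants, rather than on the single fixed GSM8K wording, is what exposes the "distribution with unwarranted variance" the review quotes above.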
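The evaluation setup in the last row (8-shot CoT prompting with greedy decoding) amounts to prepending eight worked examples to the target question. A minimal sketch, assuming a simple `Q:`/`A:` format — the exemplar content here is invented, and the paper's actual prompt template is given in its Fig. 9:

```python
# Assumed exemplar; a real 8-shot prompt would use eight distinct worked problems.
SHOTS = [
    ("If Tom has 3 pens and buys 4 more, how many pens does he have?",
     "Tom starts with 3 pens and buys 4 more, so 3 + 4 = 7. The answer is 7."),
] * 8

def build_cot_prompt(shots, question):
    """Concatenate few-shot worked examples, then the unanswered target question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in shots]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

prompt = build_cot_prompt(SHOTS, "A farmer has 12 eggs and sells 5. How many remain?")
# Greedy decoding then means picking the argmax token at each step,
# e.g. `model.generate(..., do_sample=False)` in Hugging Face transformers.
```

Since the paper reports that the number of shots did not significantly change its conclusions, the same assembly works with a shorter exemplar list.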