MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs
Authors: Andreas Opedal, Haruki Shirakami, Bernhard Schölkopf, Abulhair Saparov, Mrinmaya Sachan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using MathGAP, we find that LLMs show a significant decrease in performance as proofs get deeper and wider. This effect is more pronounced in complex, nonlinear proof structures, which are challenging even for the most capable models. The models are also sensitive to simple changes in sentence ordering. However, they remain capable of solving some complex problems, suggesting that reasoning generalization is noisy. |
| Researcher Affiliation | Academia | Andreas Opedal¹˒², Haruki Shirakami¹˒³, Bernhard Schölkopf¹˒², Abulhair Saparov⁴, Mrinmaya Sachan¹ — ¹ETH Zürich, ²Max Planck Institute for Intelligent Systems, Tübingen, ³Idiap Research Institute, ⁴Purdue University |
| Pseudocode | No | The paper describes the 'GENERATION METHOD' in Section 4.1 and illustrates it with a diagram in Figure 1, but it does not contain a formal pseudocode block or algorithm. |
| Open Source Code | Yes | https://github.com/eth-lre/mathgap-experiments ... Code to generate problems with MathGAP, including problem types beyond those considered in this paper, can be found in our public code repository. |
| Open Datasets | No | In this paper, we present a data-generation framework for evaluating LLMs on problems with arbitrarily complex arithmetic proofs, called MathGAP. MathGAP generates problem statements and chain-of-thought reasoning traces according to specifications about their arithmetic proof structure... MathGAP mitigates data contamination by creating new, more complex synthetic test sets. |
| Dataset Splits | No | We generate multiple test sets of different degrees of complexity with 400 problems in each. We then generate model predictions for the problems in these test sets under four different prompts... For prompts (ii) through (iv) we generate a new set of in-context examples for every test problem. |
| Hardware Specification | No | The paper evaluates Mixtral-8x7B (Jiang et al., 2024a), Llama3-8B, Llama3-70B (Llama Team, 2024), GPT-3.5-Turbo, GPT-4o (OpenAI, 2024), and a few additional evaluations on o1-preview and DeepSeek-R1 (DeepSeek-AI, 2025), but does not specify the hardware used for these experiments. |
| Software Dependencies | No | The paper refers to specific LLM model IDs like 'gpt-3.5-turbo-0125' and 'gpt-4o-2024-05-13' but does not provide details on other software dependencies (e.g., programming languages, libraries, frameworks) with version numbers. |
| Experiment Setup | Yes | Responses are generated using greedy decoding and a maximum context length of 4,096 tokens. Model predictions are obtained by extracting the last number occurring in the model output. We set the number of examples to 12, except for the experiments on nonlinear problems for which we use 5... For GPT-3.5 Turbo and GPT-4o we used gpt-3.5-turbo-0125 and gpt-4o-2024-05-13 as the model ids for all experiments. |
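The answer-extraction rule quoted in the Experiment Setup row (taking the last number occurring in the model output as the prediction) can be sketched as follows. This is a minimal illustration, not the authors' actual parsing code; the function name and the decision to handle thousands separators are assumptions.

```python
import re

def extract_prediction(output: str):
    """Return the last number in a model's output, per the paper's
    stated rule. Minimal sketch; the authors' repository may parse
    outputs differently (assumed handling of commas and decimals)."""
    # Match optionally signed integers/decimals, allowing separators like 1,234
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", output)
    if not matches:
        return None  # no number found in the response
    return float(matches[-1].replace(",", ""))

print(extract_prediction("So Alice has 7 + 5 = 12 apples."))  # 12.0
```

Greedy decoding plus this last-number heuristic is a common evaluation setup for arithmetic word problems, since chain-of-thought responses typically end with the final answer.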