MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs
Authors: Andreas Opedal, Haruki Shirakami, Bernhard Schölkopf, Abulhair Saparov, Mrinmaya Sachan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using MathGAP, we find that LLMs show a significant decrease in performance as proofs get deeper and wider. This effect is more pronounced in complex, nonlinear proof structures, which are challenging even for the most capable models. The models are also sensitive to simple changes in sentence ordering. However, they remain capable of solving some complex problems, suggesting that reasoning generalization is noisy. |
| Researcher Affiliation | Academia | Andreas Opedal¹˒², Haruki Shirakami¹˒³, Bernhard Schölkopf¹˒², Abulhair Saparov⁴, Mrinmaya Sachan¹ — ¹ETH Zürich, ²Max Planck Institute for Intelligent Systems, Tübingen, ³Idiap Research Institute, ⁴Purdue University |
| Pseudocode | No | The paper describes the 'GENERATION METHOD' in Section 4.1 and illustrates it with a diagram in Figure 1, but it does not contain a formal pseudocode block or algorithm. |
| Open Source Code | Yes | https://github.com/eth-lre/mathgap-experiments ... Code to generate problems with MathGAP, including problem types beyond those considered in this paper, can be found in our public code repository. |
| Open Datasets | No | In this paper, we present a data-generation framework for evaluating LLMs on problems with arbitrarily complex arithmetic proofs, called MathGAP. MathGAP generates problem statements and chain-of-thought reasoning traces according to specifications about their arithmetic proof structure... MathGAP mitigates data contamination by creating new, more complex synthetic test sets. |
| Dataset Splits | No | We generate multiple test sets of different degrees of complexity with 400 problems in each. We then generate model predictions for the problems in these test sets under four different prompts... For prompts (ii) through (iv) we generate a new set of in-context examples for every test problem. |
| Hardware Specification | No | The paper evaluates Mixtral-8x7B (Jiang et al., 2024a), Llama3-8B, Llama3-70B (Llama Team, 2024), GPT-3.5-Turbo, GPT-4o (OpenAI, 2024), and a few additional evaluations on o1-preview and DeepSeek-R1 (DeepSeek-AI, 2025), but does not specify the hardware used for these experiments. |
| Software Dependencies | No | The paper refers to specific LLM model IDs like 'gpt-3.5-turbo-0125' and 'gpt-4o-2024-05-13' but does not provide details on other software dependencies (e.g., programming languages, libraries, frameworks) with version numbers. |
| Experiment Setup | Yes | Responses are generated using greedy decoding and a maximum context length of 4,096 tokens. Model predictions are obtained by extracting the last number occurring in the model output. We set the number of examples to 12, except for the experiments on nonlinear problems for which we use 5... For GPT-3.5 Turbo and GPT-4o we used gpt-3.5-turbo-0125 and gpt-4o-2024-05-13 as the model ids for all experiments. |
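The answer-extraction rule quoted in the Experiment Setup row (taking the last number occurring in the model output as the prediction) can be sketched as follows. This is a minimal illustration, not the authors' actual parsing code; the function name and the decision to handle thousands separators are assumptions.

```python
import re

def extract_prediction(output: str):
    """Return the last number in a model's output, per the paper's
    stated rule. Minimal sketch; the authors' repository may parse
    outputs differently (assumed handling of commas and decimals)."""
    # Match optionally signed integers/decimals, allowing separators like 1,234
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", output)
    if not matches:
        return None  # no number found in the response
    return float(matches[-1].replace(",", ""))

print(extract_prediction("So Alice has 7 + 5 = 12 apples."))  # 12.0
```

Greedy decoding plus this last-number heuristic is a common evaluation setup for arithmetic word problems, since chain-of-thought responses typically end with the final answer.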