RE-IMAGINE: Symbolic Benchmark Synthesis for Reasoning Evaluation
Authors: Xinnuo Xu, Rachel Lawrence, Kshitij Dubey, Atharva Pandey, Risa Ueno, Fabian Falck, Aditya V. Nori, Rahul Sharma, Amit Sharma, Javier Gonzalez
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the type of insights that RE-IMAGINE can generate on four widely-used benchmarks, which we use to evaluate reasoning on several families of LLMs. We observe reductions in performance when the models are queried with problem variations. ... In Sections 4-5, we use RE-IMAGINE to re-analyze the reasoning abilities of all models in the GPT (Brown et al., 2020), Llama (Touvron et al., 2023), and Phi families (Kambhampati et al., 2024). We focus on four reasoning benchmarks: GSM8K for math (Cobbe et al., 2021), CLadder for causality (Jin et al., 2023a), and CRUXEval (Gu et al., 2024) and Loop (Kamath et al., 2024). We show consistent decline in LLM performance as reasoning complexity increases across all evaluated benchmarks. |
| Researcher Affiliation | Industry | 1Microsoft Research Cambridge, UK 2Microsoft Research India, India. |
| Pseudocode | No | The paper describes a benchmark synthesis pipeline that converts questions into symbolic (code) representation, applies mutations, and translates back to natural language. Figure 1 shows examples of Python code snippets representing problems. However, there are no explicit blocks labeled 'Pseudocode' or 'Algorithm' that describe the methodology itself in a structured, code-like format. |
| Open Source Code | No | The paper does not explicitly state that source code for the RE-IMAGINE framework or the experiments is being released, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We focus on four reasoning benchmarks: GSM8K for math (Cobbe et al., 2021), CLadder for causality (Jin et al., 2023a), and CRUXEval (Gu et al., 2024) and Loop (Kamath et al., 2024). |
| Dataset Splits | No | For GSM8K, the paper mentions: "We construct our test set by filtering out examples where the Python solution execution does not match the ground-truth answers." For Bi-Counter Factual, it states: "we used 50 questions from the validation set of the benchmark". For CLadder: "We begin with a filtering pass of the 10,000 examples in CLadder, removing all examples corresponding to query types... This leaves 8,365 examples from CLadder after filtering." For Loop: "We evaluate on 250 tasks from Kamath et al. (2024)". These passages describe the evaluation data, but specific, reproducible train/validation/test splits (e.g., percentages or exact counts per split for every benchmark) are not provided, so the data partitioning cannot be fully reproduced. |
| Hardware Specification | No | The paper mentions evaluating models from the GPT, Llama, and Phi families. However, it does not specify any hardware details such as GPU models, CPU types, or other computing resources used to run the experiments. |
| Software Dependencies | No | The paper mentions using Python for code snippets and Frama-C with SMT solvers (Z3, alt-ergo, CVC4). However, it does not provide specific version numbers for Python, Frama-C, or any of the SMT solvers mentioned, which are necessary for reproducibility. |
| Experiment Setup | Yes | With 8-shot in-context examples, most of the models achieve high accuracy ( 95%) on the Level-1 raw test set. ... During testing, all models are provided with 8 in-context examples with Chain-of-Thought (CoT) prompting to help them understand the task (prompt shown in Figure 18 in Appendix E). |
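The pipeline the checklist refers to (convert a question into a symbolic Python representation, mutate the program, re-execute to obtain the new ground-truth answer) can be sketched as below. This is an illustrative reconstruction, not the authors' released code; the example problem, the `mutate_constant` helper, and all identifiers are hypothetical, and only one simple mutation type (changing a numeric value) is shown.

```python
# Illustrative sketch of benchmark mutation via symbolic (code) representation:
# a question is held as an executable Python solution, a mutation edits the
# program's AST, and re-executing the program yields the new gold answer.
import ast

# Symbolic form of a GSM8K-style question:
# "Ann has 3 apples and buys 4 more. How many apples does she have?"
SOLUTION = """
def solve():
    apples_start = 3
    apples_bought = 4
    return apples_start + apples_bought
"""

def mutate_constant(src: str, old: int, new: int) -> str:
    """Replace the first integer literal equal to `old` with `new`."""
    tree = ast.parse(src)
    for node in ast.walk(tree):
        if isinstance(node, ast.Constant) and node.value == old:
            node.value = new
            break
    return ast.unparse(tree)

def execute(src: str):
    """Run the (possibly mutated) program to obtain its ground-truth answer."""
    env: dict = {}
    exec(src, env)
    return env["solve"]()

mutated = mutate_constant(SOLUTION, old=4, new=10)
print(execute(SOLUTION), execute(mutated))  # 7 13
```

Because the mutated program is executed rather than re-solved by hand, every generated variation comes with a verified answer for free; the remaining step in the paper's pipeline, translating the mutated program back into natural language, is not shown here.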