RE-IMAGINE: Symbolic Benchmark Synthesis for Reasoning Evaluation
Authors: Xinnuo Xu, Rachel Lawrence, Kshitij Dubey, Atharva Pandey, Risa Ueno, Fabian Falck, Aditya V. Nori, Rahul Sharma, Amit Sharma, Javier Gonzalez
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the type of insights that RE-IMAGINE can generate on four widely-used benchmarks, which we use to evaluate reasoning on several families of LLMs. We observe reductions in performance when the models are queried with problem variations. ... In Sections 4-5, we use RE-IMAGINE to re-analyze the reasoning abilities of all models in the GPT (Brown et al., 2020), Llama (Touvron et al., 2023), and Phi families (Kambhampati et al., 2024). We focus on four reasoning benchmarks: GSM8K for math (Cobbe et al., 2021), CLadder for causality (Jin et al., 2023a), and CRUXEval (Gu et al., 2024) and Loop (Kamath et al., 2024). We show consistent decline in LLM performance as reasoning complexity increases across all evaluated benchmarks. |
| Researcher Affiliation | Industry | 1Microsoft Research Cambridge, UK 2Microsoft Research India, India. |
| Pseudocode | No | The paper describes a benchmark synthesis pipeline that converts questions into symbolic (code) representation, applies mutations, and translates back to natural language. Figure 1 shows examples of Python code snippets representing problems. However, there are no explicit blocks labeled 'Pseudocode' or 'Algorithm' that describe the methodology itself in a structured, code-like format. |
| Open Source Code | No | The paper does not explicitly state that source code for the RE-IMAGINE framework or the experiments is being released, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We focus on four reasoning benchmarks: GSM8K for math (Cobbe et al., 2021), CLadder for causality (Jin et al., 2023a), and CRUXEval (Gu et al., 2024) and Loop (Kamath et al., 2024). |
| Dataset Splits | No | For GSM8K, the paper mentions: "We construct our test set by filtering out examples where the Python solution execution does not match the ground-truth answers." For Bi-Counter Factual, it states: "we used 50 questions from the validation set of the benchmark". For CLadder: "We begin with a filtering pass of the 10,000 examples in CLadder, removing all examples corresponding to query types... This leaves 8,365 examples from CLadder after filtering." For Loop: "We evaluate on 250 tasks from Kamath et al. (2024)". These passages describe the evaluation data, but specific, reproducible train/validation/test splits (e.g., percentages or exact counts per split for every benchmark) are not provided, so the data partitioning cannot be fully reproduced. |
| Hardware Specification | No | The paper mentions evaluating models from the GPT, Llama, and Phi families. However, it does not specify any hardware details such as GPU models, CPU types, or other computing resources used to run the experiments. |
| Software Dependencies | No | The paper mentions using Python for code snippets and Frama-C with SMT solvers (Z3, alt-ergo, CVC4). However, it does not provide specific version numbers for Python, Frama-C, or any of the SMT solvers mentioned, which are necessary for reproducibility. |
| Experiment Setup | Yes | With 8-shot in-context examples, most of the models achieve high accuracy ( 95%) on the Level-1 raw test set. ... During testing, all models are provided with 8 in-context examples with Chain-of-Thought (CoT) prompting to help them understand the task (prompt shown in Figure 18 in Appendix E). |
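The pipeline the checklist refers to (convert a question into a symbolic Python representation, mutate the program, re-execute to obtain the new ground-truth answer) can be sketched as below. This is an illustrative reconstruction, not the authors' released code; the example problem, the `mutate_constant` helper, and all identifiers are hypothetical, and only one simple mutation type (changing a numeric value) is shown.

```python
# Illustrative sketch of benchmark mutation via symbolic (code) representation:
# a question is held as an executable Python solution, a mutation edits the
# program's AST, and re-executing the program yields the new gold answer.
import ast

# Symbolic form of a GSM8K-style question:
# "Ann has 3 apples and buys 4 more. How many apples does she have?"
SOLUTION = """
def solve():
    apples_start = 3
    apples_bought = 4
    return apples_start + apples_bought
"""

def mutate_constant(src: str, old: int, new: int) -> str:
    """Replace the first integer literal equal to `old` with `new`."""
    tree = ast.parse(src)
    for node in ast.walk(tree):
        if isinstance(node, ast.Constant) and node.value == old:
            node.value = new
            break
    return ast.unparse(tree)

def execute(src: str):
    """Run the (possibly mutated) program to obtain its ground-truth answer."""
    env: dict = {}
    exec(src, env)
    return env["solve"]()

mutated = mutate_constant(SOLUTION, old=4, new=10)
print(execute(SOLUTION), execute(mutated))  # 7 13
```

Because the mutated program is executed rather than re-solved by hand, every generated variation comes with a verified answer for free; the remaining step in the paper's pipeline, translating the mutated program back into natural language, is not shown here.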