ImProver: Agent-Based Automated Proof Optimization

Authors: Riyaz Ahuja, Jeremy Avigad, Prasad Tetali, Sean Welleck

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We test ImProver on rewriting real-world undergraduate, competition, and research-level mathematics theorems, finding that ImProver is capable of rewriting proofs so that they are substantially shorter and more declarative in structure. [...] Our experimentation is split into three distinct stages. We first perform ablation testing on the ImProver model parameters (Section 3.2) to ensure that ImProver's parameter specification is the optimal one with respect to correctness and metric optimization score. We then evaluate this optimal parameter combination on datasets of varying complexity and analyze the performance and results thereof."
Researcher Affiliation | Academia | Riyaz Ahuja, Jeremy Avigad, Prasad Tetali, Sean Welleck; Carnegie Mellon University; EMAIL
Pseudocode | No | The paper describes the methods, including Chain-of-States prompting, output formatting, sampling methods, and retrieval, in narrative text and figures, but does not present any formal pseudocode or algorithm blocks.
Open Source Code | Yes | "Code is available at https://github.com/riyazahuja/ImProver."
Open Datasets | Yes | "We evaluate ImProver on subsets of the Mathematics in Lean (MIL) (leanprover-community, 2024), Compfiles (David Renshaw, 2024), and Mathlib (mathlib Community, 2020) datasets. Additionally, ablations are performed on a subset of MIL, and theorem proving is benchmarked on various subsets of MIL as well as the MiniF2F (Zheng et al., 2022) dataset."
Dataset Splits | No | The paper mentions evaluating on "subsets" and "representative samples" of datasets for experiments and ablations, and describes specific exercises for NTP evaluation (e.g., "12 exercises in group theory (Chapter 8; denoted MIL-C08, done at 15 samples), 11 exercises in set theory (Chapter 4; denoted MIL-C04, done at 15 samples)"). However, it does not provide explicit training/validation/test splits (percentages, counts, or methodology) that would be needed to reproduce the data partitioning for model development or evaluation across all experiments.
Hardware Specification | No | The paper mentions using "GPT-4o (OpenAI et al., 2024) (gpt-4o-2024-08-06)" and "GPT-4o-mini" as base generators, which are language models. However, it does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) on which these models were run or on which the experiments were conducted.
Software Dependencies | No | The paper mentions "Lean 4" as the formal language used, and "GPT-4o" and "GPT-4o-mini" as the language models. It also refers to "Lean's Mathlib" and "Lean's rich metaprogramming suite". However, it does not provide specific version numbers for Lean, Mathlib, or any other critical software libraries or dependencies that would be needed for replication.
Experiment Setup | Yes | "Using a temperature value of 1, we ensure that our n calls to the model are diverse, as the temperature hyperparameter (ranging between 0 and 2) controls the randomness of the outputs. [...] The refinement process relies on user-defined parameters n and prev_num to specify the number of iterations and the number of previous iterations' data to forward, respectively. Each iteration carries information on the last prev_num iterations, including input, output, metric score, correctness, and error messages. [...] We test n = 3, 5, 7, 10, 15 on GPT-4o and GPT-4o-mini, as well as n = 20 for GPT-4o-mini (as it is of a far lower token cost)."
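The refinement loop quoted above (n iterations, each forwarding the last prev_num iterations' input, output, metric score, correctness, and error messages) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the function names, the callback interface for the model, the Lean checker, and the metric are all assumptions introduced here.

```python
from collections import deque

def refine(theorem, call_model, check_proof, metric,
           n=5, prev_num=2, temperature=1.0):
    """Illustrative sketch of an ImProver-style refinement loop.

    call_model(state, context, temperature) -> candidate proof string
    check_proof(candidate) -> (correct: bool, errors: list)
    metric(candidate) -> score to maximize (e.g., negative proof length)

    All three callbacks are hypothetical stand-ins for the real
    LLM call, Lean compiler check, and metric computation.
    """
    # Sliding window: only the last prev_num iterations are forwarded.
    history = deque(maxlen=prev_num)
    best = None
    state = theorem
    for _ in range(n):
        candidate = call_model(state, list(history), temperature)
        correct, errors = check_proof(candidate)
        score = metric(candidate) if correct else None
        history.append({
            "input": state, "output": candidate,
            "score": score, "correct": correct, "errors": errors,
        })
        # Keep the best correct candidate seen so far.
        if correct and (best is None or score > best[1]):
            best = (candidate, score)
        state = candidate
    return best
```

With stub callbacks, the loop returns the highest-scoring correct candidate; the deque's `maxlen` is what enforces the prev_num history window.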