ImProver: Agent-Based Automated Proof Optimization

Authors: Riyaz Ahuja, Jeremy Avigad, Prasad Tetali, Sean Welleck

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We test ImProver on rewriting real-world undergraduate, competition, and research-level mathematics theorems, finding that ImProver is capable of rewriting proofs so that they are substantially shorter and more declarative in structure. [...] Our experimentation is split into three distinct stages. We first perform ablation testing on the ImProver model parameters (Section 3.2) to ensure that ImProver's parameter specification is the optimal one with respect to correctness and metric optimization score. We then evaluate this optimal parameter combination on datasets of varying complexity and analyze the performance and results thereof."
Researcher Affiliation | Academia | Riyaz Ahuja, Jeremy Avigad, Prasad Tetali, Sean Welleck; Carnegie Mellon University; EMAIL
Pseudocode | No | The paper describes the methods, including Chain-of-States prompting, output formatting, sampling methods, and retrieval, in narrative text and figures, but does not present any formal pseudocode or algorithm blocks.
Open Source Code | Yes | "Code is available at https://github.com/riyazahuja/ImProver."
Open Datasets | Yes | "We evaluate ImProver on subsets of the Mathematics in Lean (MIL) (leanprover-community, 2024), Compfiles (David Renshaw, 2024), and Mathlib (mathlib Community, 2020) datasets. Additionally, ablations are performed on a subset of MIL, and theorem proving is benchmarked on various subsets of MIL as well as the MiniF2F (Zheng et al., 2022) dataset."
Dataset Splits | No | The paper mentions evaluating on "subsets" and "representative samples" of datasets for experiments and ablations, and describes specific exercises for NTP evaluation (e.g., "12 exercises in group theory (Chapter 8; denoted MIL-C08, done at 15 samples), 11 exercises in set theory (Chapter 4; denoted MIL-C04, done at 15 samples)"). However, it does not provide explicit training/validation/test splits (percentages, counts, or methodology) that would be needed to reproduce the data partitioning for model development or evaluation across all experiments.
Hardware Specification | No | The paper mentions using "GPT-4o (OpenAI et al., 2024) (gpt-4o-2024-08-06)" and "GPT-4o-mini" as base generators, which are language models. However, it does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) on which these models were run or on which the experiments were conducted.
Software Dependencies | No | The paper mentions "Lean 4" as the formal language used, and "GPT-4o" and "GPT-4o-mini" as the language models. It also refers to "Lean's Mathlib" and "Lean's rich metaprogramming suite". However, it does not provide specific version numbers for Lean, Mathlib, or any other critical software libraries or dependencies that would be needed for replication.
Experiment Setup | Yes | "Using a temperature value of 1, we ensure that our n calls to the model are diverse, as the temperature hyperparameter (ranging between 0 and 2) controls the randomness of the outputs. [...] The refinement process relies on user-defined parameters n and prev_num to specify the number of iterations and the number of previous iterations' data to forward, respectively. Each iteration carries information on the last prev_num iterations, including input, output, metric score, correctness, and error messages. [...] We test n = 3, 5, 7, 10, 15 on GPT-4o and GPT-4o-mini, as well as n = 20 for GPT-4o-mini (as it is of a far lower token cost)."
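The refinement loop quoted above (n iterations, each forwarding the last prev_num iterations' input, output, metric score, correctness, and error messages) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the function names, the callback interface for the model, the Lean checker, and the metric are all assumptions introduced here.

```python
from collections import deque

def refine(theorem, call_model, check_proof, metric,
           n=5, prev_num=2, temperature=1.0):
    """Illustrative sketch of an ImProver-style refinement loop.

    call_model(state, context, temperature) -> candidate proof string
    check_proof(candidate) -> (correct: bool, errors: list)
    metric(candidate) -> score to maximize (e.g., negative proof length)

    All three callbacks are hypothetical stand-ins for the real
    LLM call, Lean compiler check, and metric computation.
    """
    # Sliding window: only the last prev_num iterations are forwarded.
    history = deque(maxlen=prev_num)
    best = None
    state = theorem
    for _ in range(n):
        candidate = call_model(state, list(history), temperature)
        correct, errors = check_proof(candidate)
        score = metric(candidate) if correct else None
        history.append({
            "input": state, "output": candidate,
            "score": score, "correct": correct, "errors": errors,
        })
        # Keep the best correct candidate seen so far.
        if correct and (best is None or score > best[1]):
            best = (candidate, score)
        state = candidate
    return best
```

With stub callbacks, the loop returns the highest-scoring correct candidate; the deque's `maxlen` is what enforces the prev_num history window.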