ImProver: Agent-Based Automated Proof Optimization
Authors: Riyaz Ahuja, Jeremy Avigad, Prasad Tetali, Sean Welleck
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test ImProver on rewriting real-world undergraduate, competition, and research-level mathematics theorems, finding that ImProver is capable of rewriting proofs so that they are substantially shorter and more declarative in structure. [...] Our experimentation is split into three distinct stages. We first perform ablation testing on the ImProver model parameters (§3.2) to ensure that ImProver's parameter specification is the optimal one with respect to correctness and metric optimization score. We then evaluate this optimal parameter combination on datasets of varying complexity and analyze the performance and results thereof. |
| Researcher Affiliation | Academia | Riyaz Ahuja, Jeremy Avigad, Prasad Tetali, Sean Welleck (Carnegie Mellon University) |
| Pseudocode | No | The paper describes the methods, including Chain-of-States prompting, output formatting, sampling methods, and retrieval, in narrative text and figures, but does not present any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/riyazahuja/ImProver. |
| Open Datasets | Yes | We evaluate ImProver on subsets of the Mathematics in Lean (MIL) (leanprover-community, 2024), Compfiles (David Renshaw, 2024), and Mathlib (mathlib Community, 2020) datasets. Additionally, ablations are performed on a subset of MIL, and theorem proving is benchmarked on various subsets of MIL as well as the MiniF2F (Zheng et al., 2022) dataset. |
| Dataset Splits | No | The paper mentions evaluating on 'subsets' and 'representative samples' of datasets for experiments and ablations, and describes specific exercises for NTP evaluation (e.g., '12 exercises in group theory (Chapter 8; denoted MIL-C08, done at 15 samples), 11 exercises in set theory (Chapter 4; denoted MIL-C04, done at 15 samples)'). However, it does not provide explicit training/validation/test splits (percentages, counts, or methodology) that would be needed to reproduce the data partitioning for model development or evaluation across all experiments. |
| Hardware Specification | No | The paper mentions using 'GPT-4o (OpenAI et al., 2024) (gpt-4o-2024-08-06)' and 'GPT-4o-mini' as base generators, which are language models. However, it does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) on which these models were run or their experiments were conducted. |
| Software Dependencies | No | The paper mentions 'Lean 4' as the formal language used and 'GPT-4o' and 'GPT-4o-mini' as the language models. It also refers to 'Lean's Mathlib' and 'Lean's rich metaprogramming suite'. However, it does not provide specific version numbers for Lean, Mathlib, or any other critical software libraries or dependencies that would be needed for replication. |
| Experiment Setup | Yes | Using a temperature value of 1, we ensure that our n calls to the model are diverse, as the temperature hyperparameter (ranging between 0 and 2) controls the randomness of the outputs. [...] The refinement process relies on user-defined parameters n and prev_num to specify the number of iterations and the number of previous iterations' data to forward, respectively. Each iteration carries information on the last prev_num iterations, including input, output, metric score, correctness, and error messages. [...] We test n = 3, 5, 7, 10, 15 on GPT-4o and GPT-4o-mini, as well as n = 20 for GPT-4o-mini (as it is of a far lower token cost). |
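The refinement loop quoted above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`generate`, `check`, `score`) and the history-record fields are assumptions chosen to mirror the description of n iterations, each carrying the last prev_num iterations' input, output, metric score, correctness, and error messages.

```python
from collections import deque

def refine(theorem, generate, check, score, n=5, prev_num=2):
    """Hypothetical sketch of an ImProver-style refinement loop.

    Assumed callables (not from the paper's codebase):
      generate(theorem, history) -> candidate proof string
      check(proof)               -> (correct: bool, errors: list[str])
      score(proof)               -> float, higher = better metric value
    """
    # Only the last prev_num iterations' records are forwarded to the model.
    history = deque(maxlen=prev_num)
    best, best_score = None, float("-inf")
    for _ in range(n):
        candidate = generate(theorem, list(history))
        correct, errors = check(candidate)
        s = score(candidate) if correct else float("-inf")
        # Each record carries input, output, metric score, correctness, errors.
        history.append({"input": theorem, "output": candidate,
                        "score": s, "correct": correct, "errors": errors})
        if correct and s > best_score:
            best, best_score = candidate, s
    return best, best_score
```

In this sketch the `deque(maxlen=prev_num)` implements the forwarding of only the most recent iterations' data; diversity across the n calls would come from sampling the model at temperature 1, which is outside the scope of these stubs.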