LeanAgent: Lifelong Learning for Formal Theorem Proving
Authors: Adarsh Kumarappan, Mohit Tiwari, Peiyang Song, Robert Joseph George, Chaowei Xiao, Anima Anandkumar
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across 23 diverse Lean repositories demonstrate LeanAgent's advancements in lifelong learning for theorem proving. LeanAgent successfully generates formal proofs for 155 theorems across these 23 repositories where formal proofs were previously missing, known as sorry theorems, many from advanced mathematics. It performs significantly better than the static ReProver baseline in terms of proving new sorry theorems. We conduct an extensive ablation study using six lifelong metrics carefully proposed or selected from the literature. |
| Researcher Affiliation | Academia | ¹California Institute of Technology, ²Stanford University, ³University of Wisconsin–Madison. EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes its methodology in natural language and refers to a figure (Figure 1) for an overview, but it does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Please see Appendix A.2 for experiment implementation details. We release LeanAgent at https://github.com/lean-dojo/LeanAgent. |
| Open Datasets | Yes | Although mathlib4 contains over 100,000 formalized mathematical theorems and definitions... We evaluate our approach on a diverse set of 23 Lean repositories to assess its generalizability across different mathematical domains (Škrivan, 2024; Kontorovich, 2024; Avigad, 2024; Tao et al., 2024; Renshaw, 2024; Fermat's Last Theorem, 2024; DeepMind, 2024; Carneiro, 2024; Wieser, 2024; Mizuno, 2024; Murphy, 2024; Formal Logic, 2024; Con-nf, 2024; Gadgil, 2024; Yang, 2024; Zeta 3 Irrational, 2024; Firsching, Moritz, 2024; Monnerjahn, 2024; van Doorn, 2024; Dillies, 2024; Hairy Ball Theorem, 2024; Coxeter, 2024; Gattinger, 2024). |
| Dataset Splits | Yes | To generate the set of theorems and proofs, the default option is to simply use the theorems, proofs, premise files, and traced files from the current curriculum repository in the database. Specifically, we use the random split from LeanDojo to create training, validation, and testing sets. We refrain from using the novel split from LeanDojo, as we would like LeanAgent to learn as much as possible from a repository to perform well on its hardest theorems. The data in the splits include details about the proofs of theorems, including the URL and commit of the source repository, the file path of the theorem, the full name of the theorem, the theorem statement, the start and end positions in the source file, and a list of traced tactics with annotations. The validation and test splits each contain 2% of the total theorems and proofs, following the methodology from LeanDojo. |
| Hardware Specification | Yes | We use four NVIDIA A100 GPUs with 80GB of memory each for progressive training. LeanAgent uses a distributed architecture leveraging PyTorch Lightning and Ray for parallel processing. |
| Software Dependencies | No | The paper mentions using "PyTorch Lightning and Ray for parallel processing" and optimizing with "AdamW", but it does not specify version numbers for these software components, which is necessary for reproducibility. |
| Experiment Setup | Yes | We use bfloat16 mixed precision and optimize with AdamW (Loshchilov & Hutter, 2017) with an effective batch size of 16 (achieved through a batch size of 4 with gradient accumulation over 4 steps). In the first 1,000 steps, the learning rate warms up linearly from 0 to the maximum value of 10^-3. Then it decays to 0 using a cosine schedule. In addition, we apply gradient clipping with a value of 1.0. Just as ReProver does during training, we sample 3 negative premises per example, including 1 in-file negative premise. The maximum sequence length for the retriever is set to 1024 tokens. The maximum sequence length for the generator is set to 512 tokens for input and 128 tokens for output. The prover uses a best-first search strategy with no limit on the maximum number of expansions of the search tree. It generates 64 tactic candidates and retrieves 100 premises for each proof state. LeanAgent uses ReProver's tactic generator for the experiments. We generate tactics with a beam search of size 5. We used 4 CPU workers, 1 per GPU. ... or reaches the time limit of 10 minutes. |
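The dataset-split row above describes a LeanDojo-style random split with 2% validation and 2% test. A minimal sketch of such a split is below; the function name, seed handling, and `max(1, ...)` floor are assumptions for illustration, not the actual LeanDojo implementation.

```python
import random

def random_split(theorems, val_frac=0.02, test_frac=0.02, seed=0):
    """Shuffle theorems and carve off 2% validation and 2% test sets,
    leaving the remainder for training (hypothetical helper)."""
    items = list(theorems)
    random.Random(seed).shuffle(items)  # deterministic shuffle for reproducibility
    n = len(items)
    n_val = max(1, int(n * val_frac))
    n_test = max(1, int(n * test_frac))
    return {
        "val": items[:n_val],
        "test": items[n_val:n_val + n_test],
        "train": items[n_val + n_test:],
    }

splits = random_split(range(100))
```

For 100 theorems this yields 2 validation, 2 test, and 96 training examples, matching the 2%/2% proportions quoted from the paper.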
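The experiment-setup row describes the learning-rate schedule: a linear warmup from 0 to 10^-3 over the first 1,000 steps, followed by cosine decay to 0. A minimal sketch of that schedule is below; `total_steps` is a hypothetical value, since the paper excerpt does not state the total training length.

```python
import math

def lr_at_step(step, warmup_steps=1000, total_steps=10000, max_lr=1e-3):
    """Linear warmup from 0 to max_lr over warmup_steps, then cosine
    decay to 0 by total_steps (total_steps is an assumed placeholder)."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

At step 0 the rate is 0, at step 1,000 it peaks at 10^-3, and it reaches 0 again at `total_steps`, consistent with the quoted warmup-then-cosine description.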