Lean-STaR: Learning to Interleave Thinking and Proving

Authors: Haohan Lin, Zhiqing Sun, Sean Welleck, Yiming Yang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "... with Lean-STaR significantly outperforming base models (43.4% → 46.3%, Pass@64). We also analyze the impact of the augmented thoughts on various aspects of the theorem proving process, providing insights into their effectiveness. ... Our main results are reported in Table 1."
Researcher Affiliation | Academia | "1Language Technologies Institute, Carnegie Mellon University; 2Institute for Interdisciplinary Information Sciences, Tsinghua University"
Pseudocode | No | The paper describes methods in prose and with diagrams (Figure 3 is a pipeline diagram) but does not contain any explicit pseudocode blocks or algorithms.
Open Source Code | No | The paper mentions "https://leanstar.github.io/" but does not contain an explicit statement confirming the release of the code for the described methodology or a direct link to a code repository. The provided URL points to a project homepage rather than a specific code repository.
Open Datasets | Yes | "We instantiate Lean-STaR using the best available open language model pre-trained on the Lean corpus (InternLM2-Math-base-7b (Ying et al., 2024)), and follow standard practice in using Lean's Mathlib as the underlying training set (via the LeanDojo dataset (Yang et al., 2023)). We generate an initial set of thoughts for Mathlib using GPT-4, perform two rounds of expert iteration, then evaluate the model on miniF2F (Zheng et al., 2021) and LeanDojo (Yang et al., 2023), the de-facto standard benchmarks for evaluating language-model based theorem provers."
Dataset Splits | Yes | "We evaluate our method on the miniF2F benchmark (Zheng et al., 2021). We use a similar evaluation setting as previous works (Yang et al., 2023; Welleck & Saha, 2023; Ying et al., 2024), but use our sampling method instead of best-first search for the evaluation of our thought-augmented theorem proving model as discussed in (§3.3). ... We use a random subset of LeanDojo4-v9-test (novel premises) with a size of 320 as the test set of LeanDojo."
Hardware Specification | Yes | "We collected 32,231 different (proof state, thoughts, next-tactic) pairs in successful proof trajectories during expert iteration, which takes about 4 days with 8 A100 GPUs."
Software Dependencies | No | The paper mentions specific language models and theorem provers (e.g., Lean, GPT-4-0125, InternLM2-Math-base-7b) but does not provide a comprehensive list of other software components, such as programming languages, libraries, or operating systems, with the specific version numbers required for replication.
Experiment Setup | Yes | "We fine-tune for 1 epoch to obtain the SFT model. For the learning rate, we use a warmup in the first 20% of steps from 0 to 2 × 10^-5, followed by a cosine schedule decaying to zero. ... In each iteration we use sampling on the LeanDojo Benchmark 4 dataset, and save the (state, thought, tactic) examples that are part of successful proofs. For each problem, we sample K = 32 times in parallel with temperature T = 1.0, and limit the number of times a tactic can be generated to a total of N = 5 per problem. Also, sampling is limited to 1 minute per problem."
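
The learning-rate schedule quoted above (linear warmup over the first 20% of steps from 0 to 2 × 10^-5, then cosine decay to zero) can be sketched as follows. The function name and the exact interpolation details are assumptions for illustration, not taken from the paper's code.

```python
import math

def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_frac=0.2):
    """Illustrative warmup-then-cosine schedule matching the paper's description."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr over the first 20% of steps.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to zero over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))
```

For example, with 100 total steps, the rate rises linearly to 2e-5 at step 20, then decays along a half-cosine to 0 at step 100.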
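
The expert-iteration data collection described in the same row (sample proofs per problem, keep the (state, thought, tactic) triples from successful trajectories) can be sketched as below. The `sample_proof` callable is a hypothetical stand-in for the model plus Lean interaction, and the sequential loop approximates the paper's K = 32 parallel samples; none of these names come from the paper's code.

```python
def expert_iteration_round(problems, sample_proof, max_samples=32):
    """Collect (state, thought, tactic) triples from successful proof attempts.

    `sample_proof(problem)` is assumed to return a list of
    (state, thought, tactic) triples on success, or None on failure.
    """
    dataset = []
    for problem in problems:
        for _ in range(max_samples):
            trajectory = sample_proof(problem)
            if trajectory is not None:
                dataset.extend(trajectory)
                break  # keep the first successful proof for this problem
    return dataset
```

The collected triples would then serve as the fine-tuning set for the next iteration, as in standard expert iteration.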