Lean-STaR: Learning to Interleave Thinking and Proving
Authors: Haohan Lin, Zhiqing Sun, Sean Welleck, Yiming Yang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | ...Lean-STaR significantly outperforming base models (43.4% → 46.3%, Pass@64). We also analyze the impact of the augmented thoughts on various aspects of the theorem proving process, providing insights into their effectiveness. ... Our main results are reported in Table 1. |
| Researcher Affiliation | Academia | ¹Language Technologies Institute, Carnegie Mellon University; ²Institute for Interdisciplinary Information Sciences, Tsinghua University |
| Pseudocode | No | The paper describes methods in prose and with diagrams (Figure 3 is a pipeline diagram) but does not contain any explicit pseudocode blocks or algorithms. |
| Open Source Code | No | The paper mentions "https://leanstar.github.io/" but does not contain an explicit statement confirming the release of the code for the described methodology or a direct link to a code repository. The provided URL points to a project homepage rather than a specific code repository. |
| Open Datasets | Yes | We instantiate Lean-STaR using the best available open language model pre-trained on the Lean corpus (InternLM2-Math-base-7b (Ying et al., 2024)), and follow standard practice in using Lean's Mathlib as the underlying training set (via the LeanDojo dataset (Yang et al., 2023)). We generate an initial set of thoughts for Mathlib using GPT-4, perform two rounds of expert iteration, then evaluate the model on miniF2F (Zheng et al., 2021) and LeanDojo (Yang et al., 2023), the de-facto standard benchmark for evaluating language-model based theorem provers. |
| Dataset Splits | Yes | We evaluate our method on the miniF2F benchmark (Zheng et al., 2021). We use a similar evaluation setting as previous works (Yang et al., 2023; Welleck & Saha, 2023; Ying et al., 2024), but use our sampling method instead of best-first search for the evaluation of our thought-augmented theorem proving model as discussed in (§3.3). ... We use a random subset of LeanDojo4-v9-test (novel premises) with a size of 320 as the test set of LeanDojo. |
| Hardware Specification | Yes | We collected 32,231 different (proof state, thoughts, next-tactic) pairs in successful proof trajectories during expert iteration, which takes about 4 days with 8 A100 GPUs. |
| Software Dependencies | No | The paper names specific language models and theorem-proving tools (e.g., Lean, GPT-4-0125, InternLM2-Math-base-7b) but does not provide a comprehensive list of other software components, such as programming languages, libraries, or operating systems, with the specific version numbers required for replication. |
| Experiment Setup | Yes | We fine-tune for 1 epoch to obtain the SFT model. For the learning rate, we use a warmup in the first 20% steps from 0 to 2×10⁻⁵, followed by a cosine schedule decaying to zero. ... In each iteration we use sampling on the LeanDojo Benchmark 4 dataset, and save the (state, thought, tactic) examples that are part of successful proofs. For each problem, we sample K = 32 times in parallel with temperature T = 1.0, and limit the number of times a tactic can be generated to a total of N = 5 per problem. Also, sampling is limited to 1 minute per problem. |
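The learning-rate schedule quoted in the Experiment Setup row (linear warmup over the first 20% of steps from 0 to 2×10⁻⁵, then cosine decay to zero) can be sketched as below. This is a minimal illustration of that description, not code from the paper; the function and parameter names are hypothetical.

```python
import math

def lr_schedule(step, total_steps, peak_lr=2e-5, warmup_frac=0.2):
    """Linear warmup to peak_lr over the first warmup_frac of steps,
    then cosine decay from peak_lr down to zero."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to peak_lr.
        return peak_lr * step / max(warmup_steps, 1)
    # Cosine phase: progress goes 0 -> 1 over the remaining steps,
    # so the rate goes peak_lr -> 0.
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For example, with 100 total steps the rate rises to 2e-5 at step 20 and returns to 0 at step 100.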