Alchemy: Amplifying Theorem-Proving Capability Through Symbolic Mutation
Authors: Shaonan Wu, Shuai Lu, Yeyun Gong, Nan Duan, Ping Wei
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate the effectiveness of our approach, achieving a 4.70% absolute performance improvement on the LeanDojo benchmark. Additionally, our approach achieves a 2.47% absolute performance gain on the out-of-distribution miniF2F benchmark based on the synthetic data. |
| Researcher Affiliation | Collaboration | Shaonan Wu 1,2, Shuai Lu 3, Yeyun Gong 3, Nan Duan 3, Ping Wei 1,2; 1 National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, 2 Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, 3 Microsoft Research Asia. {shaonanwu@stu.,pingwei@}xjtu.edu.cn, EMAIL |
| Pseudocode | Yes | Algorithm 1 Find invocable theorems Algorithm 2 Construct new statement for rw tactic Algorithm 3 Construct new statement for apply tactic |
| Open Source Code | Yes | The code is available at https://github.com/wclsn/Alchemy. |
| Open Datasets | Yes | Building upon the advanced Lean theorem prover (de Moura et al., 2015), we introduce a general method that synthesizes theorems directly in symbolic space. ... with the mathematical library of Lean, Mathlib, as seed data... We demonstrate the effectiveness of our method by evaluating the theorem-proving capability of these provers on the challenging LeanDojo benchmark (Yang et al., 2023). Our synthetic data improve the performance by 4.70% (over 70 theorems) on the novel premises split. Furthermore, the synthesized data exhibit promise in enhancing the out-of-distribution theorem-proving ability of LLMs, as evidenced by a performance increase of about 2.47% on the competition-level miniF2F benchmark (Zheng et al., 2022). |
| Dataset Splits | Yes | We evaluate the theorem prover on two benchmarks that are widely adopted by the research community: 1) LeanDojo Benchmark (Yang et al., 2023), which shares the same distributional characteristics as the seed theorems; 2) miniF2F (Zheng et al., 2022)... We remove the subsets of theorems for both splits that cannot be initialized by LeanDojo. There remain 1,929 theorems in the random split and 1,659 theorems in the novel premises split. |
| Hardware Specification | Yes | All experiments are conducted on 8 H100 GPUs. We employ a linear learning rate scheduler with a 3% warm-up period and a maximum learning rate of 2e-5. We set the global batch size to 256 and the cutoff length to 2,048. All models are trained using DeepSpeed ZeRO Stage 3 (Rajbhandari et al., 2021) and Flash-Attention 2 (Dao, 2024). ... The experiments run on a large collection of CPUs (512 8-core for the rw tactic and 256 8-core for apply). The substantial CPU requirement is largely due to the memory-intensive nature of LeanDojo, which hinders multiprocessing on a single node. ... 512 CPU nodes, each node has 8 cores and 56GB RAM |
| Software Dependencies | Yes | We choose Mathlib4, which contains around 110k theorems, as the seed data for data synthesis. Our synthesis pipeline is built upon LeanDojo (Yang et al., 2023), a Python module... We choose 1×32×100 as our search setting. The evaluation script is modified from an open-source implementation (Welleck, 2023) which is based on vLLM (Kwon et al., 2023) and LeanDojo (Yang et al., 2023). We utilize the LeanDojo Benchmark (Yang et al., 2023)... We upgrade the tool-chain version of miniF2F (Zheng et al., 2022) to v4.6.0-rc1. ... We select Llama-3-8B (Dubey et al., 2024) and deepseek-coder-base-v1.5 7B (Guo et al., 2024) as our base models. ... All models are trained using DeepSpeed ZeRO Stage 3 (Rajbhandari et al., 2021) and Flash-Attention 2 (Dao, 2024). We utilize the open-sourced codebase Llama-Factory (Zheng et al., 2024) for all training experiments. |
| Experiment Setup | Yes | We conduct continual pretraining with the next-token prediction objective for one epoch. Then we fine-tune the models with the proofstep prediction objective (Polu & Sutskever, 2020) for two epochs. All experiments are conducted on 8 H100 GPUs. We employ a linear learning rate scheduler with a 3% warm-up period and a maximum learning rate of 2e-5. We set the global batch size to 256 and the cutoff length to 2,048. ... We use best-first search as our search strategy with a 10-minute timeout. The search budget is represented as attempt × sample × step. Here attempt denotes the number of attempts, sample denotes the number of generated tactics per iteration, and step denotes the maximum number of steps per attempt. We choose 1×32×100 as our search setting. |
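The attempt × sample × step budget described in the Experiment Setup row can be sketched as a generic best-first search driver. This is a hedged illustration only: the prover environment, tactic generator, and scoring here are toy stand-ins, not the paper's actual LeanDojo interface or model; only the budget structure (attempts, samples per iteration, max steps, e.g. 1×32×100) follows the quoted description.

```python
import heapq
import itertools

def best_first_search(init_state, generate_tactics, apply_tactic, is_proved,
                      attempts=1, samples=32, max_steps=100):
    """Best-first search over proof states under an attempt/sample/step budget.

    Returns a list of tactics reaching a proved state, or None on failure.
    States are ranked by cumulative negative log-probability of their tactics.
    """
    for _ in range(attempts):
        tie = itertools.count()  # tie-breaker so the heap never compares states
        frontier = [(0.0, next(tie), init_state, [])]
        seen = {init_state}
        steps = 0
        while frontier and steps < max_steps:
            score, _, state, path = heapq.heappop(frontier)
            steps += 1
            # Sample up to `samples` candidate tactics for this state.
            for tactic, logprob in generate_tactics(state, samples):
                nxt = apply_tactic(state, tactic)
                if nxt is None or nxt in seen:
                    continue  # tactic failed or state already expanded
                if is_proved(nxt):
                    return path + [tactic]
                seen.add(nxt)
                # Lower cumulative -logprob is explored first.
                heapq.heappush(frontier,
                               (score - logprob, next(tie), nxt, path + [tactic]))
    return None

# Toy environment (hypothetical, for illustration): states are integers,
# the goal is reaching 0, and "dec" is the only tactic that makes progress.
def generate_tactics(state, k):
    return [("dec", -0.1), ("noop", -2.0)][:k]

def apply_tactic(state, tactic):
    return state - 1 if tactic == "dec" else None

proof = best_first_search(5, generate_tactics, apply_tactic, lambda s: s == 0)
# → ['dec', 'dec', 'dec', 'dec', 'dec']
```

In the paper's 1×32×100 setting this driver would make a single attempt, request 32 tactic candidates per expansion, and stop after 100 expansions or the 10-minute timeout, whichever comes first.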