Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
ProofAug: Efficient Neural Theorem Proving via Fine-grained Proof Structure Analysis
Authors: Haoxiong Liu, Jiacheng Sun, Zhenguo Li, Andrew C. Yao
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The superiority of our method is validated on the miniF2F benchmark using the open-source deepseek-math-7b-base model and the Isabelle proof assistant. Notably, by additionally employing a mixed prompting strategy, we achieve a cumulative pass rate of 66.0% after curation of the dataset (61.9% for the original version) with at most 2100 queries to the model per problem |
| Researcher Affiliation | Collaboration | 1Institute for Interdisciplinary Information Sciences, Tsinghua University; 2Huawei Noah's Ark Lab; 3Shanghai Qi Zhi Institute. Correspondence to: Haoxiong Liu <EMAIL>, Zhenguo Li <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Find the Maximal Compatible Semi-Proof |
| Open Source Code | Yes | Our code is available at https://github.com/haoxiongliu/ProofAug. |
| Open Datasets | Yes | The superiority of our method is validated on the miniF2F benchmark using the open-source deepseek-math-7b-base model and the Isabelle proof assistant. On miniF2F-test (Zheng et al., 2021), ProofAug achieves a pass rate of 52.5% with no more than 100 queries to the model. You can find our curated version here: https://github.com/haoxiongliu/ProofAug/blob/main/isabelle_src/datasets/minif2f-test-curated.jsonl. |
| Dataset Splits | Yes | We evaluate our methods on the Isabelle part of miniF2F (Zheng et al., 2021), a benchmark for formal Olympiad-level mathematics problems across four different formal languages. Unless otherwise specified, we use miniF2F-test, the original version of miniF2F-test, or the DSP-version miniF2F-test to refer to the dataset included in the code repository of Jiang et al. (2023)...So, we build a curated dataset and part of our experiments are done on the curated version. On miniF2F-valid, ProofAug w/ and w/o the ERP module achieves 58.6% and 57.0% in 100 attempts of proof respectively. |
| Hardware Specification | Yes | During verification, we run 12 PISA instances in parallel on one Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz. The language model we use across all our experiments is deepseek-math-7b-base (Shao et al., 2024), run on an Nvidia GeForce RTX 3090 with vLLM (Kwon et al., 2023). |
| Software Dependencies | Yes | The Isabelle version we use is Isabelle2022. We use the PISA environment (Jiang et al., 2021) to interact with Isabelle. The language model we use across all our experiments is deepseek-math-7b-base (Shao et al., 2024), run on an Nvidia GeForce RTX 3090 with vLLM (Kwon et al., 2023). |
| Experiment Setup | Yes | Following Jiang et al. (2023), we set the timeout for any proof step and Sledgehammer as 10s and 120s, respectively. Unless otherwise specified, we follow Jiang et al. (2023) to use a sampling temperature T = 0.6 with top-p = 0.95. The default max number of tokens of the response is set to 2048, and when the length of the prompt exceeds 2048, we adjust the max number of tokens to 4096 − #tokens of the prompt. |
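The max-token adjustment quoted in the Experiment Setup row can be sketched as a small helper. This is a hypothetical illustration of the rule as stated (default 2048 response tokens; for prompts longer than 2048 tokens, allow 4096 minus the prompt length), not code from the ProofAug repository; the function and constant names are assumptions.

```python
# Sampling settings as quoted from the paper (used with vLLM).
SAMPLING = {"temperature": 0.6, "top_p": 0.95}

def response_token_budget(prompt_tokens: int,
                          default_max: int = 2048,
                          extended_total: int = 4096) -> int:
    """Response token budget per the quoted setup: 2048 by default,
    but when the prompt exceeds 2048 tokens, allow up to
    (4096 - prompt_tokens) tokens for the response."""
    if prompt_tokens <= default_max:
        return default_max
    # Clamp at zero so an over-long prompt never yields a negative budget.
    return max(extended_total - prompt_tokens, 0)

if __name__ == "__main__":
    print(response_token_budget(1500))  # short prompt: default budget, 2048
    print(response_token_budget(3000))  # long prompt: 4096 - 3000 = 1096
```

This keeps the total of prompt plus response within a 4096-token context window whenever the prompt overflows the default response budget.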