Herald: A Natural Language Annotated Lean 4 Dataset
Authors: Guoxiong Gao, Yutong Wang, Jiedong Jiang, Qi Gao, Zihan Qin, Tianyi Xu, Bin Dong
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Herald translator achieves a 96.7% accuracy (Pass@128) on formalizing statements in the miniF2F-test and a 23.5% accuracy on our internal graduate-level textbook dataset, outperforming InternLM2-Math-Plus-7B (73.0% and 7.5%) and TheoremLlama (50.1% and 4.0%). Furthermore, we propose a section-level translation framework for real-world applications. As a direct application of Herald translator, we have successfully translated a template section in the Stacks Project, marking notable progress in the automatic formalization of graduate-level mathematical literature. Our model, along with the datasets, are open-sourced to the public. |
| Researcher Affiliation | Academia | 1Peking University 2National University of Singapore 3Center for Data Science, Peking University 4Beijing International Center for Mathematical Research and the New Cornerstone Science Laboratory, Peking University 5Center for Machine Learning Research, Peking University |
| Pseudocode | No | The paper describes its methodology in prose and through diagrams (Figure 1), but does not include any explicitly labeled pseudocode or algorithm blocks. The Lean code examples in the appendix are actual code demonstrations, not pseudocode for the described methodology. |
| Open Source Code | Yes | Our model, along with the datasets, are open-sourced to the public. The model can be found at https://huggingface.co/FrenzyMath/Herald_translator. The code of Herald and LeanSearch can be found at https://github.com/frenzymath/herald_translator. |
| Open Datasets | Yes | Our model, along with the datasets, are open-sourced to the public. ...We present the Herald dataset, generated from our pipeline on Mathlib4, containing 580k valid statements and 44k NL-FL theorem pairs. ...miniF2F (Zheng et al., 2022) A widely-used benchmark dataset for formal mathematics. |
| Dataset Splits | Yes | Model miniF2F Extract Theorem College CoT test valid TheoremLlama... The last two datasets (Extract Theorem and College CoT) are shuffled subsets of 200 samples each. ... The ProofNet dataset... We perform 128 parallel translations of the ProofNet validation set (with 185 natural language statements)... |
| Hardware Specification | No | The paper mentions using specific models like DeepSeek-Prover-V1.5-Base 7B and DeepSeek-Prover-V1.5 for its experiments and describes training configurations (e.g., "16-pass setting", "4 × 512 sample budget"), but it does not provide specific hardware details such as GPU or CPU models used for these experiments. |
| Software Dependencies | No | The paper mentions "DeepSeek-Prover-V1.5-Base 7B" as the base model used in its experiments, but it does not provide specific version numbers for programming languages or other software libraries and dependencies used in the implementation. |
| Experiment Setup | Yes | Our fine-tuning process consisted of two stages: first, we conducted a 2000-step warm-up using the OpenHermes2.5 dataset, followed by training on the mixed dataset. We used a learning rate of 4e-5 with a cosine decay schedule across 5 training epochs. ... The generation was completed efficiently using a 16-pass setting... For prover configuration, we use DeepSeek-Prover-V1.5-RL + RMaxTS with a 4 × 512 sample budget |
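The two-stage fine-tuning recipe quoted in the Experiment Setup row (a 2000-step warm-up on OpenHermes2.5, then training on the mixed dataset at learning rate 4e-5 with cosine decay over 5 epochs) can be summarized as a configuration sketch. This is a minimal, hypothetical reconstruction for readability; the class and field names are our own assumptions, not the authors' released training code.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical config mirroring the hyperparameters stated in the paper.
# Names (Stage, FinetuneConfig, etc.) are illustrative assumptions only.

@dataclass
class Stage:
    dataset: str
    max_steps: Optional[int] = None   # train for a fixed step count (warm-up)
    num_epochs: Optional[int] = None  # or train for a number of epochs

@dataclass
class FinetuneConfig:
    learning_rate: float = 4e-5       # stated in the paper
    lr_schedule: str = "cosine"       # cosine decay schedule
    stages: Tuple[Stage, ...] = (
        Stage("OpenHermes2.5", max_steps=2000),      # stage 1: warm-up
        Stage("mixed NL-FL dataset", num_epochs=5),  # stage 2: main training
    )

cfg = FinetuneConfig()
print(cfg.learning_rate, cfg.lr_schedule, cfg.stages[0].max_steps)
```

The split into a step-bounded warm-up stage and an epoch-bounded main stage reflects the paper's description; the reported prover-side settings (16-pass generation, DeepSeek-Prover-V1.5-RL + RMaxTS sample budget) are separate inference-time parameters not captured here.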