Herald: A Natural Language Annotated Lean 4 Dataset
Authors: Guoxiong Gao, Yutong Wang, Jiedong Jiang, Qi Gao, Zihan Qin, Tianyi Xu, Bin Dong
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Herald translator achieves a 96.7% accuracy (Pass@128) on formalizing statements in the miniF2F-test and a 23.5% accuracy on our internal graduate-level textbook dataset, outperforming InternLM2-Math-Plus-7B (73.0% and 7.5%) and TheoremLlama (50.1% and 4.0%). Furthermore, we propose a section-level translation framework for real-world applications. As a direct application of Herald translator, we have successfully translated a template section in the Stacks Project, marking notable progress in the automatic formalization of graduate-level mathematical literature. Our model, along with the datasets, are open-sourced to the public. |
| Researcher Affiliation | Academia | 1Peking University 2National University of Singapore 3Center for Data Science, Peking University 4Beijing International Center for Mathematical Research and the New Cornerstone Science Laboratory, Peking University 5Center for Machine Learning Research, Peking University |
| Pseudocode | No | The paper describes its methodology in prose and through diagrams (Figure 1), but does not include any explicitly labeled pseudocode or algorithm blocks. The Lean code examples in the appendix are actual code demonstrations, not pseudocode for the described methodology. |
| Open Source Code | Yes | Our model, along with the datasets, are open-sourced to the public. The model can be found at https://huggingface.co/FrenzyMath/Herald_translator. The code of Herald and LeanSearch can be found at https://github.com/frenzymath/herald_translator. |
| Open Datasets | Yes | Our model, along with the datasets, are open-sourced to the public. ...We present the Herald dataset, generated from our pipeline on Mathlib4, containing 580k valid statements and 44k NL-FL theorem pairs. ...miniF2F (Zheng et al., 2022) A widely-used benchmark dataset for formal mathematics. |
| Dataset Splits | Yes | Model miniF2F Extract Theorem College CoT test valid TheoremLlama... The last two datasets (Extract Theorem and College CoT) are shuffled subsets of 200 samples each. ... The ProofNet dataset... We perform 128 parallel translations of the ProofNet validation set (with 185 natural language statements)... |
| Hardware Specification | No | The paper mentions using specific models like DeepSeek-Prover-V1.5-Base 7B and DeepSeek-Prover-V1.5 for its experiments and describes training configurations (e.g., "16-pass setting", "4 × 512 sample budget"), but it does not provide specific hardware details such as GPU or CPU models used for these experiments. |
| Software Dependencies | No | The paper mentions "DeepSeek-Prover-V1.5-Base 7B" as the base model used in its experiments, but it does not provide specific version numbers for programming languages or other software libraries and dependencies used in the implementation. |
| Experiment Setup | Yes | Our fine-tuning process consisted of two stages: first, we conducted a 2000-step warm-up using the OpenHermes2.5 dataset, followed by training on the mixed dataset. We used a learning rate of 4e-5 with a cosine decay schedule across 5 training epochs. ... The generation was completed efficiently using a 16-pass setting... For prover configuration, we use DeepSeek-Prover-V1.5-RL + RMaxTS with a 4 × 512 sample budget |
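The two-stage fine-tuning recipe quoted in the Experiment Setup row (a 2000-step warm-up on OpenHermes2.5, then training on the mixed dataset at learning rate 4e-5 with cosine decay over 5 epochs) can be summarized as a configuration sketch. This is a minimal, hypothetical reconstruction for readability; the class and field names are our own assumptions, not the authors' released training code.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical config mirroring the hyperparameters stated in the paper.
# Names (Stage, FinetuneConfig, etc.) are illustrative assumptions only.

@dataclass
class Stage:
    dataset: str
    max_steps: Optional[int] = None   # train for a fixed step count (warm-up)
    num_epochs: Optional[int] = None  # or train for a number of epochs

@dataclass
class FinetuneConfig:
    learning_rate: float = 4e-5       # stated in the paper
    lr_schedule: str = "cosine"       # cosine decay schedule
    stages: Tuple[Stage, ...] = (
        Stage("OpenHermes2.5", max_steps=2000),      # stage 1: warm-up
        Stage("mixed NL-FL dataset", num_epochs=5),  # stage 2: main training
    )

cfg = FinetuneConfig()
print(cfg.learning_rate, cfg.lr_schedule, cfg.stages[0].max_steps)
```

The split into a step-bounded warm-up stage and an epoch-bounded main stage reflects the paper's description; the reported prover-side settings (16-pass generation, DeepSeek-Prover-V1.5-RL + RMaxTS sample budget) are separate inference-time parameters not captured here.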