reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Position: Formal Mathematical Reasoning—A New Frontier in AI

Authors: Kaiyu Yang, Gabriel Poesia, Jingxuan He, Wenda Li, Kristin E. Lauter, Swarat Chaudhuri, Dawn Song

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Theoretical	This position paper advocates formal mathematical reasoning as an indispensable component in future AI for math, formal verification, and verifiable generation. We summarize existing progress, discuss open challenges, and envision critical milestones to measure future success.
Researcher Affiliation	Collaboration	1Meta FAIR 2Stanford University 3UC Berkeley 4University of Edinburgh 5UT Austin. Correspondence to: Kaiyu Yang <EMAIL>, Dawn Song <EMAIL>.
Pseudocode	No	The paper describes conceptual architectures and workflows (e.g., Figure 5 "A common architecture for neural theorem proving") but does not present structured pseudocode or algorithm blocks.
Open Source Code	No	The paper does not provide an explicit statement or link for the source code related to its own methodology. It mentions various external tools and systems used by other research (e.g., Lean, SymPy, Alpha Proof) but does not release its own code.
Open Datasets	Yes	benchmarks such as GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021). ... The Proof Pile dataset (Azerbayev et al., 2024)... Numina Math: The largest public dataset in AI4Maths with 860k pairs of competition math problems and solutions. https://github.com/projectnumina/aimo-progress-prize/blob/main/report/numina_dataset.pdf, 2024a.
Dataset Splits	No	The paper is a position paper and does not describe specific experiments or provide dataset split information for its own work. It refers to existing benchmarks and datasets (e.g., MATH, GSM8K) but does not detail how data would be split for a specific experimental setup within this paper.
Hardware Specification	No	The paper does not provide specific hardware details for any experiments conducted by the authors, as it is a position paper summarizing existing progress and future directions.
Software Dependencies	No	The paper discusses various formal systems and external tools (e.g., Lean, Sym Py) in a general context but does not specify particular software dependencies with version numbers required to reproduce any work presented within this position paper.
Experiment Setup	No	The paper does not provide specific experimental setup details, hyperparameters, or training configurations, as it is a position paper outlining perspectives and challenges, not presenting new experimental results.