miniCTX: Neural Theorem Proving with (Long-)Contexts

Authors: Jiewen Hu, Thomas Zhu, Sean Welleck

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate several baselines on miniCTX, demonstrating the importance of context in real-world theorem proving. We study performance along various axes, including premise dependency, context length, difficulty, and the type of information in the context. Our experiments show that file-tuned models, which can utilize context at evaluation time, improve drastically over traditional state-tactic models in the context-dependent setting. Moreover, we discover that this real-world performance boost cannot be readily measured by existing benchmarks such as miniF2F.
Researcher Affiliation | Academia | Jiewen Hu, Thomas Zhu, Sean Welleck (Carnegie Mellon University)
Pseudocode | No | The paper describes methods and processes in paragraph text and does not include any clearly labeled pseudocode blocks or algorithms.
Open Source Code | Yes | Project page: https://cmu-l3.github.io/minictx. "Please refer to our project page for our dataset and evaluation links, and future updates including miniCTX-v2." The authors open-source the dataset and evaluation code. NTP-TOOLKIT is released on GitHub (cmu-l3/ntp-toolkit) under the MIT license.
Open Datasets | Yes | "We open-source the dataset and evaluation code." miniCTX is released on Hugging Face (l3lab/miniCTX), distributed under the Apache 2.0 license.
Dataset Splits | Yes | "miniCTX is currently based on 762 theorems from six projects... These theorems are equally split into 381 validation and 381 test theorems." The authors also ran NTP-TOOLKIT's next-tactic extraction on a 2023 snapshot of Mathlib, yielding 307,049 examples, and report a separate corpus "split into 583k train, 15k dev, and 15k test examples."
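The report states that the 762 miniCTX theorems are divided equally into validation and test halves. The exact split procedure is not described, so the sketch below assumes a seeded shuffle; the function name and seed are hypothetical, only the counts (762 → 381/381) come from the source.

```python
import random

def split_val_test(theorems, seed=0):
    """Split a theorem list equally into validation and test halves.

    Illustrative sketch only: miniCTX reports 381 validation and
    381 test theorems; the actual split method is not specified,
    so a seeded shuffle is assumed here for reproducibility.
    """
    items = list(theorems)
    random.Random(seed).shuffle(items)
    mid = len(items) // 2
    return items[:mid], items[mid:]

val_split, test_split = split_val_test(range(762))
print(len(val_split), len(test_split))  # 381 381
```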
Hardware Specification | No | The paper mentions training models and evaluating them, but does not provide specific hardware details such as GPU or CPU models used for these experiments.
Software Dependencies | Yes | "We choose Lean (Moura & Ullrich, 2021) as the verifier... we recommend evaluating projects that use Lean version 4.7.0 or higher (all miniCTX theorems are in 4.7.0)."
Experiment Setup | Yes | "For the file-tuning examples, as an initial method for handling the long Lean files, we either truncate the middle of an input file so that the file content is 1024 tokens, or take only the preceding 1024 tokens, with the strategy selected at random for each example." The authors generate 8 proof samples per theorem and measure the pass@8 proof rate, use a best-first search to construct full proofs (Han et al., 2022; Yang et al., 2023; Polu & Sutskever, 2020), and use the premise retriever provided by LeanDojo (Yang et al., 2023) to identify the top 20 most relevant definitions or lemmas from imported modules and append them to the in-file context.
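The two context-truncation strategies quoted above (cut out the middle of the file, or keep only the preceding tokens, chosen at random per example) can be sketched as follows. The tokenizer and exact head/tail proportions are assumptions; the paper specifies only the 1024-token budget and the random choice between the two strategies.

```python
import random

def truncate_context(tokens, max_len=1024, rng=random):
    """Shorten a long tokenized Lean file to at most max_len tokens.

    Follows the two strategies described in the report:
      1. truncate the middle of the file (keep its head and tail), or
      2. keep only the max_len tokens preceding the theorem,
    with the strategy picked at random for each example.
    The 50/50 split between head and tail in strategy 1 is an assumption.
    """
    if len(tokens) <= max_len:
        return tokens
    if rng.random() < 0.5:
        # Strategy 1: drop the middle, keeping the start and end of the file.
        head = max_len // 2
        tail = max_len - head
        return tokens[:head] + tokens[-tail:]
    # Strategy 2: keep only the tokens immediately preceding the theorem.
    return tokens[-max_len:]

context = truncate_context(list(range(3000)))
print(len(context))  # 1024
```

Either strategy yields exactly 1024 tokens for long files, while files already within budget pass through unchanged.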