miniCTX: Neural Theorem Proving with (Long-)Contexts

Authors: Jiewen Hu, Thomas Zhu, Sean Welleck

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate several baselines on miniCTX, demonstrating the importance of context in real-world theorem proving. We study performance along various axes, including premise dependency, context length, difficulty, and the type of information in the context. Our experiments show that file-tuned models, which can utilize context at evaluation time, improve drastically over traditional state-tactic models in the context-dependent setting. Moreover, we discover that this real-world performance boost cannot be readily measured by existing benchmarks such as miniF2F.
Researcher Affiliation | Academia | Jiewen Hu, Thomas Zhu, Sean Welleck (Carnegie Mellon University)
Pseudocode | No | The paper describes methods and processes in paragraph text and does not include any clearly labeled pseudocode blocks or algorithms.
Open Source Code | Yes | Project page: https://cmu-l3.github.io/minictx. "Please refer to our project page for our dataset and evaluation links, and future updates including miniCTX-v2." The authors open-source the dataset and evaluation code. NTP-TOOLKIT is released on GitHub (cmu-l3/ntp-toolkit) under the MIT license.
Open Datasets | Yes | "We open-source the dataset and evaluation code." miniCTX is released on Hugging Face (l3lab/miniCTX), distributed under the Apache 2.0 license.
Dataset Splits | Yes | "miniCTX is currently based on 762 theorems from six projects... These theorems are equally split into 381 validation and 381 test theorems." The authors also ran NTP-TOOLKIT's next-tactic extraction on a 2023 snapshot of Mathlib, yielding 307,049 examples, and report a separate corpus "split into 583k train, 15k dev, and 15k test examples."
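The report states that the 762 miniCTX theorems are divided equally into validation and test halves. The exact split procedure is not described, so the sketch below assumes a seeded shuffle; the function name and seed are hypothetical, only the counts (762 → 381/381) come from the source.

```python
import random

def split_val_test(theorems, seed=0):
    """Split a theorem list equally into validation and test halves.

    Illustrative sketch only: miniCTX reports 381 validation and
    381 test theorems; the actual split method is not specified,
    so a seeded shuffle is assumed here for reproducibility.
    """
    items = list(theorems)
    random.Random(seed).shuffle(items)
    mid = len(items) // 2
    return items[:mid], items[mid:]

val_split, test_split = split_val_test(range(762))
print(len(val_split), len(test_split))  # 381 381
```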
Hardware Specification | No | The paper mentions training models and evaluating them, but does not provide specific hardware details such as GPU or CPU models used for these experiments.
Software Dependencies | Yes | "We choose Lean (Moura & Ullrich, 2021) as the verifier... we recommend evaluating projects that use Lean version 4.7.0 or higher (all miniCTX theorems are in 4.7.0)."
Experiment Setup | Yes | "For the file-tuning examples, as an initial method for handling the long Lean files, we either truncate the middle of an input file so that the file content is 1024 tokens, or take only the preceding 1024 tokens, with the strategy selected at random for each example." The authors generate 8 proof samples per theorem and measure the pass@8 proof rate, use a best-first search to construct full proofs (Han et al., 2022; Yang et al., 2023; Polu & Sutskever, 2020), and use the premise retriever provided by LeanDojo (Yang et al., 2023) to identify the top 20 most relevant definitions or lemmas from imported modules and append them to the in-file context.
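The two context-truncation strategies quoted above (cut out the middle of the file, or keep only the preceding tokens, chosen at random per example) can be sketched as follows. The tokenizer and exact head/tail proportions are assumptions; the paper specifies only the 1024-token budget and the random choice between the two strategies.

```python
import random

def truncate_context(tokens, max_len=1024, rng=random):
    """Shorten a long tokenized Lean file to at most max_len tokens.

    Follows the two strategies described in the report:
      1. truncate the middle of the file (keep its head and tail), or
      2. keep only the max_len tokens preceding the theorem,
    with the strategy picked at random for each example.
    The 50/50 split between head and tail in strategy 1 is an assumption.
    """
    if len(tokens) <= max_len:
        return tokens
    if rng.random() < 0.5:
        # Strategy 1: drop the middle, keeping the start and end of the file.
        head = max_len // 2
        tail = max_len - head
        return tokens[:head] + tokens[-tail:]
    # Strategy 2: keep only the tokens immediately preceding the theorem.
    return tokens[-max_len:]

context = truncate_context(list(range(3000)))
print(len(context))  # 1024
```

Either strategy yields exactly 1024 tokens for long files, while files already within budget pass through unchanged.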