Neural Interactive Proofs

Authors: Lewis Hammond, Sam Adam-Day

ICLR 2025

Reproducibility assessment (variable, result, and supporting excerpt or explanation):
Research Type: Experimental
"Finally, we support this theory with experiments in two domains: a toy graph isomorphism problem that illustrates the key ideas, and a code validation task using large language models. In so doing, we aim to create a foundation for future work on neural interactive proofs and their application in building safer AI systems."
Researcher Affiliation: Academia
"Lewis Hammond EMAIL Sam Adam-Day EMAIL Department of Computer Science, University of Oxford, Oxford, United Kingdom"
Pseudocode: No
The paper describes theoretical frameworks and experimental procedures but does not include any clearly labelled pseudocode or algorithm blocks; procedural descriptions are given only in narrative text.
Open Source Code: Yes
"(iv) a well-documented codebase for testing different protocols in different domains, available at https://github.com/SamAdamDay/neural-interactive-proofs."
Open Datasets: Yes
"Our second experiment involves a much more complex problem: checking that a given Python program satisfies a natural language specification. In particular, we make use of the Automated Programming Progress Standard (APPS) dataset (Hendrycks et al., 2021)"
Dataset Splits: Yes
Two splits are quoted: "The train-test split is 80:20." and "The train-test split of the eventual dataset is 90:10."
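The quoted ratios leave the splitting mechanism unspecified. A minimal sketch of a deterministic shuffle-and-slice split at those ratios follows; the seed and the shuffling method are assumptions for illustration, not details from the paper.

```python
import random

def train_test_split(examples, train_frac, seed=0):
    """Deterministically shuffle, then slice into train/test at train_frac."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# 80:20 split (as quoted for one experiment), 90:10 (as quoted for the other)
graph_train, graph_test = train_test_split(list(range(1000)), 0.8)
code_train, code_test = train_test_split(list(range(1000)), 0.9)
```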
Hardware Specification: No
The paper mentions using GPT-4o and GPT-4o-mini models and the OpenAI fine-tuning API, but it does not specify the underlying hardware (e.g., GPU/CPU models, memory) used for running the experiments or training these models.
Software Dependencies: No
The paper mentions using algorithms like independent PPO and expert iteration, and models like GPT-4o and GPT-4o-mini via the OpenAI API. However, it does not specify version numbers for any key software components, libraries, or APIs used for implementation.
Experiment Setup: Yes
"We use a clipped objective with value ϵ = 0.2, with hyperparameters γ = 0.95 and λ = 0.95. We additionally use advantage normalisation and entropy regularisation with coefficient 0.001. The learning rate is 0.003. For each protocol we train across 10 seeds for 5,000 steps." "We train both provers and verifiers via the OpenAI fine-tuning API using expert iteration for eight rounds (Anthony et al., 2017). This works by fine-tuning the models in each round on the rollouts on which they received positive reward. We use 10% of the underlying dataset at a time, iteratively adding positive examples to the fine-tuning dataset. Following Kirchner et al. (2024), we fine-tune each model from scratch in each iteration."
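The quoted expert-iteration procedure can be sketched as follows. This is a minimal illustration, not the authors' implementation: `fine_tune` and `rollout` are hypothetical stand-ins for the OpenAI fine-tuning API and the prover-verifier interaction, and only the structure (eight rounds, 10% dataset chunks per round, keeping positively rewarded rollouts, retraining from scratch on the accumulated positives) follows the paper's description.

```python
def expert_iteration(base_model, fine_tune, rollout, dataset,
                     rounds=8, chunk_frac=0.1):
    """Sketch of expert iteration (Anthony et al., 2017) as described:
    each round rolls out on a fresh chunk of the dataset, keeps the
    transcripts that earned positive reward, and fine-tunes a fresh
    model from scratch on all positives so far (Kirchner et al., 2024).
    `fine_tune(positives)` and `rollout(model, example)` are stand-ins."""
    positives = []
    model = base_model
    chunk_size = max(1, int(len(dataset) * chunk_frac))
    for r in range(rounds):
        # Use the next 10% of the underlying dataset.
        chunk = dataset[r * chunk_size:(r + 1) * chunk_size]
        for example in chunk:
            transcript, reward = rollout(model, example)
            if reward > 0:  # keep only positively rewarded rollouts
                positives.append(transcript)
        # Fine-tune from scratch each iteration on accumulated positives.
        model = fine_tune(positives)
    return model
```

With stub implementations of `fine_tune` and `rollout`, the loop reduces to iteratively filtering a dataset down to its positively rewarded examples, which is the core of the procedure the paper quotes.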