PaperBench: Evaluating AI’s Ability to Replicate AI Research

Authors: Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, Tejal Patwardhan

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments... We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%.
Researcher Affiliation Industry OpenAI, San Francisco, USA. Correspondence to: Giulio Starace <EMAIL>.
Pseudocode No The paper describes the PaperBench benchmark, its evaluation methodology, and results from evaluating AI agents. It details prompts and system messages for the judge and agents but does not present any formal pseudocode or algorithm blocks within its own content for a method or procedure.
Open Source Code Yes We open-source our code to facilitate future research in understanding the AI engineering capabilities of AI agents.
Open Datasets Yes Our contributions include: PaperBench: a benchmark of 20 ML research papers and author-approved rubrics, and an automated grading workflow using LLM-based judges. PaperBench Code-Dev: a more lightweight variant of the benchmark... JudgeEval: a dataset of human-graded submissions, which can be used as an auxiliary evaluation for the development and assessment of automated judges. We open-source our code to facilitate future research in understanding the AI engineering capabilities of AI agents.
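The rubric-and-judge workflow quoted above can be made concrete with a small sketch: each paper's rubric is a weighted tree whose leaves receive pass/fail verdicts from an LLM judge, and scores are averaged upward to produce a single replication score. This is a minimal illustration under assumed node names and a weighted-average aggregation; it is not PaperBench's actual implementation.

```python
# Hedged sketch: aggregating a hierarchical rubric into a replication score.
# The weighting scheme and example nodes are illustrative assumptions,
# not PaperBench's exact code.
from dataclasses import dataclass, field

@dataclass
class RubricNode:
    weight: float = 1.0
    passed: bool = False                  # judge verdict (leaf nodes only)
    children: list["RubricNode"] = field(default_factory=list)

def score(node: RubricNode) -> float:
    """Leaf: 1.0 if the judge marked it passed, else 0.0.
    Internal node: weight-normalized average of child scores."""
    if not node.children:
        return 1.0 if node.passed else 0.0
    total_weight = sum(c.weight for c in node.children)
    return sum(c.weight * score(c) for c in node.children) / total_weight

# Example: one paper with two top-level requirements
paper = RubricNode(children=[
    RubricNode(weight=2.0, passed=True),   # e.g. code development judged correct
    RubricNode(weight=1.0, children=[      # e.g. experiment execution, two checks
        RubricNode(passed=True),
        RubricNode(passed=False),
    ]),
])
print(round(score(paper), 3))  # → 0.833
```

Averaging over many such per-paper scores is one plausible way to arrive at a benchmark-level figure like the 21.0% reported for the best agent.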
Dataset Splits No The paper describes a benchmark (PaperBench) consisting of 20 ML papers and a separate evaluation dataset (JudgeEval) using partial replications of 5 papers. It details the selection process for these papers but does not provide traditional train/test/validation dataset splits with percentages or sample counts for any dataset used within the paper's own experimental evaluations.
Hardware Specification Yes When the candidate's task attempt ends, we copy its submission to a fresh VM running an Ubuntu 24.04 image with access to an A10 GPU. In our experiments, we run each agent in an Ubuntu 24.04 Docker container that has access to a single A10 GPU.
Software Dependencies No The paper mentions using 'Ubuntu 24.04' for the operating system and names 'Inspect AI's basic agent' and 'nanoeval' for orchestration, but does not provide specific version numbers for these or other key software libraries (e.g., Python, PyTorch, TensorFlow) that would be essential for reproducing the machine learning experiments.
Experiment Setup Yes In our experiments, we run each agent in an Ubuntu 24.04 Docker container that has access to a single A10 GPU. The agent's local work directory contains the paper in PDF and Markdown format, the paper's addendum, and a text file containing instructions... We give agents a maximum run-time of 12 hours. We provide the agent with an API key for Hugging Face and the OpenAI API with $1000 loaded so it can make use of those services during its run.
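The sandbox described in this row (an Ubuntu 24.04 container, a single A10 GPU, a 12-hour wall-clock cap) can be sketched as a container launch command. The image tag, mount path, entrypoint name, and use of `timeout` are illustrative assumptions; the paper does not publish its exact invocation.

```python
# Hedged sketch: building a `docker run` command matching the setup described
# above. Mount paths, image, and entrypoint are assumptions for illustration.
import shlex

MAX_RUNTIME_S = 12 * 60 * 60  # 12-hour run-time cap stated in the paper

def build_docker_cmd(workdir: str, image: str = "ubuntu:24.04") -> list[str]:
    """Return the argv for launching one agent attempt in its sandbox."""
    return [
        "docker", "run", "--rm",
        "--gpus", "device=0",              # single A10 GPU
        "-v", f"{workdir}:/home/agent",    # paper PDF/Markdown, addendum, instructions
        "-w", "/home/agent",
        image,
        "timeout", str(MAX_RUNTIME_S),     # enforce the 12-hour limit
        "bash", "entrypoint.sh",           # hypothetical agent entrypoint
    ]

print(shlex.join(build_docker_cmd("/tmp/run1")))
```

In practice the API keys mentioned in the row (Hugging Face, OpenAI) would be passed into the container as environment variables rather than baked into the image.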