PaperBench: Evaluating AI’s Ability to Replicate AI Research
Authors: Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, Tejal Patwardhan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments... We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. |
| Researcher Affiliation | Industry | 1OpenAI, San Francisco, USA. Correspondence to: Giulio Starace <EMAIL>. |
| Pseudocode | No | The paper describes the PaperBench benchmark, its evaluation methodology, and results from evaluating AI agents. It details prompts and system messages for the judge and agents, but it does not present any formal pseudocode or algorithm blocks for a method or procedure. |
| Open Source Code | Yes | We open-source our code to facilitate future research in understanding the AI engineering capabilities of AI agents. |
| Open Datasets | Yes | Our contributions include: PaperBench: a benchmark of 20 ML research papers and author-approved rubrics, and an automated grading workflow using LLM-based judges. PaperBench Code-Dev: a more lightweight variant of the benchmark... JudgeEval: a dataset of human-graded submissions, which can be used as an auxiliary evaluation for the development and assessment of automated judges. We open-source our code to facilitate future research in understanding the AI engineering capabilities of AI agents. |
| Dataset Splits | No | The paper describes a benchmark (PaperBench) consisting of 20 ML papers and a separate evaluation dataset (JudgeEval) built from partial replications of 5 papers. It details the selection process for these papers but does not provide traditional train/validation/test splits with percentages or sample counts for any dataset used in the paper's own experimental evaluations. |
| Hardware Specification | Yes | When the candidate's task attempt ends, we copy its submission to a fresh VM running an Ubuntu 24.04 image with access to an A10 GPU. In our experiments, we run each agent in an Ubuntu 24.04 Docker container that has access to a single A10 GPU. |
| Software Dependencies | No | The paper mentions using 'Ubuntu 24.04' for the operating system and names 'Inspect AI's basic agent' and 'nanoeval' for orchestration, but does not provide specific version numbers for these or other key software libraries (e.g., Python, PyTorch, TensorFlow) that would be essential for reproducing the machine learning experiments. |
| Experiment Setup | Yes | In our experiments, we run each agent in an Ubuntu 24.04 Docker container that has access to a single A10 GPU. The agent's local work directory contains the paper in PDF and Markdown format, the paper's addendum, and a text file containing instructions... We give agents a maximum run-time of 12 hours. We provide the agent with an API key for Hugging Face and the OpenAI API with $1000 loaded so it can make use of those services during its run. |
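The Experiment Setup row above can be sketched as a host-side launch command. This is a minimal illustration, not the paper's actual harness: the image name (`paperbench-agent:latest`), mount path, and environment-variable names are assumptions; only the Ubuntu 24.04 base, the single GPU, the Hugging Face and OpenAI credentials, and the 12-hour limit come from the paper.

```shell
# Hypothetical sketch of the agent container described in the paper.
# The work directory holds the paper (PDF + Markdown), its addendum,
# and the instructions file; credentials are passed via env vars.
# `timeout 12h` enforces the 12-hour wall-clock limit from the host,
# and `--gpus 1` requests a single GPU (an A10 in the paper's setup).
timeout 12h docker run --rm \
  --gpus 1 \
  -v "$PWD/workdir:/home/agent" \
  -e HF_TOKEN="$HF_TOKEN" \
  -e OPENAI_API_KEY="$OPENAI_API_KEY" \
  paperbench-agent:latest
```

After the run (or the timeout), the paper copies the submission out of the container to a fresh VM for grading.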