MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
Authors: Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Aleksander Madry, Lilian Weng
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We establish human baselines for each competition using Kaggle's publicly available leaderboards. We use open-source agent scaffolds to evaluate several frontier language models on our benchmark, finding that the best-performing setup, OpenAI's o1-preview with AIDE scaffolding, achieves at least the level of a Kaggle bronze medal in 16.9% of competitions. In addition to our main results, we investigate various forms of resource-scaling for AI agents and the impact of contamination from pre-training. |
| Researcher Affiliation | Industry | Chan Jun Shern*, Neil Chowdhury*, Oliver Jaffe*, James Aung*, Dane Sherburn*, Evan Mays*, Giulio Starace*, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, Aleksander Madry. *Work done while at OpenAI. |
| Pseudocode | No | The paper describes methodologies and experimental setups but does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We open-source our benchmark code (github.com/openai/mle-bench/) to facilitate future research in understanding the ML engineering capabilities of AI agents. |
| Open Datasets | Yes | The resulting benchmark consists of 75 diverse Kaggle competitions across a variety of domains... For each competition, we use the original dataset if publicly available... kaggle.com/datasets/kaggle/meta-kaggle (accessed May 15th, 2024) |
| Dataset Splits | Yes | For each competition, we use the original dataset if publicly available, although Kaggle competitions often do not release the test set even after the competition ends. In such cases, we manually create new train and test splits based on the publicly available training data. We take care to ensure that the distributions of the original and reconstructed test sets are similar by checking that the example submission scores similarly on both sets. We take the new test set to be 10% of the original train set, except for when it didn't make sense to do so. |
| Hardware Specification | Yes | On each run, agents have access to a machine with 36 vCPUs, 440GB RAM, 4095 GiB SSD, and a single Nvidia A10 GPU. In our experiments, unless otherwise stated, each agent is executed within a Microsoft Azure Standard_NV36ads_A10_v5 virtual machine, which has 36 AMD EPYC 74F3V (Milan) [x86-64] vCPUs, 440GB memory, and one Nvidia A10 GPU (24GB). |
| Software Dependencies | No | In our experiments, we run agents in an Ubuntu 20.04 Docker container containing the dataset, validation server, and Python packages that might be helpful for ML engineering. |
| Experiment Setup | Yes | For each of the 75 competitions, agents have a maximum of 24 hours to produce a submission. On each run, agents have access to a machine with 36 vCPUs, 440GB RAM, 4095 GiB SSD, and a single Nvidia A10 GPU. We repeat all experiments with 3 seeds (that is, 3 runs per competition) to compute the mean and standard error unless otherwise specified. Full details of our execution environment and scaffolds can be found in Appendices A.5 and A.6. |
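The dataset-split procedure quoted in the table above (holding out 10% of the original training data as a reconstructed test set when Kaggle's test set is unavailable) can be sketched as follows. This is an illustrative assumption about the mechanics, not the authors' released preparation code; `make_split` and the seeded shuffle are hypothetical names, and MLE-bench's actual per-competition scripts are more involved (including the distribution check against example-submission scores).

```python
import random

def make_split(train_examples, test_fraction=0.1, seed=0):
    """Hold out a fraction of the original training data as a new test set.

    Hypothetical sketch of the described procedure: deterministically
    shuffle the training examples, then carve off `test_fraction` of
    them as the reconstructed test split.
    """
    examples = list(train_examples)
    random.Random(seed).shuffle(examples)  # seeded for reproducibility
    n_test = int(len(examples) * test_fraction)
    new_test, new_train = examples[:n_test], examples[n_test:]
    return new_train, new_test

# Example: 1000 training rows -> 900 train / 100 reconstructed test
new_train, new_test = make_split(range(1000))
print(len(new_train), len(new_test))  # 900 100
```

A real reconstruction would additionally verify, as the paper describes, that the competition's example submission scores similarly on the original and reconstructed test sets before accepting the split.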