MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
Authors: Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Aleksander Madry, Lilian Weng
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We establish human baselines for each competition using Kaggle's publicly available leaderboards. We use open-source agent scaffolds to evaluate several frontier language models on our benchmark, finding that the best-performing setup, OpenAI's o1-preview with AIDE scaffolding, achieves at least the level of a Kaggle bronze medal in 16.9% of competitions. In addition to our main results, we investigate various forms of resource-scaling for AI agents and the impact of contamination from pre-training. |
| Researcher Affiliation | Industry | Chan Jun Shern*, Neil Chowdhury*, Oliver Jaffe*, James Aung*, Dane Sherburn*, Evan Mays*, Giulio Starace*, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, Aleksander Madry. *Work done while at OpenAI. |
| Pseudocode | No | The paper describes methodologies and experimental setups but does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We open-source our benchmark code (github.com/openai/mle-bench/) to facilitate future research in understanding the ML engineering capabilities of AI agents. |
| Open Datasets | Yes | The resulting benchmark consists of 75 diverse Kaggle competitions across a variety of domains... For each competition, we use the original dataset if publicly available... kaggle.com/datasets/kaggle/meta-kaggle (accessed May 15th, 2024) |
| Dataset Splits | Yes | For each competition, we use the original dataset if publicly available, although Kaggle competitions often do not release the test set even after the competition ends. In such cases, we manually create new train and test splits based on the publicly available training data. We take care to ensure that the distributions of the original and reconstructed test sets are similar by checking that the example submission scores similarly on both sets. We take the new test set to be 10% of the original train set, except for when it didn't make sense to do so. |
| Hardware Specification | Yes | On each run, agents have access to a machine with 36 vCPUs, 440GB RAM, 4095 GiB SSD, and a single Nvidia A10 GPU. In our experiments, unless otherwise stated, each agent is executed within a Microsoft Azure Standard_NV36ads_A10_v5 virtual machine, which has 36 AMD EPYC 74F3V (Milan) [x86-64] vCPUs, 440GB memory, and one Nvidia A10 GPU (24GB). |
| Software Dependencies | No | In our experiments, we run agents in an Ubuntu 20.04 Docker container containing the dataset, validation server, and Python packages that might be helpful for ML engineering. |
| Experiment Setup | Yes | For each of the 75 competitions, agents have a maximum of 24 hours to produce a submission. On each run, agents have access to a machine with 36 vCPUs, 440GB RAM, 4095 GiB SSD, and a single Nvidia A10 GPU. We repeat all experiments with 3 seeds (that is, 3 runs per competition) to compute the mean and standard error unless otherwise specified. Full details of our execution environment and scaffolds can be found in Appendices A.5 and A.6. |
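The dataset-split procedure quoted in the table above (holding out 10% of the original training data as a reconstructed test set when Kaggle's test set is unavailable) can be sketched as follows. This is an illustrative assumption about the mechanics, not the authors' released preparation code; `make_split` and the seeded shuffle are hypothetical names, and MLE-bench's actual per-competition scripts are more involved (including the distribution check against example-submission scores).

```python
import random

def make_split(train_examples, test_fraction=0.1, seed=0):
    """Hold out a fraction of the original training data as a new test set.

    Hypothetical sketch of the described procedure: deterministically
    shuffle the training examples, then carve off `test_fraction` of
    them as the reconstructed test split.
    """
    examples = list(train_examples)
    random.Random(seed).shuffle(examples)  # seeded for reproducibility
    n_test = int(len(examples) * test_fraction)
    new_test, new_train = examples[:n_test], examples[n_test:]
    return new_train, new_test

# Example: 1000 training rows -> 900 train / 100 reconstructed test
new_train, new_test = make_split(range(1000))
print(len(new_train), len(new_test))  # 900 100
```

A real reconstruction would additionally verify, as the paper describes, that the competition's example submission scores similarly on the original and reconstructed test sets before accepting the split.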