Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

AMLB: an AutoML Benchmark

Authors: Pieter Gijsbers, Marcos L. P. Bueno, Stefan Coors, Erin LeDell, Sébastien Poirier, Janek Thomas, Bernd Bischl, Joaquin Vanschoren

JMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct a thorough comparison of 9 well-known AutoML frameworks across 71 classification and 33 regression tasks. The differences between the AutoML frameworks are explored with a multi-faceted analysis, evaluating model accuracy, its trade-offs with inference time, and framework failures.
Researcher Affiliation | Collaboration | 1 Eindhoven University of Technology, Eindhoven, The Netherlands 2 Ludwig Maximilian University of Munich, Munich, Germany 3 H2O.ai, Mountain View, CA, United States 4 Radboud University, Nijmegen, The Netherlands
Pseudocode | No | The paper describes the methodologies and architectures of various AutoML frameworks (Section 3) and the AMLB tool (Section 4) in prose, but does not present any formal pseudocode or algorithm blocks.
Open Source Code | Yes | To ensure reproducibility2, we provide an open-source benchmarking tool3 that allows easy integration with AutoML frameworks, and performs end-to-end evaluations thereof on carefully curated sets of open data sets. ... 3. Code, results, and documentation at: https://openml.github.io/automlbenchmark/
Open Datasets | Yes | To ensure reproducibility2, we provide an open-source benchmarking tool3 that allows easy integration with AutoML frameworks, and performs end-to-end evaluations thereof on carefully curated sets of open data sets. ... Freely available and hosted on OpenML. Data sets that can only be used on specific platforms or are not shared freely for any reasons are not included in the benchmark. ... Visit www.openml.org/s/269 for regression and www.openml.org/s/271 for classification.
Dataset Splits | Yes | An OpenML benchmark suite is a collection of OpenML tasks, which each reference a data set, an evaluation procedure (such as k-fold cross-validation) and its splits, the target feature, and the type of task (regression or classification). ... This means that within the 10-fold cross-validation we perform in our experiments, either 4 or 5 of those instances are available in the training splits.
Hardware Specification | Yes | As discussed in Section 4, AMLB can be run on any machine. However, for comparable hardware and easy expandability, we opt to conduct the benchmark on standard m5.2xlarge instances available on Amazon Web Services (AWS). These represent current commodity-level hardware with 32 GB memory and 8 vCPUs (Intel Xeon Platinum 8000 series Skylake-SP processor with a sustained all-core Turbo CPU clock speed of up to 3.1 GHz). 100 GB of gp3-SSD storage is available, which can be necessary for storing a larger number of evaluated pipelines.
Software Dependencies | Yes | We use the implementations provided by SCIKIT-LEARN 1.2.2. ... Table 16: Used AutoML framework versions in the experiments. framework 2021 2023 latest notes AUTOGLUON 0.3.1 0.8.0 1.0.0 ... AUTO-SKLEARN 0.14.0 0.15.0 0.15.0 ... FLAML 0.6.2 1.2.4 2.1.2 ...
Experiment Setup | Yes | AutoML frameworks are instantiated with their default configuration, except that we control the following settings: Mode to declare the user intent. ... Runtime for the search. ... Resource constraints that specify the number of CPU cores and amount of memory available. ... Target metric to use for optimization. This is the same metric that is used for evaluation in the benchmark.
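The 10-fold cross-validation protocol quoted under Dataset Splits can be illustrated with a minimal scikit-learn sketch. Note that the dataset, seed, and local split generation below are placeholders: in the benchmark itself, the splits are fixed by each OpenML task definition rather than generated on the fly.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy dataset standing in for an OpenML task's data (placeholder values).
X = np.arange(100).reshape(50, 2)

# 10-fold cross-validation as in the benchmark protocol; the real splits
# come from the OpenML task, not from a locally seeded KFold like this.
kf = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    # Each training split holds 9/10 of the instances, which is why a
    # class with only 5 instances contributes 4 or 5 of them to training.
    assert len(train_idx) == 45 and len(test_idx) == 5
```

Because every instance appears in exactly one test fold, metrics aggregated over the 10 folds cover the full data set once.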
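The controlled settings listed under Experiment Setup can be pictured as a small configuration mapping. The key names and values below are illustrative stand-ins, not AMLB's actual configuration schema:

```python
# Hypothetical run configuration mirroring the settings the benchmark
# controls; all key names here are illustrative, not AMLB's real schema.
run_config = {
    "mode": "accuracy",            # declares the user intent to the framework
    "max_runtime_seconds": 3600,   # time budget for the search
    "cores": 8,                    # CPU cores, matching the m5.2xlarge vCPUs
    "max_mem_size_mb": 32 * 1024,  # memory cap, matching the 32 GB instances
    "metric": "auc",               # optimized AND used for final evaluation
}
# Every other framework setting is left at its default configuration.
```

Using the same metric for optimization and evaluation (the last entry) is what makes the reported scores directly comparable across frameworks.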