Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

AMLB: an AutoML Benchmark

Authors: Pieter Gijsbers, Marcos L. P. Bueno, Stefan Coors, Erin LeDell, Sébastien Poirier, Janek Thomas, Bernd Bischl, Joaquin Vanschoren

JMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct a thorough comparison of 9 well-known AutoML frameworks across 71 classification and 33 regression tasks. The differences between the AutoML frameworks are explored with a multi-faceted analysis, evaluating model accuracy, its trade-offs with inference time, and framework failures.
Researcher Affiliation | Collaboration | 1 Eindhoven University of Technology, Eindhoven, The Netherlands 2 Ludwig Maximilian University of Munich, Munich, Germany 3 H2O.ai, Mountain View, CA, United States 4 Radboud University, Nijmegen, The Netherlands
Pseudocode | No | The paper describes the methodologies and architectures of various AutoML frameworks (Section 3) and the AMLB tool (Section 4) in prose, but does not present any formal pseudocode or algorithm blocks.
Open Source Code | Yes | To ensure reproducibility2, we provide an open-source benchmarking tool3 that allows easy integration with AutoML frameworks, and performs end-to-end evaluations thereof on carefully curated sets of open data sets. ... 3. Code, results, and documentation at: https://openml.github.io/automlbenchmark/
Open Datasets | Yes | To ensure reproducibility2, we provide an open-source benchmarking tool3 that allows easy integration with AutoML frameworks, and performs end-to-end evaluations thereof on carefully curated sets of open data sets. ... Freely available and hosted on OpenML. Data sets that can only be used on specific platforms or are not shared freely for any reasons are not included in the benchmark. ... Visit www.openml.org/s/269 for regression and www.openml.org/s/271 for classification.
Dataset Splits | Yes | An OpenML benchmark suite is a collection of OpenML tasks, which each reference a data set, an evaluation procedure (such as k-fold cross-validation) and its splits, the target feature, and the type of task (regression or classification). ... This means that within the 10-fold cross-validation we perform in our experiments, either 4 or 5 of those instances are available in the training splits.
Hardware Specification | Yes | As discussed in Section 4, AMLB can be run on any machine. However, for comparable hardware and easy expandability, we opt to conduct the benchmark on standard m5.2xlarge instances available on Amazon Web Services (AWS). These represent current commodity-level hardware with 32 GB memory and 8 vCPUs (Intel Xeon Platinum 8000 series Skylake-SP processor with a sustained all-core Turbo CPU clock speed of up to 3.1 GHz). 100 GB of gp3-SSD storage is available, which can be necessary for storing a larger number of evaluated pipelines.
Software Dependencies | Yes | We use the implementations provided by SCIKIT-LEARN 1.2.2. ... Table 16: Used AutoML framework versions in the experiments. framework 2021 2023 latest notes AUTOGLUON 0.3.1 0.8.0 1.0.0 ... AUTO-SKLEARN 0.14.0 0.15.0 0.15.0 ... FLAML 0.6.2 1.2.4 2.1.2 ...
Experiment Setup | Yes | AutoML frameworks are instantiated with their default configuration, except that we control the following settings: Mode to declare the user intent. ... Runtime for the search. ... Resource constraints that specify the number of CPU cores and amount of memory available. ... Target metric to use for optimization. This is the same metric that is used for evaluation in the benchmark.
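The 10-fold cross-validation protocol quoted under Dataset Splits can be illustrated with a minimal scikit-learn sketch. Note that the dataset, seed, and local split generation below are placeholders: in the benchmark itself, the splits are fixed by each OpenML task definition rather than generated on the fly.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy dataset standing in for an OpenML task's data (placeholder values).
X = np.arange(100).reshape(50, 2)

# 10-fold cross-validation as in the benchmark protocol; the real splits
# come from the OpenML task, not from a locally seeded KFold like this.
kf = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    # Each training split holds 9/10 of the instances, which is why a
    # class with only 5 instances contributes 4 or 5 of them to training.
    assert len(train_idx) == 45 and len(test_idx) == 5
```

Because every instance appears in exactly one test fold, metrics aggregated over the 10 folds cover the full data set once.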
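The controlled settings listed under Experiment Setup can be pictured as a small configuration mapping. The key names and values below are illustrative stand-ins, not AMLB's actual configuration schema:

```python
# Hypothetical run configuration mirroring the settings the benchmark
# controls; all key names here are illustrative, not AMLB's real schema.
run_config = {
    "mode": "accuracy",            # declares the user intent to the framework
    "max_runtime_seconds": 3600,   # time budget for the search
    "cores": 8,                    # CPU cores, matching the m5.2xlarge vCPUs
    "max_mem_size_mb": 32 * 1024,  # memory cap, matching the 32 GB instances
    "metric": "auc",               # optimized AND used for final evaluation
}
# Every other framework setting is left at its default configuration.
```

Using the same metric for optimization and evaluation (the last entry) is what makes the reported scores directly comparable across frameworks.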