Benchmark and Survey of Automated Machine Learning Frameworks
Authors: Marc-André Zöller, Marco F. Huber
JAIR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper is a combination of a survey of current AutoML methods and a benchmark of popular AutoML frameworks on real data sets. Driven by the frameworks selected for evaluation, we summarize and review important AutoML techniques and methods concerning every step of building an ML pipeline. The selected AutoML frameworks are evaluated on 137 data sets from established AutoML benchmark suites. |
| Researcher Affiliation | Collaboration | Marc-André Zöller, USU Software AG, Rüppurrer Str. 1, Karlsruhe, Germany; Marco F. Huber, Institute of Industrial Manufacturing and Management IFF, University of Stuttgart, Allmandring 25, Stuttgart, Germany & Fraunhofer Institute for Manufacturing Engineering and Automation IPA, Nobelstr. 12, Stuttgart, Germany |
| Pseudocode | No | The paper describes various algorithms and methods for AutoML, such as Sequential Model-based Optimization (SMBO), grid search, random search, evolutionary algorithms, and others, but it does not present them in explicit pseudocode blocks or clearly labeled algorithm sections. The descriptions are given in prose. |
| Open Source Code | Yes | Appendix A (Framework Source Code), Table 11 lists the repositories of all evaluated open-source AutoML tools. Some methods are still under active development and may differ significantly from the evaluated versions. The custom benchmark code used for the evaluation is available at https://github.com/Ennosigaeon/automl_benchmark |
| Open Datasets | Yes | All previously introduced methods for performance evaluation only consider selecting and tuning a modeling algorithm. Data cleaning and feature engineering are ignored completely, even though those two steps have a significant impact on the final performance of an ML pipeline (Chu et al., 2016). The only possibility to capture and evaluate all aspects of AutoML algorithms is using real data sets. However, real data sets also introduce a significant evaluation overhead, as multiple ML models have to be trained for each pipeline. Depending on the complexity and size of the data set, testing a single pipeline can require several hours of wall-clock time. In total, multiple months of CPU time were necessary to conduct all evaluations with real data sets presented in this benchmark. As explained in Section 2, the performance of an AutoML algorithm depends on the tested data set. Consequently, it is not useful to evaluate the performance on only a few data sets in detail; instead, the performance is evaluated on a wide range of different data sets. To ensure reproducibility of the results, only publicly available data sets from OpenML (Vanschoren et al., 2014), a collaborative platform for sharing data sets in a standardized format, have been selected. More specifically, a combination of the curated benchmarking suites OpenML100 (Bischl et al., 2017), OpenML-CC18 (Bischl et al., 2019) and AutoML Benchmark (Gijsbers et al., 2019) is used. The combination of these benchmarking suites contains 137 classification tasks with high-quality data sets having between 500 and 600,000 samples and fewer than 7,500 features. High quality does not imply that no preprocessing of the data is necessary; for example, some data sets contain missing values. A complete list of all data sets with some basic meta-features is provided in Appendix C. |
| Dataset Splits | Yes | The performance of each configuration is determined using a 4-fold cross-validation with three folds passed to the optimizer and the last fold used to calculate a test performance. |
| Hardware Specification | Yes | All experiments are conducted using n1-standard-8 virtual machines from Google Cloud Platform equipped with Intel Xeon E5 processors with 8 cores and 30 GB memory. |
| Software Dependencies | Yes | Each virtual machine uses Ubuntu 18.04.02, Python 3.6.7 and scikit-learn 0.21.3. |
| Experiment Setup | Yes | 1. Synthetic test functions (see Section 9.3) are limited to exactly 250 iterations. The performance is defined as the minimal absolute distance min_{λᵢ ∈ Λ} \|f(λᵢ) − f(λ*)\| between the considered configurations λᵢ and the global optimum λ*. 2. CASH solvers (see Section 9.5.1) are limited to exactly 325 iterations. Preliminary evaluations have shown that all algorithms basically always converge before hitting this iteration limit. The model fitting in each iteration is limited to a cut-off time of ten minutes; configurations violating this time limit are assigned the worst possible performance. The performance of each configuration is determined using a 4-fold cross-validation with three folds passed to the optimizer and the last fold used to calculate a test performance. As loss function, the accuracy L_Acc(ŷ, y) = (1/n) Σᵢ₌₁ⁿ 1(ŷᵢ = yᵢ) (Equation 6) is used, with 1 being the indicator function. 3. AutoML frameworks (see Section 9.5.2) are limited by a soft limit of 1 hour and a hard limit of 1.25 hours. Fitting of single configurations is aborted after ten minutes if the framework supports a cut-off time. The performance of each configuration is determined using a 4-fold cross-validation with three folds passed to the AutoML framework and the last fold used to calculate a test performance. As loss function, the accuracy in Equation (6) is used. Frameworks supporting parallelization are configured to use eight threads. Furthermore, frameworks supporting memory limits are configured to use at most 4096 MB memory per thread. |
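The synthetic-function metric from item 1 of the experiment setup, the minimal absolute distance to the global optimum over all tested configurations, can be illustrated with random search on a toy function. The function `f` and its optimum are illustrative stand-ins, not the paper's actual synthetic test functions.

```python
import random


def f(x):
    """Toy synthetic test function with known global optimum f(2) = 1."""
    return (x - 2.0) ** 2 + 1.0


random.seed(42)
optimum_value = 1.0  # f(lambda*) for this toy function
budget = 250         # iteration limit used in the benchmark

# Random search: sample 250 configurations uniformly from the search space.
evaluations = [f(random.uniform(-5.0, 5.0)) for _ in range(budget)]

# Performance = min over tested configurations of |f(lambda_i) - f(lambda*)|.
performance = min(abs(v - optimum_value) for v in evaluations)
print(f"distance to optimum after {budget} iterations: {performance:.6f}")
```

Lower is better; an optimizer that lands a configuration close to λ* drives this distance toward zero.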
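The evaluation protocol shared by items 2 and 3 (4-fold cross-validation with three folds given to the optimizer, the last fold held out for the test performance, and accuracy as in Equation 6) can be sketched as follows. The concrete data set and pipeline are placeholders; scikit-learn matches the software stack reported above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def accuracy(y_pred, y_true):
    """Equation (6): mean of the indicator 1(y_pred_i == y_true_i)."""
    return float(np.mean(y_pred == y_true))


def evaluate_configuration(make_model, X, y, seed=0):
    """4-fold CV as in the benchmark: three folds go to the optimizer
    (here simply used for training), the fourth yields the test performance."""
    folds = list(KFold(n_splits=4, shuffle=True, random_state=seed).split(X))
    train_idx, test_idx = folds[-1]  # last fold is held out as the test fold
    model = make_model()
    model.fit(X[train_idx], y[train_idx])
    return accuracy(model.predict(X[test_idx]), y[test_idx])


X, y = load_iris(return_X_y=True)
score = evaluate_configuration(
    lambda: make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0)),
    X, y,
)
print(f"test accuracy: {score:.3f}")
```

In the actual benchmark the optimizer sees only the three training folds and may cross-validate within them; the held-out fold is touched once, to report the final score.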