Large Language Models Synergize with Automated Machine Learning
Authors: Jinglue Xu, Jialong Li, Zhen Liu, NAV Suryanarayanan, Guoyuan Zhou, Jia Guo, Hitoshi Iba, Kenji Tei
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments across various ML tasks, our method outperforms existing methods in 10 out of 12 tasks for generating ML programs. In addition, AutoML significantly improves the performance of the generated ML programs. In experiments, given the textual task description, our method, Text-to-ML, generates the complete and optimized ML program in a fully autonomous process. |
| Researcher Affiliation | Academia | Jinglue Xu (University of Tokyo); Jialong Li (Tokyo Institute of Technology); Zhen Liu (University of Tokyo); Nagar Anthel Venkatesh Suryanarayanan (University of Tokyo); Guoyuan Zhou (Hosei University, Institute of Integrated Science and Technology); Jia Guo (Hosei University, Institute of Integrated Science and Technology); Hitoshi Iba (University of Tokyo); Kenji Tei (Tokyo Institute of Technology) |
| Pseudocode | Yes | Algorithm 1 Contextual Modular Generation Algorithm 2 Text-to-ML Algorithm 3 Constrained Generative Unit Testing |
| Open Source Code | Yes | The implementation of our method is available at https://github.com/JLX0/llm-automl. |
| Open Datasets | Yes | Boston: The Boston dataset is a widely used regression dataset that contains information about housing in the suburbs of Boston. It consists of 506 samples, each representing a suburb, with 13 features describing various aspects of the housing environment, such as crime rate, average number of rooms, and accessibility to highways. The target variable is the median value of owner-occupied homes in thousands of dollars. This dataset is often used for regression tasks to predict housing prices based on the given features. We obtain the files of the dataset from Altavish (2023). Iris: The Iris dataset is a classic classification dataset used to demonstrate various machine learning algorithms and techniques. It contains 150 samples of iris flowers from three different species: setosa, versicolor, and virginica. There are four features for each flower: sepal length, sepal width, petal length, and petal width. The goal is to classify the flowers into their respective species based on these features, making it a popular dataset for teaching and practicing classification algorithms. We obtain the files of the dataset from Learning (2023). CIFAR-10: The CIFAR-10 dataset is a well-known benchmark for image classification tasks. It consists of 60,000 32x32 color images across 10 different classes, with 6,000 images per class. The classes include objects like airplanes, cars, birds, cats, and more. The dataset is divided into a training set of 50,000 images and a test set of 10,000 images, making it suitable for evaluating the performance of various image classification algorithms. We obtain the files of the dataset from Krizhevsky et al. (2023). IMDb Reviews: The IMDB dataset is often used for sentiment analysis and text classification tasks. ... We obtain the files of the dataset from Lakshmi25npathi (2023). AG News: The AG News dataset is commonly used for text classification tasks, particularly for news categorization. ... We obtain the files of the dataset from Rai (2023). |
| Dataset Splits | Yes | CIFAR-10 The CIFAR-10 dataset is a well-known benchmark for image classification tasks. It consists of 60,000 32x32 color images across 10 different classes, with 6,000 images per class. The dataset is divided into a training set of 50,000 images and a test set of 10,000 images, making it suitable for evaluating the performance of various image classification algorithms. We obtain the files of the dataset from Krizhevsky et al. (2023). |
| Hardware Specification | Yes | For deep learning tasks, we train each model on 1 of 4 NVIDIA A100 GPUs. |
| Software Dependencies | No | In our experiments, we utilize the Python libraries PyTorch, PyTorch Lightning, and Transformers for deep learning tasks and Scikit-learn, XGBoost, CatBoost, and LightGBM for tasks involving tabular data. |
| Experiment Setup | Yes | For BOHB, we use the default hyperparameters in Falkner et al. (2018) and set 30 epochs as the maximum budget for deep learning tasks. ... Table 5 (finetuning strategy search space): Batch size: int, log scale, [2, 64]; Learning rate: float, log scale, [1e-5, 1e-1]; Weight decay: float, log scale, [1e-4, 1e-1]; Momentum: float, [0.01, 0.99]; Optimizer: categorical, {SGD, Adam, AdamW}; Scheduler: categorical, {plateau, cosine} |
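The Table 5 search space quoted above can be sketched in plain Python for illustration. This is a minimal, hypothetical rendering (the dictionary layout, `SEARCH_SPACE`, and `sample_config` are assumptions, not the paper's code): the authors run BOHB (Falkner et al., 2018), which performs model-based multi-fidelity search rather than the simple random sampling shown here.

```python
import math
import random

# Hypothetical encoding of the Table 5 finetuning search space.
# "log" marks hyperparameters sampled log-uniformly, as in the paper's table.
SEARCH_SPACE = {
    "batch_size":    {"type": "int",   "scale": "log",    "range": (2, 64)},
    "learning_rate": {"type": "float", "scale": "log",    "range": (1e-5, 1e-1)},
    "weight_decay":  {"type": "float", "scale": "log",    "range": (1e-4, 1e-1)},
    "momentum":      {"type": "float", "scale": "linear", "range": (0.01, 0.99)},
    "optimizer":     {"type": "cat",   "choices": ("SGD", "Adam", "AdamW")},
    "scheduler":     {"type": "cat",   "choices": ("plateau", "cosine")},
}

def sample_config(space, rng=random):
    """Draw one configuration, using log-uniform sampling where indicated."""
    config = {}
    for name, spec in space.items():
        if spec["type"] == "cat":
            config[name] = rng.choice(spec["choices"])
            continue
        lo, hi = spec["range"]
        if spec["scale"] == "log":
            # Sample uniformly in log space, then map back.
            value = math.exp(rng.uniform(math.log(lo), math.log(hi)))
        else:
            value = rng.uniform(lo, hi)
        config[name] = round(value) if spec["type"] == "int" else value
    return config
```

In a real BOHB run this space would instead be declared as a `ConfigSpace.ConfigurationSpace` and passed to the optimizer, which allocates budgets (here capped at 30 epochs) across sampled configurations.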