LLM-Select: Feature Selection with Large Language Models

Authors: Daniel P. Jeong, Zachary C. Lipton, Pradeep Ravikumar

TMLR 2025

Reproducibility assessment (Variable — Result, with the supporting LLM response quoted below each entry):
Research Type — Experimental
"We illustrate these findings through extensive experiments on real-world data, where we show that LLM-based feature selection consistently achieves strong performance competitive with data-driven methods such as the LASSO, despite never having looked at the downstream training data. Our findings suggest that LLMs may be useful not only for selecting the best features for training but also for deciding which features to collect in the first place."
Researcher Affiliation — Collaboration
Daniel P. Jeong¹, Zachary C. Lipton¹,², Pradeep Ravikumar¹ — ¹Machine Learning Department, Carnegie Mellon University; ²Abridge AI.
Pseudocode — No
The paper describes the proposed methods (LLM-Score, LLM-Rank, LLM-Seq) using mathematical formulations (Equations 1, 2, 3) and descriptive text, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code — Yes
"Source code. To ensure the reproducibility of our results, we open-source the source code used for all of our evaluations detailed below via our GitHub repository: https://github.com/taekb/llm-select"
Open Datasets — Yes
"We use seven binary classification datasets (Credit-G, Bank, Give Me Some Credit, COMPAS Recidivism, Pima Indians Diabetes, AUS Cars*, YouTube*) and seven regression datasets (CA Housing, Diabetes Progression, Wine Quality, Miami Housing, Used Cars, NBA*, NYC Rideshare*)... We construct supersets of the Income, Employment, Public Coverage, and Mobility datasets from folktables (Ding et al., 2021)... We also manually extract three datasets from the MIMIC-IV database (Johnson et al., 2023)... MIMIC-IV (Johnson et al., 2023) is an open-access database..."
Dataset Splits — Yes
"For all datasets, we randomly shuffle and take an 80/20 train/test split. We then take a 5-fold split of the training set for cross-validation, where the 5-fold splits vary across the random seeds (= [1, 2, 3, 4, 5]) used throughout the experiments. The test set remains fixed and does not vary with the random seed used. For classification datasets, we always take a stratified split to preserve the label proportions across the train, validation, and test sets. For each dataset, we randomly shuffle and take a 64-16-20 train-validation-test split, where the train-validation splits vary across the 5 random seeds (= [1, 2, 3, 4, 5]) used in the experiments, and the test set remains fixed regardless of the seed."
Note that the two quoted passages describe the same protocol: 5-fold cross-validation on the 80% training split yields 64/16/20 train/validation/test proportions per fold.
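The split protocol quoted above can be sketched as follows. This is a minimal illustration using scikit-learn on synthetic placeholder data; the array names and sizes are assumptions for illustration, not taken from the paper.

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold

# Placeholder dataset: 1000 samples, 10 features, binary labels.
rng_data = np.random.default_rng(0)
X = rng_data.normal(size=(1000, 10))
y = rng_data.integers(0, 2, size=1000)

# Fixed 80/20 train/test split, stratified for classification;
# the test set does not vary with the experiment seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 5-fold cross-validation on the training set; the fold assignment
# varies with each experiment seed in [1, 2, 3, 4, 5], so each fold's
# validation set is 16% of the full data (64/16/20 overall).
for seed in [1, 2, 3, 4, 5]:
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    for train_idx, val_idx in cv.split(X_train, y_train):
        pass  # fit and evaluate a model on this fold here
```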
Hardware Specification — No
The paper specifies the LLMs used (GPT-4, GPT-3.5, Llama-2 variants) and mentions using their APIs or Hugging Face checkpoints with the vLLM framework, but it does not provide specific details about the underlying hardware (e.g., GPU models, CPU types, memory) used to run these models or experiments.
Software Dependencies — No
The paper mentions using the OpenAI API models (gpt-4-0613, gpt-3.5-turbo), Hugging Face checkpoints for Llama-2, the vLLM framework, scikit-learn, LightGBM, various optimizers (L-BFGS, Adam, SAGA), and the LangChain API. However, it does not provide specific version numbers for any of these software components, which is necessary for reproducibility.
Experiment Setup — Yes
"For logistic regression, we minimize the negative log-likelihood using the L-BFGS optimizer (Zhu et al., 1997) and use importance weighting to balance the weights of the positive and negative samples... For classification tasks, we minimize the binary cross-entropy loss using the Adam optimizer (Kingma & Ba, 2015) with the default learning rate of 10^-3 and the default momentum hyperparameters of β1 = 0.9, β2 = 0.999, and ϵ = 10^-8... For LassoNet and the LASSO, we first compute the regularization paths with warm starts (Friedman et al., 2010)... For forward/backward sequential selection, we greedily add/remove a new feature at each iteration based on the 5-fold cross-validation performance..."
Hyperparameter search space for LightGBM — weak learner: gradient-boosted decision tree; maximum number of weak learners: 50; maximum number of leaves: Discrete({20, 21, ..., 60}).
Hyperparameter search space for MLP — number of hidden units: Discrete({200, 201, ..., 500}); number of hidden layers: Discrete({2, 3, 4}); dropout probability: Uniform(0, 0.5); batch size: Discrete({256, 512, 1024}); learning rate: LogUniform(10^-4, 10^-2); maximum number of epochs: 15.
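As a quick illustration, the quoted hyperparameter search spaces can be written as simple samplers. This is a sketch using only Python's standard library; the function and key names are my own, not from the authors' code.

```python
import random

def sample_lightgbm_config(rng):
    """Draw one LightGBM configuration from the search space quoted above."""
    return {
        "boosting_type": "gbdt",            # weak learner: gradient-boosted decision tree
        "n_estimators": 50,                 # maximum number of weak learners
        "num_leaves": rng.randint(20, 60),  # Discrete({20, 21, ..., 60}), inclusive
    }

def sample_mlp_config(rng):
    """Draw one MLP configuration from the search space quoted above."""
    return {
        "hidden_units": rng.randint(200, 500),       # Discrete({200, ..., 500})
        "hidden_layers": rng.choice([2, 3, 4]),
        "dropout": rng.uniform(0.0, 0.5),            # Uniform(0, 0.5)
        "batch_size": rng.choice([256, 512, 1024]),
        "learning_rate": 10 ** rng.uniform(-4, -2),  # LogUniform(1e-4, 1e-2)
        "max_epochs": 15,
    }

rng = random.Random(1)
lgbm_cfg = sample_lightgbm_config(rng)
mlp_cfg = sample_mlp_config(rng)
```

Each draw would then be evaluated with cross-validation on the training folds, as described in the dataset-splits entry above.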