Understanding LLM Embeddings for Regression

Authors: Eric Tang, Bangding Yang, Xingyou Song

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we provide one of the first comprehensive investigations into embedding-based regression and demonstrate that LLM embeddings as features can be better for high-dimensional regression tasks than using traditional feature engineering. This paper investigates the behavior of these LLM embeddings when used as features for standard tabular regression tasks. Most notably, our findings are: LLM embeddings are dimensionally robust, i.e. regression performance can remain strong even over high-dimensional data, whereas traditional representations significantly suffer.
Researcher Affiliation | Collaboration | Eric Tang (EMAIL), Stanford University & Google DeepMind Academy Program; Bangding Yang (EMAIL), Google; Xingyou Song (EMAIL), Google DeepMind
Pseudocode | No | The paper describes the methods and procedures in narrative text and figures but does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions public codebases for some of the datasets/benchmarks used (e.g., Init2Winit, OpenXLA), but does not provide specific access information or an explicit statement that the authors will release source code for the methodology described in this paper. For example: "Public codebase can be found in https://github.com/google/init2winit." This refers to a dataset's codebase, not the authors' implementation.
Open Datasets | Yes | We use 23 functions defined from the standard Black-Box Optimization Benchmarking (BBOB) suite (Elhara et al., 2019), supporting continuous inputs of any dimension.
AutoML (Google Cloud, 2023): Automated Machine Learning service for TensorFlow Extended (Google, 2023) pipelines (e.g. batch size, activation, layer counts) over tabular or text data.
Init2Winit (Dahl et al., 2023): Learning rate scheduling parameters influencing common image classification tasks (e.g. ResNets on CIFAR-10 and ImageNet).
XLA (Phothilimthana et al., 2021): Tuning for the Accelerated Linear Algebra (XLA) compiler, which affects LLM serving latencies.
L2DA (Yazdanbakhsh et al., 2021): "Learning to Design Accelerators", for improving accelerators such as TPUs and corresponding computer architectures to improve hardware performance.
Dataset Splits | Yes | We may either sample (x, y) pairs (in the case of synthetic objectives, where x are uniformly sampled from X) or use the given offline data (in the case of real-world tasks, where they were actual evaluations from an optimization trajectory), using a standard 8-1-1 train-validation-test split. Each task's data consists of 500 (x, y) evaluations sampled uniformly across the input space, using an 8-1-1 split for train-validation-test.
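The 8-1-1 split quoted above can be sketched as follows. This is a minimal illustration only: the helper name, seed handling, and use of a random permutation are our assumptions, since the paper states just the ratio.

```python
import numpy as np

def split_811(n, seed=0):
    """Hypothetical 8-1-1 train/validation/test split over n examples.

    Returns three disjoint index arrays covering range(n)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)          # shuffle example indices
    n_train = int(0.8 * n)            # 80% train
    n_val = int(0.1 * n)              # 10% validation, remainder test
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# For the 500 evaluations per task described above:
train_idx, val_idx, test_idx = split_811(500)
```

With n = 500 this yields 400 training, 50 validation, and 50 test indices.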
Hardware Specification | No | LLM inference almost always requires accelerator usage, making them more expensive if needed in serious regression tasks. However, compute costs for inference are orders of magnitude cheaper than for training, and typically only require a few GPUs or TPUs, making embedding-based regression still very feasible for most academic labs or industries. The paper only mentions 'GPUs or TPUs' in general terms without specifying any particular models, versions, or configurations.
Software Dependencies | No | The paper mentions software components like "MLP prediction head", "XGBoost", "T5", and "Gemini" but does not provide specific version numbers for any of these or other software libraries/dependencies.
Experiment Setup | Yes |
Regression Head: MLP with 2 ReLU hidden layers of dimension 256.
Input Normalization: We linearly scale each coordinate in ϕ to [−1, 1], using its minimum and maximum observed values as the original endpoints.
y-Normalization: We compute the empirical mean µ and standard deviation σ over all y-values in the task's training data, and apply y ← (y − µ)/σ as a preprocessing step.
Optimizer: AdamW, with learning rates swept across {1e-4, 5e-4, 1e-3, 5e-3, 1e-2} and weight decay across {0, 1e-1, 1}.
Loss: Mean Squared Error.
Maximum Epochs: 300, with early stopping enabled.
Our XGBoost uses the same input normalization method as the MLP. We additionally grid-searched over the following parameters for each task:
"min_child_weight": [1, 5, 10]
"learning_rate": [0.001, 0.01, 0.1]
"gamma": [0.0, 0.3, 0.5]
"subsample": [0.6, 0.8, 1.0]
"colsample_bytree": [0.6, 0.8, 1.0]
"max_depth": [3, 5, 7]
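The two normalization steps in the setup above can be sketched as follows. Function names are ours, and the small epsilon guards against constant coordinates are an assumption not stated in the paper.

```python
import numpy as np

def normalize_inputs(phi_train, phi):
    """Linearly scale each embedding coordinate to [-1, 1], using the
    minimum and maximum observed in the training split as endpoints."""
    lo = phi_train.min(axis=0)
    hi = phi_train.max(axis=0)
    # epsilon guard for constant coordinates (our assumption, not the paper's)
    return 2.0 * (phi - lo) / np.maximum(hi - lo, 1e-12) - 1.0

def normalize_targets(y_train, y):
    """Apply y <- (y - mu) / sigma, with mu and sigma computed over the
    task's training targets."""
    mu, sigma = y_train.mean(), y_train.std()
    return (y - mu) / max(sigma, 1e-12)
```

Fitting the statistics on the training split and reusing them for validation/test keeps the preprocessing consistent with the 8-1-1 split described earlier.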