OmniPred: Language Models as Universal Regressors
Authors: Xingyou Song, Oscar Li, Chansoo Lee, Bangding Yang, Daiyi Peng, Sagi Perel, Yutian Chen
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | our extensive experiments demonstrate that language models are capable of very precise numerical regression using only textual representations of mathematical parameters and values, and if given the opportunity to train at scale over multiple tasks, can significantly outperform traditional regression models. |
| Researcher Affiliation | Collaboration | Xingyou Song1, Oscar Li2, Chansoo Lee1, Bangding (Jeffrey) Yang3, Daiyi Peng1, Sagi Perel1, Yutian Chen1 — 1Google DeepMind, 2Carnegie Mellon University, 3Google. Equal Contribution. Work performed as a student researcher at Google DeepMind. |
| Pseudocode | No | The paper describes the methodology in text and mathematical formulations (e.g., Equation 1) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | We do not release any of the trained checkpoints, as it may be possible to reverse-engineer parts of the training data, which can lead to privacy violations and data leakage. The paper mentions external open-source tools used, like T5X, Open Source Vizier, and Init2Winit, but does not provide specific source code for the OmniPred methodology itself. |
| Open Datasets | Yes | BBOB (Shifted): For precise controlled experiments where we can generate synthetic datasets and perform online evaluations, we create a multi-task version of the BBOB benchmark (ElHara et al., 2019) containing 24 different synthetic functions |
| Dataset Splits | Yes | (3) deciding on a fixed train/validation/test splitting ratio (default 0.8/0.1/0.1) |
| Hardware Specification | Yes | The model (~200M parameters) was pretrained using a 4x4 TPU V3. ... we used a single 1x1 TPU V3. |
| Software Dependencies | No | The paper mentions several software components like T5X (Raffel et al., 2020), the SentencePiece tokenizer (Kudo & Richardson, 2018), and XGBoost (Chen & Guestrin, 2016), but does not provide specific version numbers for these tools as used in their experiments. |
| Experiment Setup | Yes | Optimizer: Adafactor with base learning rate 0.01 and square root decay. Batch size 256. ... We use the same settings from pretraining for consistency, but allow a maximum of 30 epochs. ... Single-task training: ...larger constant learning rate of 1e-3... Finetuning: ...smaller fixed learning rate of 1e-5... We restrict the logits to only decode the custom floating point tokens for representing y-values. To maximize batch size for a 1x1 TPU V3, we generate 64 samples and select the empirical median of these floating point samples as our final prediction when computing prediction error. |
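The Dataset Splits row quotes a default 0.8/0.1/0.1 train/validation/test ratio. A minimal sketch of such a split, assuming a simple shuffled partition (the function name and seed handling here are illustrative, not taken from the paper):

```python
import random

def split_examples(examples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle and partition examples into train/validation/test
    using the default 0.8/0.1/0.1 ratio quoted in the report.
    Hypothetical helper: the paper does not specify this exact code."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = split_examples(range(100))
# 80 / 10 / 10 examples
```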
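The Experiment Setup row describes drawing 64 decoded floating-point samples and taking their empirical median as the final prediction. A minimal sketch of that aggregation step, with `sample_fn` standing in for one constrained decode from the language model (a hypothetical placeholder, not the paper's actual decoder):

```python
import random
import statistics

def predict_by_median(sample_fn, num_samples=64):
    """Draw several candidate y-value samples and return their
    empirical median, mirroring the aggregation described in the
    quoted setup. `sample_fn` is a stand-in for a model decode."""
    samples = [sample_fn() for _ in range(num_samples)]
    return statistics.median(samples)

# Toy stand-in sampler: noisy draws around a "true" value of 3.2.
random.seed(0)
prediction = predict_by_median(lambda: 3.2 + random.gauss(0.0, 0.1))
```

The median is a natural choice here because it is robust to the occasional badly decoded outlier sample, unlike the mean.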