Adaptive Data Optimization: Dynamic Sample Selection with Scaling Laws

Authors: Yiding Jiang, Allan Zhou, Zhili Feng, Sadhika Malladi, Zico Kolter

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments demonstrate that ADO can achieve comparable or better performance than prior methods while maintaining computational efficiency across different computation scales, offering a practical solution for dynamically adjusting data distribution without sacrificing flexibility or increasing costs."
Researcher Affiliation | Academia | "Carnegie Mellon University, Stanford University, Princeton University; EMAIL, EMAIL"
Pseudocode | Yes | "Algorithm 1 Adaptive Data Optimization (ADO). 1: Input: prior µ_K, update interval t_update, warmup duration t_warmup, γ_1, γ_2, s, δ_min"
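The hyperparameters quoted from Algorithm 1 can be made concrete with a minimal sketch of the outer loop. This is an illustration, not the paper's implementation: the function name, the count-based re-weighting, and the smoothing toward the prior are placeholders (ADO derives its sampling distribution from fitted scaling laws). Only the warmup-then-periodic-update cadence mirrors the quoted inputs.

```python
import numpy as np

def ado_outer_loop(num_domains, total_steps, t_warmup=5_000,
                   t_update=1_000, seed=0):
    """Hypothetical sketch of ADO's outer loop (Algorithm 1 shape only).

    Samples from the uniform prior mu_K during warmup, then every
    t_update steps recomputes the sampling distribution. The update
    rule below is a stand-in for the paper's scaling-law-based one.
    """
    rng = np.random.default_rng(seed)
    mu_K = np.full(num_domains, 1.0 / num_domains)  # uniform prior over domains
    pi = mu_K.copy()                                # current sampling distribution
    history = []
    for t in range(total_steps):
        domain = int(rng.choice(num_domains, p=pi))  # draw next batch's domain
        history.append(domain)
        if t >= t_warmup and t % t_update == 0:
            # Placeholder for: re-fit per-domain scaling laws and derive
            # preference weights; here we just use empirical frequencies.
            counts = np.bincount(history, minlength=num_domains)
            pi = counts / counts.sum()
            pi = 0.5 * pi + 0.5 * mu_K               # smooth toward the prior
    return pi
```

The warmup phase matches the quoted setup in which an empirical sampling strategy is used for the first 5,000 steps before any scaling laws are fit.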
Open Source Code | Yes | "The code for this work is available at https://github.com/yidingjiang/ado."
Open Datasets | Yes | "We conduct all our experiments on the Pile dataset (Gao et al., 2020) with decoder-only transformer language models (Vaswani et al., 2017) of varying sizes. We consider two types of metrics: 1. validation loss on the Pile, an unweighted version of the Pile validation set where each domain receives equal probability, SlimPajama (Soboleva et al., 2023), and a 1 billion token subset of FineWeb (Penedo et al., 2024)"
Dataset Splits | Yes | "We conduct all our experiments on the Pile dataset (Gao et al., 2020)... We consider two types of metrics: 1. validation loss on the Pile, an unweighted version of the Pile validation set... SlimPajama (Soboleva et al., 2023), and a 1 billion token subset of FineWeb (Penedo et al., 2024), 2. zero-shot downstream performance on 6 common-sense reasoning domains from the language model evaluation harness (Gao et al., 2024)..."
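The "unweighted" validation set described above, where each domain receives equal probability regardless of its size, can be sketched with a small sampler. The function name and data layout are illustrative assumptions, not the paper's code; the only point shown is uniform sampling over domains rather than over examples.

```python
import random

def unweighted_validation_sampler(domain_examples, num_samples, seed=0):
    """Draw num_samples examples so every domain is equally likely.

    domain_examples: dict mapping domain name -> list of examples.
    A small domain contributes as often as a large one, unlike
    example-level uniform sampling, which would favor large domains.
    """
    rng = random.Random(seed)
    domains = list(domain_examples)
    samples = []
    for _ in range(num_samples):
        d = rng.choice(domains)                    # uniform over domains
        samples.append(rng.choice(domain_examples[d]))
    return samples
```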
Hardware Specification | Yes | "All experiments were run on Google Cloud TPUs. On a TPU v3-128, a 1B model can be trained for 60,000 steps in 3.5 days."
Software Dependencies | No | "We ran our experiments on TPUs using the open-source midGPT library (Zhou et al., 2023), which is based on JAX (Bradbury et al., 2018) and Equinox (Kidger & Garcia, 2021). All experiments were run on Google Cloud TPUs. On a TPU v3-128, a 1B model can be trained for 60,000 steps in 3.5 days. Fitting scaling laws for all domains takes less than 20 seconds; over the course of a training run, this amounts to under 19 minutes in additional time spent fitting scaling laws."
Experiment Setup | Yes | "Training. All models were trained for 60,000 steps at batch size 2048 (1B) or 256 (124M), using AdamW (Loshchilov, 2017). We use decoupled weight decay with λ = 10^-4, set β_2 = 0.05, and otherwise use default hyperparameters as specified by Optax (DeepMind et al., 2020). For ADO, we fit scaling laws for each domain every 1,000 training steps starting at step 5,000 (we run an empirical sampling strategy for the first 5,000 steps)."
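The per-domain scaling-law fit that runs every 1,000 steps can be sketched as follows. The parameterization L(t) ≈ E + C·t^(-α) and the grid-search-over-α fitting procedure are assumptions for illustration (ADO's exact form and fitter are not reproduced here); the point is that for a fixed α the remaining (E, C) problem is linear and solvable in closed form, which is consistent with the quoted cost of under 20 seconds per refit.

```python
import numpy as np

def fit_scaling_law(steps, losses, alphas=np.linspace(0.05, 1.0, 96)):
    """Fit L(t) = E + C * t**(-alpha) to one domain's loss curve.

    For each candidate alpha, solve the linear least-squares problem
    for (E, C) in closed form and keep the best-fitting triple.
    """
    steps = np.asarray(steps, dtype=float)
    losses = np.asarray(losses, dtype=float)
    best = None
    for alpha in alphas:
        # Design matrix: [1, t^-alpha]; columns weight E and C.
        X = np.stack([np.ones_like(steps), steps ** -alpha], axis=1)
        coef, *_ = np.linalg.lstsq(X, losses, rcond=None)
        err = float(np.sum((X @ coef - losses) ** 2))
        if best is None or err < best[0]:
            best = (err, coef[0], coef[1], alpha)
    return best[1], best[2], best[3]  # E, C, alpha
```

On a noiseless synthetic curve with E = 2, C = 3, α = 0.5, the fit recovers the generating parameters; in practice one would fit to the smoothed per-domain training losses logged up to the current step.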