Adaptive Data Optimization: Dynamic Sample Selection with Scaling Laws

Authors: Yiding Jiang, Allan Zhou, Zhili Feng, Sadhika Malladi, Zico Kolter

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments demonstrate that ADO can achieve comparable or better performance than prior methods while maintaining computational efficiency across different computation scales, offering a practical solution for dynamically adjusting data distribution without sacrificing flexibility or increasing costs."
Researcher Affiliation | Academia | "Carnegie Mellon University, Stanford University, Princeton University; EMAIL, EMAIL"
Pseudocode | Yes | "Algorithm 1 Adaptive Data Optimization (ADO). 1: Input: prior µ_K, update interval t_update, warmup duration t_warmup, γ_1, γ_2, s, δ_min"
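The hyperparameters quoted from Algorithm 1 can be made concrete with a minimal sketch of the outer loop. This is an illustration, not the paper's implementation: the function name, the count-based re-weighting, and the smoothing toward the prior are placeholders (ADO derives its sampling distribution from fitted scaling laws). Only the warmup-then-periodic-update cadence mirrors the quoted inputs.

```python
import numpy as np

def ado_outer_loop(num_domains, total_steps, t_warmup=5_000,
                   t_update=1_000, seed=0):
    """Hypothetical sketch of ADO's outer loop (Algorithm 1 shape only).

    Samples from the uniform prior mu_K during warmup, then every
    t_update steps recomputes the sampling distribution. The update
    rule below is a stand-in for the paper's scaling-law-based one.
    """
    rng = np.random.default_rng(seed)
    mu_K = np.full(num_domains, 1.0 / num_domains)  # uniform prior over domains
    pi = mu_K.copy()                                # current sampling distribution
    history = []
    for t in range(total_steps):
        domain = int(rng.choice(num_domains, p=pi))  # draw next batch's domain
        history.append(domain)
        if t >= t_warmup and t % t_update == 0:
            # Placeholder for: re-fit per-domain scaling laws and derive
            # preference weights; here we just use empirical frequencies.
            counts = np.bincount(history, minlength=num_domains)
            pi = counts / counts.sum()
            pi = 0.5 * pi + 0.5 * mu_K               # smooth toward the prior
    return pi
```

The warmup phase matches the quoted setup in which an empirical sampling strategy is used for the first 5,000 steps before any scaling laws are fit.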
Open Source Code | Yes | "The code for this work is available at https://github.com/yidingjiang/ado."
Open Datasets | Yes | "We conduct all our experiments on the Pile dataset (Gao et al., 2020) with decoder-only transformer language models (Vaswani et al., 2017) of varying sizes. We consider two types of metrics: 1. validation loss on the Pile, an unweighted version of the Pile validation set where each domain receives equal probability, SlimPajama (Soboleva et al., 2023), and a 1 billion token subset of FineWeb (Penedo et al., 2024)"
Dataset Splits | Yes | "We conduct all our experiments on the Pile dataset (Gao et al., 2020)... We consider two types of metrics: 1. validation loss on the Pile, an unweighted version of the Pile validation set... SlimPajama (Soboleva et al., 2023), and a 1 billion token subset of FineWeb (Penedo et al., 2024), 2. zero-shot downstream performance on 6 common-sense reasoning domains from the language model evaluation harness (Gao et al., 2024)..."
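The "unweighted" validation set described above, where each domain receives equal probability regardless of its size, can be sketched with a small sampler. The function name and data layout are illustrative assumptions, not the paper's code; the only point shown is uniform sampling over domains rather than over examples.

```python
import random

def unweighted_validation_sampler(domain_examples, num_samples, seed=0):
    """Draw num_samples examples so every domain is equally likely.

    domain_examples: dict mapping domain name -> list of examples.
    A small domain contributes as often as a large one, unlike
    example-level uniform sampling, which would favor large domains.
    """
    rng = random.Random(seed)
    domains = list(domain_examples)
    samples = []
    for _ in range(num_samples):
        d = rng.choice(domains)                    # uniform over domains
        samples.append(rng.choice(domain_examples[d]))
    return samples
```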
Hardware Specification | Yes | "All experiments were run on Google Cloud TPUs. On a TPU v3-128, a 1B model can be trained for 60,000 steps in 3.5 days."
Software Dependencies | No | "We ran our experiments on TPUs using the open-source midGPT library (Zhou et al., 2023), which is based on JAX (Bradbury et al., 2018) and Equinox (Kidger & Garcia, 2021). All experiments were run on Google Cloud TPUs. On a TPU v3-128, a 1B model can be trained for 60,000 steps in 3.5 days. Fitting scaling laws for all domains takes less than 20 seconds; over the course of a training run, this amounts to under 19 minutes in additional time spent fitting scaling laws."
Experiment Setup | Yes | "Training. All models were trained for 60,000 steps at batch size 2048 (1B) or 256 (124M), using AdamW (Loshchilov, 2017). We use decoupled weight decay with λ = 10^-4, set β_2 = 0.05, and otherwise use default hyperparameters as specified by Optax (DeepMind et al., 2020). For ADO, we fit scaling laws for each domain every 1,000 training steps starting at step 5,000 (we run an empirical sampling strategy for the first 5,000 steps)."
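The per-domain scaling-law fit that runs every 1,000 steps can be sketched as follows. The parameterization L(t) ≈ E + C·t^(-α) and the grid-search-over-α fitting procedure are assumptions for illustration (ADO's exact form and fitter are not reproduced here); the point is that for a fixed α the remaining (E, C) problem is linear and solvable in closed form, which is consistent with the quoted cost of under 20 seconds per refit.

```python
import numpy as np

def fit_scaling_law(steps, losses, alphas=np.linspace(0.05, 1.0, 96)):
    """Fit L(t) = E + C * t**(-alpha) to one domain's loss curve.

    For each candidate alpha, solve the linear least-squares problem
    for (E, C) in closed form and keep the best-fitting triple.
    """
    steps = np.asarray(steps, dtype=float)
    losses = np.asarray(losses, dtype=float)
    best = None
    for alpha in alphas:
        # Design matrix: [1, t^-alpha]; columns weight E and C.
        X = np.stack([np.ones_like(steps), steps ** -alpha], axis=1)
        coef, *_ = np.linalg.lstsq(X, losses, rcond=None)
        err = float(np.sum((X @ coef - losses) ** 2))
        if best is None or err < best[0]:
            best = (err, coef[0], coef[1], alpha)
    return best[1], best[2], best[3]  # E, C, alpha
```

On a noiseless synthetic curve with E = 2, C = 3, α = 0.5, the fit recovers the generating parameters; in practice one would fit to the smoothed per-domain training losses logged up to the current step.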