Adaptive Data Optimization: Dynamic Sample Selection with Scaling Laws
Authors: Yiding Jiang, Allan Zhou, Zhili Feng, Sadhika Malladi, Zico Kolter
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that ADO can achieve comparable or better performance than prior methods while maintaining computational efficiency across different computation scales, offering a practical solution for dynamically adjusting data distribution without sacrificing flexibility or increasing costs. |
| Researcher Affiliation | Academia | Carnegie Mellon University; Stanford University; Princeton University. EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 Adaptive Data Optimization (ADO). 1: Input: prior µ, K, update interval t_update, warmup duration t_warmup, γ1, γ2, s, δ_min |
| Open Source Code | Yes | The code for this work is available at https://github.com/yidingjiang/ado. |
| Open Datasets | Yes | We conduct all our experiments on the Pile dataset (Gao et al., 2020) with decoder-only transformer language models (Vaswani et al., 2017) of varying sizes. We consider two types of metrics: 1. validation loss on the Pile, an unweighted version of the Pile validation set where each domain receives equal probability, SlimPajama (Soboleva et al., 2023), and a 1 billion token subset of FineWeb (Penedo et al., 2024) |
| Dataset Splits | Yes | We conduct all our experiments on the Pile dataset (Gao et al., 2020)... We consider two types of metrics: 1. validation loss on the Pile, an unweighted version of the Pile validation set... SlimPajama (Soboleva et al., 2023), and a 1 billion token subset of FineWeb (Penedo et al., 2024), 2. zero-shot downstream performance on 6 common-sense reasoning domains from the language model evaluation harness (Gao et al., 2024)... |
| Hardware Specification | Yes | All experiments were run on Google Cloud TPUs. On a TPU v3-128, a 1B model can be trained for 60,000 steps in 3.5 days. |
| Software Dependencies | No | We ran our experiments on TPUs using the open-source midGPT library (Zhou et al., 2023), which is based on JAX (Bradbury et al., 2018) and Equinox (Kidger & Garcia, 2021). All experiments were run on Google Cloud TPUs. On a TPU v3-128, a 1B model can be trained for 60,000 steps in 3.5 days. Fitting scaling laws for all domains takes less than 20 seconds; over the course of a training run, this amounts to under 19 minutes of additional time spent fitting scaling laws. |
| Experiment Setup | Yes | Training. All models were trained for 60,000 steps at batch size 2048 (1B) or 256 (124M), using AdamW (Loshchilov, 2017). We use decoupled weight decay with λ = 10⁻⁴, set β2 = 0.05, and otherwise use default hyperparameters as specified by Optax (DeepMind et al., 2020). For ADO, we fit scaling laws for each domain every 1,000 training steps starting at step 5,000 (we run an empirical sampling strategy for the first 5,000 steps). |
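The setup above describes ADO's core loop only at a high level: periodically fit a per-domain scaling law to each domain's loss history, then reweight the sampling distribution toward domains whose loss is still falling fastest. The sketch below illustrates that idea on synthetic data. It is a minimal illustration, not the authors' implementation (see the linked repository for that): the power-law form `a + b·n^(-alpha)`, the synthetic loss curves, and the slope-based reweighting rule are all assumptions chosen for clarity.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, alpha):
    # L(n) ≈ a + b * n^(-alpha): irreducible loss plus a power-law decay.
    return a + b * np.power(n, -alpha)

def fit_domain_scaling_law(steps, losses):
    """Fit one domain's loss history to the assumed power-law form."""
    params, _ = curve_fit(power_law, steps, losses,
                          p0=[1.0, 5.0, 0.5], maxfev=10000)
    return params  # (a, b, alpha)

def loss_slope(n, a, b, alpha):
    # Magnitude of dL/dn under the fitted law: b * alpha * n^(-alpha - 1).
    return b * alpha * np.power(n, -alpha - 1.0)

# Synthetic loss histories for two hypothetical data domains.
steps = np.arange(100.0, 5001.0, 100.0)
rng = np.random.default_rng(0)
loss_a = 2.0 + 4.0 * steps ** -0.4 + rng.normal(0.0, 0.002, steps.size)
loss_b = 1.5 + 3.0 * steps ** -0.7 + rng.normal(0.0, 0.002, steps.size)

params_a = fit_domain_scaling_law(steps, loss_a)
params_b = fit_domain_scaling_law(steps, loss_b)

# Reweight sampling toward the domain whose loss is improving fastest
# at the current step, normalized into a probability distribution.
n_now = steps[-1]
slopes = np.array([loss_slope(n_now, *params_a),
                   loss_slope(n_now, *params_b)])
weights = slopes / slopes.sum()
print(weights)
```

In a training run, this fit-and-reweight step would be repeated on a schedule (the paper's setup refits every 1,000 steps after a 5,000-step warmup), with the resulting `weights` used as the per-domain sampling probabilities for the next interval.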