Compute-Constrained Data Selection

Authors: Junjie Oscar Yin, Alexander Rush

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We run a comprehensive sweep of experiments across multiple tasks, varying compute budget by scaling finetuning tokens, model sizes, and data selection compute. Interestingly, we find that many powerful data selection methods are almost never compute-optimal, and that cheaper data selection alternatives dominate from both a theoretical and an empirical perspective. For compute-optimal training, we find that perplexity and gradient data selection require training-to-selection model size ratios of 5x and 10x, respectively. We train over 600 models, ranging from 7 to 70 billion parameters, across 6 data selection methods and 3 downstream tasks, recording final task performance for each.
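The reported 5x/10x training-to-selection size ratios can be read as a simple rule of thumb. A hedged sketch (the function name and interface are assumptions, not the paper's code; the thresholds are the paper's reported values):

```python
def selection_is_compute_optimal(train_params_b: float,
                                 selector_params_b: float,
                                 method: str) -> bool:
    """Return True if the training-to-selection model size ratio (in billions
    of parameters) meets the paper's reported threshold for the given method."""
    min_ratio = {"perplexity": 5.0, "gradient": 10.0}[method]
    return train_params_b / selector_params_b >= min_ratio
```

For example, scoring data with a 7B model to finetune a 70B model clears both thresholds, while using the same 7B model to finetune a 13B model clears neither.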
Researcher Affiliation | Academia | Junjie Oscar Yin, Johns Hopkins University, EMAIL; Alexander M. Rush, Cornell University, EMAIL
Pseudocode | No | The paper describes various data selection methods and their computational costs, as well as a parametric model for performance. However, it does not include any clearly labeled pseudocode or algorithm blocks for any of the methods or processes described.
Open Source Code | Yes | We hope that this framework and setting can motivate further research into cheaper data selection methods that can produce better models with less compute. Codebase and datasets to reproduce our results are available at https://github.com/oseyosey/CCDS.
Open Datasets | Yes | Datasets: We follow Wang et al. (2023) and curate a representative sample of instruction-tuned datasets as listed in Table 8. This includes: (1) datasets generated by researchers from existing NLP datasets, such as COT (Wei et al., 2022) and Flan V2 (Longpre et al., 2023); (2) datasets written by humans from scratch specifically for instruction tuning, including Dolly (Conover et al., 2023) and Open Assistant 1 (Köpf et al., 2024). For evaluation, we run on three challenging but distinct downstream tasks: the Massive Multitask Language Understanding dataset (MMLU, Hendrycks et al. (2020))...; Big-Bench-Hard (BBH, Suzgun et al. (2022))...; and Instruction Following Evaluation (IFEval, Zhou et al. (2023)).
Dataset Splits | Yes | The finetuning data budget is fixed as a percentage of the total finetuning tokens: {2.5, 5, 10, 25, 50, 100}%, across 3 target tasks. For MMLU, we report 5-shot accuracy; for BBH, we report 3-shot exact match score; and for IFEval, we report 0-shot accuracy. Each subtask comes with few-shot examples or sample responses, which are used as the validation set V for data selection and as few-shot in-context learning demonstrations in evaluation.
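Fixing the budget as a percentage of total finetuning tokens can be sketched as a greedy fill over method-scored examples. This is an illustrative assumption, not the paper's implementation:

```python
def select_within_budget(scored_examples, budget_pct, total_tokens):
    """Greedily fill a token budget with the highest-scoring examples.

    scored_examples: list of (score, n_tokens, example_id) tuples, where a
    higher score means the data selection method prefers the example.
    budget_pct: finetuning data budget as a percentage of total_tokens.
    """
    budget = total_tokens * budget_pct / 100.0
    chosen, used = [], 0
    for score, n_tok, ex_id in sorted(scored_examples, key=lambda t: -t[0]):
        if used + n_tok > budget:
            continue  # this example would overflow the token budget
        chosen.append(ex_id)
        used += n_tok
    return chosen
```

With a 50% budget over 100 total tokens, only examples fitting within 50 tokens are kept, in descending score order.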
Hardware Specification | No | The paper mentions "measured as single A100 GPU hours" in Table 6, but this refers to the asymptotic complexity and wall-clock runtime for LESS (Xia et al., 2024), which is a cited work, not necessarily the specific hardware used by the authors for their own experimental runs. No specific GPU models, CPU models, or detailed computer specifications are provided for the authors' experiments.
Software Dependencies | No | The paper mentions using the LoRA finetuning method, the AdamW optimizer, and BFloat16 precision. However, it does not provide specific version numbers for any software libraries (e.g., PyTorch, Hugging Face Transformers), programming languages (e.g., Python), or CUDA versions, which are necessary for a reproducible setup.
Experiment Setup | Yes | All experiments were conducted with the parameter-efficient finetuning method LoRA (Hu et al., 2021). For the LoRA adapter, we specified a rank of 128, an α value of 512, and a dropout rate of 0.1, and applied it across all attention matrices. We follow standard practices in LLM finetuning (Wang et al., 2023; Ivison et al., 2023) and use the AdamW optimizer with beta-parameters (β1, β2) = (0.9, 0.99). The learning rate is set to 2e-5 for the 7B/8B/13B models and 1e-5 for the 70B models. For data budgets {2.5%, 5%}, we double the learning rate to ensure convergence in loss. For all experiments, we use a warmup ratio of 0.03, BFloat16 precision, and an effective batch size of 128.
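The reported hyperparameters, including the size- and budget-dependent learning rate rule, can be collected in one place. A minimal sketch as a plain config dict; the function name and keys are illustrative, not from the paper's codebase:

```python
def finetune_config(model_size_b: int, data_budget_pct: float) -> dict:
    """Hyperparameters reported in the paper's experiment setup."""
    lr = 2e-5 if model_size_b < 70 else 1e-5   # 7B/8B/13B vs. 70B models
    if data_budget_pct in (2.5, 5):
        lr *= 2                                # doubled to ensure loss convergence
    return {
        "lora": {"rank": 128, "alpha": 512, "dropout": 0.1,
                 "target_modules": "all attention matrices"},
        "optimizer": "AdamW",
        "adam_betas": (0.9, 0.99),
        "learning_rate": lr,
        "warmup_ratio": 0.03,
        "precision": "bfloat16",
        "effective_batch_size": 128,
    }
```

For instance, a 70B model at a 2.5% budget would train at 2e-5 (the 1e-5 base rate, doubled).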