Data Selection via Optimal Control for Language Models

Authors: Yuxian Gu, Li Dong, Hongning Wang, Yaru Hao, Qingxiu Dong, Furu Wei, Minlie Huang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In our experiments, we adopt PDS to select data from Common Crawl and show that the PDS-selected corpus accelerates the learning of LMs and consistently boosts their performance on a wide range of downstream tasks across various model sizes. Moreover, the benefits of PDS extend to ~400B-parameter models trained on ~10T tokens, as evidenced by the extrapolation of the test loss curves according to the Scaling Laws. PDS also improves data utilization when the pre-training data is limited, by reducing the data demand by 1.8 times, which helps mitigate the quick exhaustion of available web-crawled corpora. Our code, model, and data can be found at https://github.com/microsoft/LMOps/tree/main/data_selection.
Researcher Affiliation Collaboration Yuxian Gu1,2, Li Dong2, Hongning Wang1, Yaru Hao2, Qingxiu Dong3, Furu Wei2, Minlie Huang1. 1The CoAI Group, Tsinghua University; 2Microsoft Research; 3Peking University.
Pseudocode Yes Algorithm 1 PMP-Solver. Input: LM learning rate η; outer-loop learning rate α; outer-loop epochs T_o; training data before selection D; downstream loss J(θ); training steps T; Proj[·], which projects a point in R^|D| onto U; model initialization θ_0. Output: data quality scores γ.
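The quoted inputs and outputs suggest a bilevel scheme: an inner loop trains the model on γ-weighted data, and an outer loop updates γ from the downstream loss J(θ) and projects it back onto the simplex U. Below is a minimal toy sketch on weighted least squares, not the paper's implementation: it stands in for the PMP costate with a first-order gradient-alignment surrogate, and all function names and hyper-parameter values here are illustrative assumptions.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex (Duchi et al., 2008)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + tau, 0.0)

def pmp_solver_toy(X, y, X_dev, y_dev, eta=0.05, alpha=0.5, T=30, outer_epochs=10):
    """Toy analogue of Algorithm 1 on weighted least squares.

    Inner loop: gradient descent on the gamma-weighted training loss.
    Outer loop: raise gamma_i when example i's gradient aligns with the
    downstream-loss gradient (a first-order surrogate for the PMP costate,
    an assumption of this sketch), then project gamma back onto U.
    """
    n, d = X.shape
    gamma = np.full(n, 1.0 / n)          # uniform initialization on the simplex
    for _ in range(outer_epochs):
        theta, checkpoints = np.zeros(d), []
        for _ in range(T):               # inner training loop
            checkpoints.append(theta.copy())
            theta = theta - eta * (X.T @ (gamma * (X @ theta - y)))
        align = np.zeros(n)              # surrogate for -dJ/dgamma (up to eta)
        for th in checkpoints:
            dev_grad = X_dev.T @ (X_dev @ th - y_dev) / len(y_dev)
            per_example = (X @ th - y)[:, None] * X   # per-example gradients
            align += per_example @ dev_grad
        gamma = project_simplex(gamma + alpha * eta * align)
    return gamma
```

On synthetic data where half the labels are corrupted, the returned γ concentrates its mass on the clean examples, mirroring the role of data quality scores.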
Open Source Code Yes Our code, model, and data can be found at https://github.com/microsoft/LMOps/tree/main/data_selection.
Open Datasets Yes We use the Common Crawl split from Redpajama (Together, 2023) as D to exclude the influence of domain weights (Xie et al., 2024). For the downstream loss J(θ), we compute the LM's loss on the training split of LIMA (Zhou et al., 2024), a high-quality dataset consisting of 1,030 diverse instruction-response pairs that cover a wide range of downstream scenarios. Our evaluation is conducted on various downstream datasets other than LIMA to avoid over-fitting. We evaluate the LMs' 0-shot accuracy on the downstream test datasets used in OLMo (Groeneveld et al., 2024) and their 0-shot performance on MMLU (Hendrycks et al., 2021). We also report the LM's language modeling loss on a subset of DCLM (Li et al., 2024), a high-quality corpus curated with complex pipelines and human heuristics, to verify that models trained on D retain diversity and long-tail knowledge coverage.
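The downstream loss J(θ) described above is simply the LM's average token-level loss on a held-out instruction set (LIMA in the paper). A minimal numpy sketch of a masked mean cross-entropy over response tokens, with toy shapes; this is an illustration, not the paper's code:

```python
import numpy as np

def downstream_loss(logits, targets, mask):
    """Masked mean token-level cross-entropy: a toy stand-in for J(theta).

    logits:  (batch, seq, vocab) unnormalized scores
    targets: (batch, seq) integer token ids
    mask:    (batch, seq) 1.0 on tokens that count toward the loss
             (e.g. response tokens only), 0.0 elsewhere
    """
    # numerically stable log-softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # negative log-likelihood of each target token
    nll = -np.take_along_axis(logp, targets[..., None], axis=-1).squeeze(-1)
    return (nll * mask).sum() / mask.sum()
```

With uniform logits over a vocabulary of size V, this evaluates to log V, which is a handy sanity check when wiring up such a loss.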
Dataset Splits No The paper mentions using a "Common Crawl split from Redpajama" as source data, sampling "160K instances uniformly sampled from D" for a proxy dataset, and evaluating on "downstream test datasets" (OLMo, MMLU) and a "subset of DCLM". It also mentions splitting 10% of LIMA for validation for a baseline comparison (RHO-Loss). However, it does not explicitly state the train/validation/test splits for the main pre-training data (the 50B tokens selected from the 125B CC corpus) used for their primary LM experiments.
Hardware Specification No The paper discusses 'GPU FLOPs' in Table 4 for complexity analysis, but does not provide specific details on the GPU models, CPU models, or any other hardware specifications used for running the experiments.
Software Dependencies No The paper mentions using "PyTorch (Paszke et al., 2019)", "AdamW (Loshchilov & Hutter, 2019)", and the "LM-evaluation-harness library (Gao et al., 2024)". However, it does not provide specific version numbers for these software components, which are required for reproducible ancillary software details.
Experiment Setup Yes PDS. To compute the data quality scores from PMP, we adopt a 160M proxy LM. D_prx consists of 160K instances uniformly sampled from D. We first pre-train the proxy LM on D for 50K steps and then select checkpoints at [10K, 20K, 30K, 40K, 50K] steps. Initialized from these checkpoints, the proxy LM undergoes inner loops with η = 0.008 over T_prx = 100 steps with a mini-batch size of 256. γ is updated for one outer-loop epoch with α = 1. For the data scorer, we fine-tune a 125M Fairseq-Dense model (Artetxe et al., 2022) along with the linear head, using the objective in Eq. (8). The training details for the data scorer can be found in Appendix G.2. For data selection, we set δ = 0.1, r = 0.4, with further hyper-parameter exploration provided in Appendix I.5. Pre-Training. We pre-train all LMs for 100K steps, using a batch size of 512 and a max input length of 1,024, resulting in roughly 50B tokens. In Section 3.2, we select a 50B-token dataset from a CC corpus containing 125B tokens to assess how different data selection methods improve LM learning given a sufficiently large D. In Section 3.3 (Data-Constrained Setting), we also analyze the effectiveness of PDS when D is limited in size. See Appendix G.3 for more pre-training details.
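As a quick sanity check on the stated pre-training budget, 100K steps at batch size 512 and max length 1,024 gives just over 52B tokens, consistent with the "roughly 50B tokens" figure (assuming every sequence is packed to the maximum length):

```python
# Token budget implied by the quoted pre-training configuration.
steps, batch_size, seq_len = 100_000, 512, 1024
tokens = steps * batch_size * seq_len
print(f"{tokens:,} tokens (~{tokens / 1e9:.1f}B)")
```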