Data Selection via Optimal Control for Language Models

Authors: Yuxian Gu, Li Dong, Hongning Wang, Yaru Hao, Qingxiu Dong, Furu Wei, Minlie Huang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In our experiments, we adopt PDS to select data from Common Crawl and show that the PDS-selected corpus accelerates the learning of LMs and consistently boosts their performance on a wide range of downstream tasks across various model sizes. Moreover, the benefits of PDS extend to ~400B-parameter models trained on ~10T tokens, as evidenced by the extrapolation of the test loss curves according to the Scaling Laws. PDS also improves data utilization when the pre-training data is limited, by reducing the data demand by 1.8 times, which helps mitigate the quick exhaustion of available web-crawled corpora. Our code, model, and data can be found at https://github.com/microsoft/LMOps/tree/main/data_selection.
Researcher Affiliation Collaboration Yuxian Gu1,2, Li Dong2, Hongning Wang1, Yaru Hao2, Qingxiu Dong3, Furu Wei2, Minlie Huang1. 1The CoAI Group, Tsinghua University; 2Microsoft Research; 3Peking University.
Pseudocode Yes Algorithm 1 PMP-Solver. Input: LM learning rate η; outer-loop learning rate α; outer-loop epochs T_o; training data before selection D; downstream loss J(θ); training steps T; Proj[·], which projects a point in R^|D| onto U; model initialization θ_0. Output: data quality scores γ.
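The quoted inputs and outputs suggest a bilevel scheme: an inner loop trains the model on γ-weighted data, and an outer loop updates γ from the downstream loss J(θ) and projects it back onto the simplex U. Below is a minimal toy sketch on weighted least squares, not the paper's implementation: it stands in for the PMP costate with a first-order gradient-alignment surrogate, and all function names and hyper-parameter values here are illustrative assumptions.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex (Duchi et al., 2008)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + tau, 0.0)

def pmp_solver_toy(X, y, X_dev, y_dev, eta=0.05, alpha=0.5, T=30, outer_epochs=10):
    """Toy analogue of Algorithm 1 on weighted least squares.

    Inner loop: gradient descent on the gamma-weighted training loss.
    Outer loop: raise gamma_i when example i's gradient aligns with the
    downstream-loss gradient (a first-order surrogate for the PMP costate,
    an assumption of this sketch), then project gamma back onto U.
    """
    n, d = X.shape
    gamma = np.full(n, 1.0 / n)          # uniform initialization on the simplex
    for _ in range(outer_epochs):
        theta, checkpoints = np.zeros(d), []
        for _ in range(T):               # inner training loop
            checkpoints.append(theta.copy())
            theta = theta - eta * (X.T @ (gamma * (X @ theta - y)))
        align = np.zeros(n)              # surrogate for -dJ/dgamma (up to eta)
        for th in checkpoints:
            dev_grad = X_dev.T @ (X_dev @ th - y_dev) / len(y_dev)
            per_example = (X @ th - y)[:, None] * X   # per-example gradients
            align += per_example @ dev_grad
        gamma = project_simplex(gamma + alpha * eta * align)
    return gamma
```

On synthetic data where half the labels are corrupted, the returned γ concentrates its mass on the clean examples, mirroring the role of data quality scores.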
Open Source Code Yes Our code, model, and data can be found at https://github.com/microsoft/LMOps/tree/main/data_selection.
Open Datasets Yes We use the Common Crawl split from Redpajama (Together, 2023) as D to exclude the influence of domain weights (Xie et al., 2024). For the downstream loss J(θ), we compute the LM's loss on the training split of LIMA (Zhou et al., 2024), a high-quality dataset consisting of 1,030 diverse instruction-response pairs that cover a wide range of downstream scenarios. Our evaluation is conducted on various downstream datasets other than LIMA to avoid over-fitting. We evaluate the LMs' 0-shot accuracy on the downstream test datasets used in OLMo (Groeneveld et al., 2024) and their 0-shot performance on MMLU (Hendrycks et al., 2021). We also report the LM's language modeling loss on a subset of DCLM (Li et al., 2024), a high-quality corpus curated with complex pipelines and human heuristics, to verify that models trained on D retain diversity and long-tail knowledge coverage.
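The downstream loss J(θ) described above is simply the LM's average token-level loss on a held-out instruction set (LIMA in the paper). A minimal numpy sketch of a masked mean cross-entropy over response tokens, with toy shapes; this is an illustration, not the paper's code:

```python
import numpy as np

def downstream_loss(logits, targets, mask):
    """Masked mean token-level cross-entropy: a toy stand-in for J(theta).

    logits:  (batch, seq, vocab) unnormalized scores
    targets: (batch, seq) integer token ids
    mask:    (batch, seq) 1.0 on tokens that count toward the loss
             (e.g. response tokens only), 0.0 elsewhere
    """
    # numerically stable log-softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # negative log-likelihood of each target token
    nll = -np.take_along_axis(logp, targets[..., None], axis=-1).squeeze(-1)
    return (nll * mask).sum() / mask.sum()
```

With uniform logits over a vocabulary of size V, this evaluates to log V, which is a handy sanity check when wiring up such a loss.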
Dataset Splits No The paper mentions using a "Common Crawl split from Redpajama" as source data, sampling "160K instances uniformly sampled from D" for a proxy dataset, and evaluating on "downstream test datasets" (OLMo, MMLU) and a "subset of DCLM". It also mentions splitting 10% of LIMA for validation for a baseline comparison (RHO-Loss). However, it does not explicitly state the train/validation/test splits for the main pre-training data (the 50B tokens selected from the 125B CC corpus) used for their primary LM experiments.
Hardware Specification No The paper discusses 'GPU FLOPs' in Table 4 for complexity analysis, but does not provide specific details on the GPU models, CPU models, or any other hardware specifications used for running the experiments.
Software Dependencies No The paper mentions using "PyTorch (Paszke et al., 2019)", "AdamW (Loshchilov & Hutter, 2019)", and the "LM-evaluation-harness library (Gao et al., 2024)". However, it does not provide specific version numbers for these software components, which are required for reproducible ancillary software details.
Experiment Setup Yes PDS. To compute the data quality scores from PMP, we adopt a 160M proxy LM. D_prx consists of 160K instances uniformly sampled from D. We first pre-train the proxy LM on D for 50K steps and then select checkpoints at [10K, 20K, 30K, 40K, 50K] steps. Initialized from these checkpoints, the proxy LM undergoes inner loops with η = 0.008 over T_prx = 100 steps with a mini-batch size of 256. γ is updated for one outer-loop epoch with α = 1. For the data scorer, we fine-tune a 125M Fairseq-Dense model (Artetxe et al., 2022) along with the linear head, using the objective in Eq. (8). The training details for the data scorer can be found in Appendix G.2. For data selection, we set δ = 0.1, r = 0.4, with further hyper-parameter exploration provided in Appendix I.5. Pre-Training. We pre-train all LMs for 100K steps, using a batch size of 512 and a max input length of 1,024, resulting in roughly 50B tokens. In Section 3.2, we select a 50B-token dataset from a CC corpus containing 125B tokens to assess how different data selection methods improve LM learning given a sufficiently large D. In Section 3.3 (Data-Constrained Setting), we also analyze the effectiveness of PDS when D is limited in size. See Appendix G.3 for more pre-training details.
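As a quick sanity check on the stated pre-training budget, 100K steps at batch size 512 and max length 1,024 gives just over 52B tokens, consistent with the "roughly 50B tokens" figure (assuming every sequence is packed to the maximum length):

```python
# Token budget implied by the quoted pre-training configuration.
steps, batch_size, seq_len = 100_000, 512, 1024
tokens = steps * batch_size * seq_len
print(f"{tokens:,} tokens (~{tokens / 1e9:.1f}B)")
```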