LLM Data Selection and Utilization via Dynamic Bi-level Optimization
Authors: Yang Yu, Kai Han, Hang Zhou, Yehui Tang, Kaiqi Huang, Yunhe Wang, Dacheng Tao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that DWM enhances the performance of models trained with randomly-selected data, and the learned weighting model can be transferred to enhance other data selection methods and models of different sizes. Moreover, we further analyze how a model's data preferences evolve throughout training, providing new insights into the data preference of the model during training. |
| Researcher Affiliation | Collaboration | 1School of Artificial Intelligence, University of Chinese Academy of Sciences 2The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences 3Huawei Noah's Ark Lab 4College of Intelligence and Computing, Tianjin University 5Nanyang Technological University. |
| Pseudocode | No | The paper describes the approach using prose and mathematical formulations (e.g., Equation 1, 2, 3), and provides a framework diagram in Figure 1. There are no explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for the methodology described, nor does it provide a direct link to a code repository for their work. It only references a third-party framework (lm-evaluation-harness) used in their evaluation. |
| Open Datasets | Yes | To evaluate different data selection methods, we utilize the training data selected from the popular dataset SlimPajama (Soboleva et al., 2023), which is the largest multi-corpora, open-source dataset for training large language models... URL https://huggingface.co/datasets/cerebras/SlimPajama-627B. ...we adopt LAMBADA (Paperno et al., 2016) as the validation set... |
| Dataset Splits | Yes | Following the principles of the scaling law and the QuRating (Wettig et al., 2024) framework, a total of 30 billion tokens are selected by different methods to train the model... we adopt LAMBADA (Paperno et al., 2016) as the validation set, which is a widely-used language modeling task and often serves as a validation task for language model pre-training... |
| Hardware Specification | No | The paper mentions 'GPU memory constraints' when discussing micro batch size, but does not provide any specific details about the GPU models, CPU models, or any other hardware specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using the 'Adam optimizer' and evaluating with the 'lm-evaluation-harness framework', but it does not specify version numbers for any software components, programming languages, or libraries used in the implementation. |
| Experiment Setup | Yes | In the training process, a global batch size of 4 million tokens was utilized. The training was completed in approximately 7500 steps. The learning rate was set at 5×10⁻⁴. The Adam optimizer was used, with the hyperparameters configured as β1 = 0.9, β2 = 0.95, ϵ = 10⁻⁸. The architecture details of the pre-training models with 370M and 1.3B parameters are presented in Table 7. |
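The reported setup is internally consistent: 30 billion selected tokens divided by a global batch of 4 million tokens per step yields the ~7500 training steps quoted above. A minimal sketch of that sanity check (variable names are illustrative, not from the paper):

```python
# Sanity check of the reported training budget from the Experiment Setup row:
# 30B selected tokens at a global batch of 4M tokens per step.
TOTAL_TOKENS = 30_000_000_000   # tokens selected for pre-training
GLOBAL_BATCH_TOKENS = 4_000_000  # global batch size in tokens

steps = TOTAL_TOKENS // GLOBAL_BATCH_TOKENS
print(steps)  # 7500, matching the "approximately 7500 steps" reported

# Reported Adam hyperparameters, collected for reference.
adam_config = {"lr": 5e-4, "beta1": 0.9, "beta2": 0.95, "eps": 1e-8}
print(adam_config["lr"])  # 0.0005
```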