LLM Data Selection and Utilization via Dynamic Bi-level Optimization
Authors: Yang Yu, Kai Han, Hang Zhou, Yehui Tang, Kaiqi Huang, Yunhe Wang, Dacheng Tao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that DWM enhances the performance of models trained with randomly-selected data, and the learned weighting model can be transferred to enhance other data selection methods and models of different sizes. Moreover, we further analyze how a model's data preferences evolve throughout training, providing new insights into the data preference of the model during training. |
| Researcher Affiliation | Collaboration | 1School of Artificial Intelligence, University of Chinese Academy of Sciences 2The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences 3Huawei Noah's Ark Lab 4College of Intelligence and Computing, Tianjin University 5Nanyang Technological University. |
| Pseudocode | No | The paper describes the approach using prose and mathematical formulations (e.g., Equation 1, 2, 3), and provides a framework diagram in Figure 1. There are no explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for the methodology described, nor does it provide a direct link to a code repository for their work. It only references a third-party framework (lm-evaluation-harness) used in their evaluation. |
| Open Datasets | Yes | To evaluate different data selection methods, we utilize the training data selected from the popular dataset SlimPajama (Soboleva et al., 2023), which is the largest multi-corpora, open-source dataset for training large language models... URL https://huggingface.co/datasets/cerebras/SlimPajama-627B. ...we adopt LAMBADA (Paperno et al., 2016) as the validation set... |
| Dataset Splits | Yes | Following the principles of the scaling law and the QuRating (Wettig et al., 2024) framework, a total of 30 billion tokens are selected by different methods to train the model... we adopt LAMBADA (Paperno et al., 2016) as the validation set, which is a widely-used language modeling task and often serves as a validation task for language model pre-training... |
| Hardware Specification | No | The paper mentions 'GPU memory constraints' when discussing micro batch size, but does not provide any specific details about the GPU models, CPU models, or any other hardware specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using the 'Adam optimizer' and evaluating with the 'lm-evaluation-harness framework', but it does not specify version numbers for any software components, programming languages, or libraries used in the implementation. |
| Experiment Setup | Yes | In the training process, a global batch size of 4 million tokens was utilized. The training was completed in approximately 7500 steps. The learning rate was set at 5×10⁻⁴. The Adam optimizer was used, with the hyperparameters configured as β1 = 0.9, β2 = 0.95, ϵ = 10⁻⁸. The architecture details of the pre-training models with 370M and 1.3B parameters are presented in Table 7. |
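The reported setup is internally consistent: 30 billion selected tokens divided by a global batch of 4 million tokens per step yields the ~7500 training steps quoted above. A minimal sketch of that sanity check (variable names are illustrative, not from the paper):

```python
# Sanity check of the reported training budget from the Experiment Setup row:
# 30B selected tokens at a global batch of 4M tokens per step.
TOTAL_TOKENS = 30_000_000_000   # tokens selected for pre-training
GLOBAL_BATCH_TOKENS = 4_000_000  # global batch size in tokens

steps = TOTAL_TOKENS // GLOBAL_BATCH_TOKENS
print(steps)  # 7500, matching the "approximately 7500 steps" reported

# Reported Adam hyperparameters, collected for reference.
adam_config = {"lr": 5e-4, "beta1": 0.9, "beta2": 0.95, "eps": 1e-8}
print(adam_config["lr"])  # 0.0005
```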