DataMan: Data Manager for Pre-training Large Language Models
Authors: Ru Peng, Kexin Yang, Yawen Zeng, Junyang Lin, Dayiheng Liu, Junbo Zhao
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments validate our approach, using DataMan to select 30B tokens to train a 1.3B-parameter language model, demonstrating significant improvements in in-context learning (ICL), perplexity, and instruction-following ability over the state-of-the-art baseline. |
| Researcher Affiliation | Collaboration | ¹Zhejiang University, ²Alibaba Group |
| Pseudocode | No | The paper describes methods and processes, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps. |
| Open Source Code | No | We will release the code, all models, and the annotated DataPajama dataset, paving the way for the community to explore the guidelines between data and LLMs further. |
| Open Datasets | Yes | DataPajama is a curated subset of SlimPajama, which is itself a subset of RedPajama. Both SlimPajama and RedPajama are released on Hugging Face under the Apache 2.0 License. |
| Dataset Splits | Yes | We measure the perplexity over SlimPajama's validation set and test set, 500M tokens each. |
| Hardware Specification | Yes | Each model is trained on 32x NVIDIA A800 over 228 GPU hours. |
| Software Dependencies | Yes | We fine-tune the DataMan model using Qwen2-1.5B (Yang et al., 2024a), an advanced open-source 1.5B parameter language model, based on text generation loss. |
| Experiment Setup | Yes | This model is trained using a global batch size of 2048 sequences and a learning rate of 5 × 10⁻⁴ with a cosine learning rate decay to 5 × 10⁻⁵ and a linear warmup for the first 5% of training steps. We use a weight decay of 0.1 and train with Adam (Kingma, 2014) with hyperparameters (β₁, β₂) = (0.9, 0.95). |
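The quoted setup (cosine decay from 5 × 10⁻⁴ to 5 × 10⁻⁵ with a linear warmup over the first 5% of steps) can be sketched as a schedule function. This is a minimal illustration of that schedule, not the authors' released code; the function name and arguments are hypothetical.

```python
import math

def lr_schedule(step, total_steps, peak_lr=5e-4, min_lr=5e-5, warmup_frac=0.05):
    """Learning rate at `step`: linear warmup to peak_lr over the first
    warmup_frac of training, then cosine decay from peak_lr to min_lr.
    (Hyperparameter values taken from the quoted experiment setup.)"""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # linear warmup from 0 toward peak_lr
        return peak_lr * (step + 1) / warmup_steps
    # cosine decay over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For example, `lr_schedule(step, total_steps=100_000)` reaches the peak rate at the end of warmup (step 5,000 here) and decays smoothly to the minimum rate by the final step.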