DataMan: Data Manager for Pre-training Large Language Models

Authors: Ru Peng, Kexin Yang, Yawen Zeng, Junyang Lin, Dayiheng Liu, Junbo Zhao

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments validate our approach, using DataMan to select 30B tokens to train a 1.3B-parameter language model, demonstrating significant improvements in in-context learning (ICL), perplexity, and instruction-following ability over the state-of-the-art baseline.
Researcher Affiliation | Collaboration | 1Zhejiang University, 2Alibaba Group
Pseudocode | No | The paper describes methods and processes, but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks with structured steps.
Open Source Code | No | We will release the code, all models, and the annotated DataPajama dataset, paving the way for the community to further explore the guidelines between data and LLMs.
Open Datasets | Yes | DataPajama is a curated subset of SlimPajama, which is itself a subset of RedPajama. Both SlimPajama and RedPajama are released on Hugging Face under the Apache 2.0 License.
Dataset Splits | Yes | We measure the perplexity over SlimPajama's validation set and test set, 500M tokens each.
Hardware Specification | Yes | Each model is trained on 32x NVIDIA A800 over 228 GPU hours.
Software Dependencies | Yes | We fine-tune the DataMan model using Qwen2-1.5B (Yang et al., 2024a), an advanced open-source 1.5B-parameter language model, based on text generation loss.
Experiment Setup | Yes | This model is trained using a global batch size of 2048 sequences and a learning rate of 5×10^-4 with a cosine learning rate decay to 5×10^-5 and a linear warmup for the first 5% of training steps. We use a weight decay of 0.1 and train with Adam (Kingma, 2014) with hyperparameters β = (0.9, 0.95).
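The Experiment Setup row describes a standard warmup-then-cosine learning-rate schedule (peak 5×10^-4, decaying to 5×10^-5, with a linear warmup over the first 5% of steps). As a minimal sketch of what that schedule looks like, assuming the reported hyperparameters and a hypothetical `lr_at` helper (the paper does not provide code, so function and parameter names here are illustrative):

```python
import math

def lr_at(step, total_steps, peak=5e-4, floor=5e-5, warmup_frac=0.05):
    """Learning rate at a given step: linear warmup to `peak` over the
    first `warmup_frac` of training, then cosine decay down to `floor`."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate.
        return peak * step / warmup_steps
    # Cosine decay from peak to floor over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

# Example over a hypothetical 1000-step run:
print(lr_at(25, 1000))    # halfway through warmup -> 2.5e-4
print(lr_at(50, 1000))    # end of warmup -> peak 5e-4
print(lr_at(1000, 1000))  # end of training -> floor 5e-5
```

In practice this would be paired with Adam using (β1, β2) = (0.9, 0.95) and weight decay 0.1, as stated in the setup row.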