DataMan: Data Manager for Pre-training Large Language Models

Authors: Ru Peng, Kexin Yang, Yawen Zeng, Junyang Lin, Dayiheng Liu, Junbo Zhao

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments validate our approach, using DataMan to select 30B tokens to train a 1.3B-parameter language model, demonstrating significant improvements in in-context learning (ICL), perplexity, and instruction-following ability over the state-of-the-art baseline.
Researcher Affiliation | Collaboration | 1Zhejiang University, 2Alibaba Group
Pseudocode | No | The paper describes methods and processes, but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks with structured steps.
Open Source Code | No | We will release the code, all models, and the annotated DataPajama dataset, paving the way for the community to further explore the guidelines between data and LLMs.
Open Datasets | Yes | DataPajama is a curated subset of SlimPajama, which is itself a subset of RedPajama. Both SlimPajama and RedPajama are released on Hugging Face under the Apache 2.0 License.
Dataset Splits | Yes | We measure the perplexity over SlimPajama's validation set and test set, 500M tokens each.
Hardware Specification | Yes | Each model is trained on 32x NVIDIA A800 over 228 GPU hours.
Software Dependencies | Yes | We fine-tune the DataMan model using Qwen2-1.5B (Yang et al., 2024a), an advanced open-source 1.5B-parameter language model, based on text generation loss.
Experiment Setup | Yes | This model is trained using a global batch size of 2048 sequences and a learning rate of 5×10^-4 with a cosine learning rate decay to 5×10^-5 and a linear warmup for the first 5% of training steps. We use a weight decay of 0.1 and train with Adam (Kingma, 2014) with hyperparameters β = (0.9, 0.95).
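The Experiment Setup row describes a standard warmup-then-cosine learning-rate schedule (peak 5×10^-4, decaying to 5×10^-5, with a linear warmup over the first 5% of steps). As a minimal sketch of what that schedule looks like, assuming the reported hyperparameters and a hypothetical `lr_at` helper (the paper does not provide code, so function and parameter names here are illustrative):

```python
import math

def lr_at(step, total_steps, peak=5e-4, floor=5e-5, warmup_frac=0.05):
    """Learning rate at a given step: linear warmup to `peak` over the
    first `warmup_frac` of training, then cosine decay down to `floor`."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate.
        return peak * step / warmup_steps
    # Cosine decay from peak to floor over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

# Example over a hypothetical 1000-step run:
print(lr_at(25, 1000))    # halfway through warmup -> 2.5e-4
print(lr_at(50, 1000))    # end of warmup -> peak 5e-4
print(lr_at(1000, 1000))  # end of training -> floor 5e-5
```

In practice this would be paired with Adam using (β1, β2) = (0.9, 0.95) and weight decay 0.1, as stated in the setup row.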