Metadata Conditioning Accelerates Language Model Pre-training
Authors: Tianyu Gao, Alexander Wettig, Luxi He, Yihe Dong, Sadhika Malladi, Danqi Chen
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that MeCo enables a 1.6B model to achieve the same average downstream performance as a standard pre-trained model using 33% less training data. MeCo exhibits consistent gains across model scales (600M, 1.6B, 3B, and 8B) and pre-training corpora (C4, RefinedWeb, and DCLM). Table 1: Our main experimental results of pre-training a 1.6B language model on 160B tokens from DCLM. MeCo significantly outperforms standard pre-training and achieves average performance equivalent to the 240B-token baseline while using 33% less data. |
| Researcher Affiliation | Academia | 1Princeton Language and Intelligence, Princeton University. Correspondence to: Tianyu Gao <EMAIL>. |
| Pseudocode | No | The paper describes methods and processes in narrative text and figures (e.g., Figure 1 illustrates data comparison), but it does not contain any formally structured pseudocode or algorithm blocks with labeled steps. |
| Open Source Code | Yes | Our models, data, and code are available at https://github.com/princeton-pli/MeCo. |
| Open Datasets | Yes | Pre-training data. We use the best-performing open-source pre-training corpus, DCLM-Baseline (Li et al., 2024), for our main experiments. Additionally, we conduct experiments with two other data sources: a reproduction of RefinedWeb (Penedo et al., 2023) from Li et al. (2024) and the C4 dataset (Raffel et al., 2020). |
| Dataset Splits | Yes | Evaluation. We adopt the OLMES suite (Gu et al., 2024) for evaluation, which includes the following tasks: MMLU (Hendrycks et al., 2021), ARC-Easy (ARC-e; Clark et al., 2018), ARC-Challenge (ARC-c; Clark et al., 2018), CommonsenseQA (CSQA; Talmor et al., 2019), HellaSwag (HSwag; Zellers et al., 2019), OpenBookQA (OBQA; Mihaylov et al., 2018), PIQA (Bisk et al., 2020), Social IQa (SIQA; Sap et al., 2019), and WinoGrande (WG; Sakaguchi et al., 2021). We also add the popular TruthfulQA dataset (TruQA; Lin et al., 2022). Throughout the paper, we report the average performance across all 10 tasks as Avg. Unless specified, we always report 5-shot in-context learning results. OLMES enhances evaluation reliability by offering three key features: (1) it provides manually-curated in-context examples for each task; (2) it evaluates with both a multiple-choice format and a cloze format, and takes the better of the two; (3) it applies an ablated calibration method (Brown et al., 2020; Holtzman et al., 2021) to each individual task. During evaluation, we sample 1,000 examples for each task, which improves efficiency while providing the same reliable results as full evaluation. A.3. Cooldown details: To ensure the cooldown stage does not see data repeated from the conditional training stage, we use a different subset of data for cooldown in all our DCLM experiments. |
| Hardware Specification | Yes | Our main models (1.6B, 160B tokens) take roughly 2 days to train on 32 H100 GPUs. Table 9: Resources required to train the models in our experiments (H100 GPU hours). |
| Software Dependencies | No | We utilize the Llama (Touvron et al., 2023a;b; Dubey et al., 2024) version of the Transformer architecture (Vaswani et al., 2017) and the Llama-3 tokenizer for all our experiments. We employ standard optimization settings for language models, i.e., the AdamW optimizer and a cosine learning rate schedule. We follow Li et al. (2024) for hyperparameters and the details can be found in A.1. The paper mentions specific architectures (Llama, Transformer) and an optimizer (AdamW), but does not specify software versions for programming languages, libraries (e.g., PyTorch, TensorFlow), or other dependencies. |
| Experiment Setup | Yes | A.1. Hyperparameters: Table 6 shows the hyperparameter settings used in our experiments. We follow Li et al. (2024) for the high learning rate and weight decay, except for the 8B model, which requires a lower learning rate for numerical stability. Optimizer: AdamW (β1 = 0.9, β2 = 0.95); Learning rate: 3e-3 (5e-4 for the 8B model); Weight decay: 0.033 (0.1 for the 8B model); Batch size: 4M tokens; Warmup: 5% linear warmup; Schedule: cosine decay to 10% of the peak learning rate; Sequence length: pack to 8192 tokens. |
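
The learning-rate schedule reported in the Experiment Setup row (5% linear warmup to a 3e-3 peak, then cosine decay to 10% of the peak) can be sketched as a step-wise function. This is a minimal illustration of the reported hyperparameters, not the authors' training code; the function name and signature are hypothetical.

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-3, warmup_frac=0.05, final_frac=0.10):
    """Learning rate at a given step: linear warmup over warmup_frac of
    training, then cosine decay from peak_lr to final_frac * peak_lr."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        # Linear warmup from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay: progress goes 0 -> 1 over the remaining steps,
    # so the rate falls from peak_lr to final_frac * peak_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (final_frac + (1.0 - final_frac) * cosine)
```

For example, with 1,000 total steps the rate is 0 at step 0, reaches the 3e-3 peak at the end of warmup (step 50), and ends at 3e-4 (10% of peak) at step 1,000.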
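
For context on the "conditional training stage" and the metadata-free cooldown mentioned in the Dataset Splits row: metadata conditioning prepends each document's source metadata (e.g., its URL) to the text during the main pre-training stage, and drops it during cooldown. The template below is a hypothetical sketch for illustration; the paper's exact formatting may differ.

```python
def format_document(text, url=None):
    """Build a pre-training example. Prepending the source URL is a
    hypothetical rendering of metadata conditioning; passing url=None
    mimics the metadata-free cooldown stage."""
    if url is None:
        return text
    # Conditional stage: metadata prefix, blank line, then the document body.
    return f"{url}\n\n{text}"
```

Since the cooldown stage sees no metadata, the model can be prompted at inference time exactly as a standard pre-trained model.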