Organize the Web: Constructing Domains Enhances Pre-Training Data Curation
Authors: Alexander Wettig, Kyle Lo, Sewon Min, Hannaneh Hajishirzi, Danqi Chen, Luca Soldaini
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the practical value of constructing domains with WebOrganizer by training models with the domain mixtures produced by RegMix in Section 3. We show how to combine data mixing for topics and formats, and how domains can be used together with quality filters. All our experiments are implemented in the DataComp-LM (DCLM) framework (Li et al., 2024), using the 1b-1x competition pool. We follow best practices and use heuristic filters, followed by deduplication, to reduce the 1.6T raw token pool to a base corpus of 200B tokens. From this dataset, we select 29B tokens by sampling according to a domain mixture and train a 1B parameter model. Full details of our experimental setup can be found in Appendix E. We use OLMES (Gu et al., 2024) to evaluate models and their domain mixtures. We use a 5-shot setting on a suite of 9 tasks: MMLU (Hendrycks et al., 2021), HellaSwag (HSwag) (Zellers et al., 2019), PIQA (Bisk et al., 2020), WinoGrande (WinoG) (Sakaguchi et al., 2021), CommonsenseQA (CSQA) (Talmor et al., 2019), Social IQa (SIQA) (Sap et al., 2019), ARC-easy/challenge (ARC-e/ARC-c) (Clark et al., 2018), and OpenBookQA (OBQA) (Mihaylov et al., 2018). Table 1 shows the results of our main experiments with mixtures optimized for both MMLU and HellaSwag. |
| Researcher Affiliation | Collaboration | 1Princeton Language and Intelligence, Princeton University 2Allen Institute for Artificial Intelligence 3University of California, Berkeley 4Paul G. Allen School of Computer Science & Engineering, University of Washington. Correspondence to: Alexander Wettig <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Adaptive search for RegMix |
| Open Source Code | Yes | Website: weborganizer.allen.ai; Artifacts: hf.co/WebOrganizer; Code: CodeCreator/WebOrganizer. We open-source WebOrganizer as a tool for understanding, documenting and curating pre-training data. To encourage future work, we include the code for constructing domains and training domain classifiers, as well as the annotated pre-training corpus. |
| Open Datasets | Yes | Website: weborganizer.allen.ai; Artifacts: hf.co/WebOrganizer; Code: CodeCreator/WebOrganizer. We open-source WebOrganizer as a tool for understanding, documenting and curating pre-training data. To encourage future work, we include the code for constructing domains and training domain classifiers, as well as the annotated pre-training corpus. |
| Dataset Splits | Yes | We follow best practices and use heuristic filters, followed by deduplication, to reduce the 1.6T raw token pool to a base corpus of 200B tokens. From this dataset, we select 29B tokens by sampling according to a domain mixture and train a 1B parameter model. ... From this 200B token base corpus, we set apart approximately 1B tokens as a validation set and use the rest for selecting training data. |
| Hardware Specification | Yes | For the first stage of training, we annotate 1M web pages with Llama-3.1-8B-Instruct, and for the second stage, a subset of 100K web pages is annotated with Llama-3.1-405B-Instruct, using FP8 inference and 8× NVIDIA H100 GPUs. ... The 512 model runs require approximately 360 NVIDIA H100 hours. ... We speed up training by adding torch.compile, making a single training run take 183 NVIDIA H100 hours. |
| Software Dependencies | Yes | We initialize the classifiers with gte-base-en-v1.5 (Li et al., 2023b), a 140M parameter embedding model... We obtain training data by prompting Llama models to annotate web pages using the prompts described in Appendix A. ...we leverage the SGLang inference framework (Zheng et al., 2024)... The data is tokenized with the GPT-NeoX tokenizer (Black et al., 2022), as used by the DCLM model runs. The model architecture is based on the Llama architecture (Touvron et al., 2023), featuring SwiGLU activations (Shazeer, 2020) and RoPE positional embeddings (Su et al., 2024). |
| Experiment Setup | Yes | All our experiments are implemented in the DataComp-LM (DCLM) framework (Li et al., 2024), using the 1b-1x competition pool. We follow best practices and use heuristic filters, followed by deduplication, to reduce the 1.6T raw token pool to a base corpus of 200B tokens. From this dataset, we select 29B tokens by sampling according to a domain mixture and train a 1B parameter model. ... Small model training: We sample 1B tokens according to each training mixture and train small 50M parameter models on this data. ... The hyperparameters are given in Table 6: hidden size 512; intermediate size 1536; activation function SwiGLU; attention heads 8; num. blocks 8; RoPE base frequency 10000; peak learning rate 3e-3; cosine cooldown 3e-4; warmup ratio 10%; Adam βs (0.9, 0.95); batch size 128. |
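The setup above selects a 29B-token training set from the 200B-token base corpus by sampling documents in proportion to a domain mixture. A minimal sketch of that selection step (the function name, data layout, and with-replacement sampling are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def sample_domain_mixture(domain_docs, mixture, token_budget, rng=None):
    """Sample document IDs until a token budget is met, drawing each
    document's domain according to the target mixture weights.

    domain_docs:  dict mapping domain name -> list of (doc_id, n_tokens)
    mixture:      dict mapping domain name -> sampling weight
    token_budget: total number of tokens to select
    """
    rng = rng or np.random.default_rng(0)
    domains = list(mixture)
    probs = np.array([mixture[d] for d in domains], dtype=float)
    probs /= probs.sum()  # normalize in case weights don't sum to 1

    selected, total = [], 0
    while total < token_budget:
        # Pick a domain according to the mixture, then a document within it
        d = domains[rng.choice(len(domains), p=probs)]
        doc_id, n_tokens = domain_docs[d][rng.integers(len(domain_docs[d]))]
        selected.append(doc_id)
        total += n_tokens
    return selected, total
```

In practice a token-budget loop like this oversamples slightly past the budget on the final document; at the paper's scale (29B of 200B tokens) that overshoot is negligible.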