Organize the Web: Constructing Domains Enhances Pre-Training Data Curation
Authors: Alexander Wettig, Kyle Lo, Sewon Min, Hannaneh Hajishirzi, Danqi Chen, Luca Soldaini
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the practical value of constructing domains with WebOrganizer by training models with the domain mixtures produced by RegMix in Section 3. We show how to combine data mixing for topics and formats, and how domains can be used together with quality filters. All our experiments are implemented in the DataComp-LM (DCLM) framework (Li et al., 2024), using the 1b-1x competition pool. We follow best practices and use heuristic filters, followed by deduplication, to reduce the 1.6T raw token pool to a base corpus of 200B tokens. From this dataset, we select 29B tokens by sampling according to a domain mixture and train a 1B parameter model. Full details of our experimental setup can be found in Appendix E. We use OLMES (Gu et al., 2024) to evaluate models and their domain mixtures. We use a 5-shot setting on a suite of 9 tasks: MMLU (Hendrycks et al., 2021), HellaSwag (HSwag) (Zellers et al., 2019), PIQA (Bisk et al., 2020), WinoGrande (WinoG) (Sakaguchi et al., 2021), CommonsenseQA (CSQA) (Talmor et al., 2019), Social IQa (SIQA) (Sap et al., 2019), ARC-easy/challenge (ARC-e/ARC-c) (Clark et al., 2018), and OpenBookQA (OBQA) (Mihaylov et al., 2018). Table 1 shows the results of our main experiments with mixtures optimized for both MMLU and HellaSwag. |
| Researcher Affiliation | Collaboration | 1Princeton Language and Intelligence, Princeton University 2Allen Institute for Artificial Intelligence 3University of California, Berkeley 4Paul G. Allen School of Computer Science & Engineering, University of Washington. Correspondence to: Alexander Wettig <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Adaptive search for RegMix |
| Open Source Code | Yes | Website: weborganizer.allen.ai; Artifacts: hf.co/WebOrganizer; Code: CodeCreator/WebOrganizer. We open-source WebOrganizer as a tool for understanding, documenting and curating pre-training data. To encourage future work, we include the code for constructing domains and training domain classifiers, as well as the annotated pre-training corpus. |
| Open Datasets | Yes | Website: weborganizer.allen.ai; Artifacts: hf.co/WebOrganizer; Code: CodeCreator/WebOrganizer. We open-source WebOrganizer as a tool for understanding, documenting and curating pre-training data. To encourage future work, we include the code for constructing domains and training domain classifiers, as well as the annotated pre-training corpus. |
| Dataset Splits | Yes | We follow best practices and use heuristic filters, followed by deduplication, to reduce the 1.6T raw token pool to a base corpus of 200B tokens. From this dataset, we select 29B tokens by sampling according to a domain mixture and train a 1B parameter model. ... From this 200B token base corpus, we set apart approximately 1B tokens as a validation set and use the rest for selecting training data. |
| Hardware Specification | Yes | For the first stage of training, we annotate 1M web pages with Llama-3.1-8B-Instruct, and for the second stage, a subset of 100K web pages is annotated with Llama-3.1-405B-Instruct, using FP8 inference and 8× NVIDIA H100 GPUs. ... The 512 model runs require approximately 360 NVIDIA H100 hours. ... We speed up training by adding torch.compile, making a single training run take 183 NVIDIA H100 hours. |
| Software Dependencies | Yes | We initialize the classifiers with gte-base-en-v1.5 (Li et al., 2023b), a 140M parameter embedding model... We obtain training data by prompting Llama models to annotate web pages using the prompts described in Appendix A. ...we leverage the SGLang inference framework (Zheng et al., 2024)... The data is tokenized with the GPT-NeoX tokenizer (Black et al., 2022), as used by the DCLM model runs. The model architecture is based on the Llama architecture (Touvron et al., 2023), featuring SwiGLU activations (Shazeer, 2020) and RoPE positional embeddings (Su et al., 2024). |
| Experiment Setup | Yes | All our experiments are implemented in the DataComp-LM (DCLM) framework (Li et al., 2024), using the 1b-1x competition pool. We follow best practices and use heuristic filters, followed by deduplication, to reduce the 1.6T raw token pool to a base corpus of 200B tokens. From this dataset, we select 29B tokens by sampling according to a domain mixture and train a 1B parameter model. ... Small model training: We sample 1B tokens according to each training mixture and train small 50M parameter models on this data. ... The hyperparameters are given in Table 6: hidden size 512; intermediate size 1536; activation function SwiGLU; attention heads 8; num. blocks 8; RoPE base frequency 10000; peak learning rate 3e-3; cosine cooldown 3e-4; warmup ratio 10%; Adam βs (0.9, 0.95); batch size 128. |
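The setup above selects a 29B-token training set from the 200B-token base corpus by sampling documents in proportion to a domain mixture. A minimal sketch of that selection step (the function name, data layout, and with-replacement sampling are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def sample_domain_mixture(domain_docs, mixture, token_budget, rng=None):
    """Sample document IDs until a token budget is met, drawing each
    document's domain according to the target mixture weights.

    domain_docs:  dict mapping domain name -> list of (doc_id, n_tokens)
    mixture:      dict mapping domain name -> sampling weight
    token_budget: total number of tokens to select
    """
    rng = rng or np.random.default_rng(0)
    domains = list(mixture)
    probs = np.array([mixture[d] for d in domains], dtype=float)
    probs /= probs.sum()  # normalize in case weights don't sum to 1

    selected, total = [], 0
    while total < token_budget:
        # Pick a domain according to the mixture, then a document within it
        d = domains[rng.choice(len(domains), p=probs)]
        doc_id, n_tokens = domain_docs[d][rng.integers(len(domain_docs[d]))]
        selected.append(doc_id)
        total += n_tokens
    return selected, total
```

In practice a token-budget loop like this oversamples slightly past the budget on the final document; at the paper's scale (29B of 200B tokens) that overshoot is negligible.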