Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training

Authors: Mozhi Zhang, Howe Tissue, Lu Wang, Xipeng Qiu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that DOMAIN2VEC helps find data mixtures that enhance downstream task performance with minimal computational overhead. Specifically, DOMAIN2VEC achieves the same validation loss on Pile-CC using only 51.5% of the compute required when training on the original mixture of The Pile dataset. Under an equivalent compute budget, DOMAIN2VEC improves downstream performance by an average of 2.83%. We validate the effectiveness of DOMAIN2VEC+DA² and DOMAIN2VEC+RegMix on text generation and downstream tasks. Experimental results show that our method can accurately predict the performance of various data mixtures without training proxy models.
Researcher Affiliation | Collaboration | (1) School of Computer Science, Fudan University, Shanghai, China; (2) Ritzz-AI. Correspondence to: Howe Tissue (project lead) <EMAIL>.
Pseudocode | Yes | Algorithm 1 gives the pseudocode for acquiring the domain vector of a pretraining dataset. Algorithms 2 and 3 give the pseudocode for using DOMAIN2VEC to find the optimal data mixture, covering the Distribution Alignment Assumption and the application of DOMAIN2VEC to RegMix.
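The domain-vector idea behind Algorithm 1 can be sketched as follows. This is a rough illustration only: `classify` stands in for the paper's meta-domain classifier, and the function name, interface, and toy keyword rule are all assumptions, not the authors' implementation.

```python
from collections import Counter

def domain_vector(documents, classify, meta_domains):
    """Represent a dataset as its normalized distribution over fixed meta-domains.

    `classify` is a stand-in for a meta-domain classifier; this interface
    is assumed for illustration, not taken from the paper.
    """
    counts = Counter(classify(doc) for doc in documents)
    total = sum(counts.values())
    return [counts.get(d, 0) / total for d in meta_domains]

# Toy usage with a hypothetical keyword-based "classifier".
docs = ["def f(): pass", "the court ruled today", "print('hi')"]
vec = domain_vector(
    docs,
    classify=lambda d: "code" if ("def " in d or "print" in d) else "other",
    meta_domains=["code", "other"],
)
```

The resulting vector sums to 1 and can be compared across datasets regardless of their size.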
Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository. It mentions using the "LM Evaluation Harness (Gao et al., 2024)", but this is a third-party tool rather than the authors' own implementation.
Open Datasets | Yes | Our training datasets include C4 (Raffel et al., 2020) and Knowledge Pile (Fei et al., 2024). We select 20 validation datasets from The Pile (Gao et al., 2021) and RedPajama (Weber et al., 2024).
Dataset Splits | Yes | We mix C4 and Knowledge Pile with different data mixtures as the training set, as shown in Table 1. ... We select 20 validation datasets from The Pile (Gao et al., 2021) and RedPajama (Weber et al., 2024). ... We generate 100,000 data mixtures from a Dirichlet distribution based on the token distribution of these components. Using these mixtures, we predict the optimal data mixture with our two proposed methods.
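The mixture-sampling step quoted above can be sketched as follows. The per-component token shares and the concentration scale factor are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Sketch of the quoted procedure: draw candidate data mixtures from a
# Dirichlet distribution whose concentration reflects each corpus
# component's token share. Shares and the scale of 10.0 are assumed.
rng = np.random.default_rng(0)

token_share = np.array([0.30, 0.25, 0.20, 0.15, 0.10])  # illustrative token fractions
n_mixtures = 100_000

# Each row is one candidate mixture: non-negative weights summing to 1.
mixtures = rng.dirichlet(alpha=token_share * 10.0, size=n_mixtures)
```

Scaling the concentration vector controls how tightly the sampled mixtures cluster around the base token distribution.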
Hardware Specification | No | The paper mentions training LLaMA-like models of various sizes (83M, 106M, 290M, 595M, 1B, and 1.6B parameters) and provides architectural parameters in Table 7, but it does not specify concrete hardware details such as GPU models, CPU types, or cloud computing resources used for the experiments.
Software Dependencies | No | The paper mentions specific tools and optimizers such as AdamW and LightGBM, and models such as LLaMA-like architectures and bge-small-en-v1.5/bge-small-zh-v1.5, but it does not provide version numbers for any underlying software dependencies (e.g., Python, PyTorch, TensorFlow, or CUDA).
Experiment Setup | Yes | Both models have a batch size of 1.5M tokens and a maximum sequence length of 4,096. We use the AdamW optimizer (Loshchilov & Hutter, 2017) with gradient clipping at 1.0. The learning rate warms up linearly to 2e-4 over the first 100 steps, then decays to 2e-5 with a cosine scheduler over 10,000 steps. More parameters are detailed in Table 7. All models adopt a batch size of 1M tokens and a maximum sequence length of 4,096. We apply the AdamW optimizer (Loshchilov & Hutter, 2017) with gradient clipping at 1.0. The learning rate warms up linearly to 6e-4 over 1,000 steps, then decays to 0 with a cosine scheduler at the end of training.