Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training
Authors: Mozhi Zhang, Howe Tissue, Lu Wang, Xipeng Qiu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that DOMAIN2VEC helps find the data mixture that enhances downstream task performance with minimal computational overhead. Specifically, DOMAIN2VEC achieves the same validation loss on Pile-CC using only 51.5% of the compute required when training on the original mixture of The Pile dataset. Under an equivalent compute budget, DOMAIN2VEC improves downstream performance by an average of 2.83%. We validate the effectiveness of DOMAIN2VEC+DA² and DOMAIN2VEC+RegMix in text generation and downstream tasks. Experimental results show that our method can accurately predict the performance of various data mixtures without training proxy models. |
| Researcher Affiliation | Collaboration | ¹School of Computer Science, Fudan University, Shanghai, China; ²Ritzz-AI. Correspondence to: Howe Tissue (project lead) <EMAIL>. |
| Pseudocode | Yes | In Algorithm 1, we show the pseudocode for acquiring the domain vector for pretraining datasets. In Algorithms 2 and 3, we show the pseudocode for how to use DOMAIN2VEC to find the optimal data mixture, covering the Distribution Alignment Assumption and applying DOMAIN2VEC to RegMix. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing their source code or a link to a code repository. It mentions using "LM Evaluation Harness (Gao et al., 2024)", but this refers to a third-party tool rather than the authors' own implementation code. |
| Open Datasets | Yes | Our training datasets include C4 (Raffel et al., 2020) and Knowledge Pile (Fei et al., 2024). We select 20 validation datasets from The Pile (Gao et al., 2021) and RedPajama (Weber et al., 2024). |
| Dataset Splits | Yes | We mix C4 and Knowledge Pile with different data mixtures as the training set as shown in Table 1. ... We select 20 validation datasets from The Pile (Gao et al., 2021) and RedPajama (Weber et al., 2024). ... We generate 100,000 data mixtures from a Dirichlet distribution based on the token distribution of these components. Using these mixtures, we predict the optimal data mixture by our proposed two methods. |
| Hardware Specification | No | The paper mentions training LLaMA-like models of various sizes (83M, 1.6B, 106M, 290M, 595M, 1B parameters) and provides architectural parameters in Table 7, but it does not specify any concrete hardware details such as GPU models, CPU types, or cloud computing resources used for these experiments. |
| Software Dependencies | No | The paper mentions using specific tools and optimizers such as 'AdamW' and 'LightGBM', and models such as 'LLaMA-like' and 'bge-small-en-v1.5' / 'bge-small-zh-v1.5', but it does not provide specific version numbers for any underlying software dependencies (e.g., Python, PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | Both models have a batch size of 1.5M tokens and a maximum sequence length of 4,096. We use the AdamW optimizer (Loshchilov & Hutter, 2017) with gradient clipping at 1.0. The learning rate linearly warms up to 2e-4 over the first 100 steps, then decays to 2e-5 using a cosine scheduler over 10,000 steps. More parameters are detailed in Table 7. All models adopt a batch size of 1M tokens and a maximum sequence length of 4,096. We apply the AdamW (Loshchilov & Hutter, 2017) optimizer with gradient clipping at 1.0. The learning rate linearly warms up to 6e-4 over 1,000 steps, then decays to 0 using a cosine scheduler at the end of training. |
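The "Dataset Splits" row quotes the paper's procedure of drawing 100,000 candidate data mixtures from a Dirichlet distribution shaped by the components' token distribution. A minimal sketch of that sampling step is shown below; the component token proportions and the concentration scale (`alpha_scale = 10`) are hypothetical values, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical token-count proportions of the training components.
# The paper bases the Dirichlet prior on the actual token distribution.
token_proportions = np.array([0.6, 0.3, 0.1])

# Concentration scale is our assumption; it controls how tightly the
# sampled mixtures cluster around the token distribution.
alpha_scale = 10.0

# Sample 100,000 candidate data mixtures, as described in the paper.
n_mixtures = 100_000
mixtures = rng.dirichlet(alpha=token_proportions * alpha_scale, size=n_mixtures)

# Each row is a valid mixture: non-negative weights summing to 1.
assert np.allclose(mixtures.sum(axis=1), 1.0)
```

Each sampled row can then be scored by the paper's predictors (DA² or the RegMix-style regression) to pick the best mixture without training proxy models.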
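The "Experiment Setup" row describes a linear warmup followed by cosine decay (2e-4 peak, 2e-5 floor, 100 warmup steps, 10,000 total steps for the first configuration). A small sketch of that schedule, assuming the standard warmup-then-cosine form; the function name and exact interpolation are our assumptions, not the authors' code:

```python
import math

def lr_at_step(step, peak=2e-4, floor=2e-5, warmup=100, total=10_000):
    """Linear warmup to `peak`, then cosine decay to `floor`.

    Defaults mirror the paper's reported schedule for the first setup;
    the curve shape between endpoints is a common convention we assume.
    """
    if step < warmup:
        # Linear ramp from 0 to the peak learning rate.
        return peak * step / warmup
    # Cosine anneal from peak down to the floor over the remaining steps.
    progress = (step - warmup) / (total - warmup)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))
```

For the second configuration in the same row, the call would use `peak=6e-4, floor=0.0, warmup=1_000` with `total` set to the full training length.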