Aioli: A Unified Optimization Framework for Language Model Data Mixing
Authors: Mayee Chen, Michael Hu, Nicholas Lourie, Kyunghyun Cho, Christopher Ré
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate AIOLI in two settings by training 160M models on various combinations of data sources from Slim Pajama (Soboleva et al., 2023) (Section 6). First, we compare AIOLI to existing data mixing methods and find that AIOLI consistently outperforms stratified sampling on all 6 datasets, by an average of 0.274 and up to 0.439 points in test perplexity. |
| Researcher Affiliation | Collaboration | 1 Computer Science Department, Stanford University; 2 Center for Data Science, NYU; 3 Computer Science Department, NYU; 4 Prescient Design, Genentech |
| Pseudocode | Yes | Algorithm 1 AIOLI Algorithm 2 LEARNPARAMS |
| Open Source Code | No | The paper does not explicitly state that the source code for the methodology described is publicly available or provide a link to a code repository. |
| Open Datasets | Yes | We use a sampled version of Slim Pajama (Soboleva et al., 2023; Yoon, 2023), a pre-processed version of the Red Pajama pretraining dataset (Together.ai, 2023). |
| Dataset Splits | Yes | To obtain a test set, we shuffle and split the validation set from Slim Pajama-6B (Soboleva et al., 2023; Yoon, 2023) in half. |
| Hardware Specification | Yes | For the m=2,3 settings, experiments were run on an NVIDIA RTX 6000 Ada Generation GPU. For the m=7 setting, experiments were run on an NVIDIA A100 80 GB GPU. |
| Software Dependencies | No | The paper mentions software like 'PyTorch' and 'Flash Attention' but does not provide specific version numbers for any key software components. |
| Experiment Setup | Yes | We train 160M parameter GPT-style decoder-only LLMs. All settings use Flash Attention (Dao et al., 2022), a batch size of 8, a context length of 2048, and cosine learning rate decay from a starting learning rate of 5e-5 to 1e-5, with 500 steps of learning rate warmup. |
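The learning rate schedule in the Experiment Setup row can be sketched as a standalone function. This is a minimal illustration, not the authors' code: the total step count (`total_steps`) is not stated in the excerpt and is a free parameter here, and linear warmup is assumed (the paper excerpt says only "500 steps of learning rate warmup").

```python
import math

def lr_schedule(step, total_steps, warmup_steps=500, lr_max=5e-5, lr_min=1e-5):
    """Linear warmup to lr_max, then cosine decay from lr_max to lr_min.

    Values (5e-5 start, 1e-5 end, 500 warmup steps) follow the paper's
    reported setup; total_steps is an assumption left as a parameter.
    """
    if step < warmup_steps:
        # Linear ramp from ~0 up to lr_max over the warmup period.
        return lr_max * (step + 1) / warmup_steps
    # Fraction of the post-warmup schedule completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Cosine interpolation: lr_max at progress=0, lr_min at progress=1.
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```

In practice such a schedule is typically wrapped in a framework scheduler (e.g. a per-step multiplier passed to an optimizer) rather than called directly.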