Aioli: A Unified Optimization Framework for Language Model Data Mixing
Authors: Mayee Chen, Michael Hu, Nicholas Lourie, Kyunghyun Cho, Christopher Ré
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate AIOLI in two settings by training 160M models on various combinations of data sources from Slim Pajama (Soboleva et al., 2023) (Section 6). First, we compare AIOLI to existing data mixing methods and find that AIOLI consistently outperforms stratified sampling on all 6 datasets, by an average of 0.274 and up to 0.439 points in test perplexity. |
| Researcher Affiliation | Collaboration | 1 Computer Science Department, Stanford University; 2 Center for Data Science, NYU; 3 Computer Science Department, NYU; 4 Prescient Design, Genentech |
| Pseudocode | Yes | Algorithm 1 AIOLI Algorithm 2 LEARNPARAMS |
| Open Source Code | No | The paper does not explicitly state that the source code for the methodology described is publicly available or provide a link to a code repository. |
| Open Datasets | Yes | We use a sampled version of Slim Pajama (Soboleva et al., 2023; Yoon, 2023), a pre-processed version of the Red Pajama pretraining dataset (Together.ai, 2023). |
| Dataset Splits | Yes | To obtain a test set, we shuffle and split the validation set from Slim Pajama-6B (Soboleva et al., 2023; Yoon, 2023) in half. |
| Hardware Specification | Yes | For the m=2,3 settings, experiments were run on an NVIDIA RTX 6000 Ada Generation GPU. For the m=7 setting, experiments were run on an NVIDIA A100 80 GB GPU. |
| Software Dependencies | No | The paper mentions software like 'PyTorch' and 'Flash Attention' but does not provide specific version numbers for any key software components. |
| Experiment Setup | Yes | We train 160M parameter GPT-style decoder-only LLMs. All settings use Flash Attention (Dao et al., 2022), a batch size of 8, a context length of 2048, and cosine learning rate decay from a starting learning rate of 5e-5 to 1e-5, with 500 steps of learning rate warmup. |
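The learning rate schedule in the Experiment Setup row can be sketched as a standalone function. This is a minimal illustration, not the authors' code: the total step count (`total_steps`) is not stated in the excerpt and is a free parameter here, and linear warmup is assumed (the paper excerpt says only "500 steps of learning rate warmup").

```python
import math

def lr_schedule(step, total_steps, warmup_steps=500, lr_max=5e-5, lr_min=1e-5):
    """Linear warmup to lr_max, then cosine decay from lr_max to lr_min.

    Values (5e-5 start, 1e-5 end, 500 warmup steps) follow the paper's
    reported setup; total_steps is an assumption left as a parameter.
    """
    if step < warmup_steps:
        # Linear ramp from ~0 up to lr_max over the warmup period.
        return lr_max * (step + 1) / warmup_steps
    # Fraction of the post-warmup schedule completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Cosine interpolation: lr_max at progress=0, lr_min at progress=1.
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```

In practice such a schedule is typically wrapped in a framework scheduler (e.g. a per-step multiplier passed to an optimizer) rather than called directly.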