A Survey on Data Selection for Language Models
Authors: Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, William Yang Wang
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | To narrow this gap in knowledge, we present a comprehensive review of existing literature on data selection methods and related research areas, providing a taxonomy of existing approaches. By describing the current landscape of research, this work aims to accelerate progress in data selection by establishing an entry point for new and established researchers. |
| Researcher Affiliation | Collaboration | Alon Albalak,* UC Santa Barbara, Synth Labs Yanai Elazar, Allen Institute for AI, University of Washington Sang Michael Xie, Stanford University Shayne Longpre, Massachusetts Institute of Technology Nathan Lambert, Allen Institute for AI Xinyi Wang, UC Santa Barbara Niklas Muennighoff, Contextual AI Bairu Hou, UC Santa Barbara Liangming Pan, UC Santa Barbara Haewon Jeong, UC Santa Barbara Colin Raffel, University of Toronto, Vector Institute Shiyu Chang, UC Santa Barbara Tatsunori Hashimoto, Stanford University William Yang Wang, UC Santa Barbara |
| Pseudocode | No | The paper is a survey of existing literature and does not present novel algorithms in structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper is a survey and describes existing tools and methods developed by other researchers or, in other contexts, by some of the authors, rather than providing source code for its own methodology. For example, it mentions "The CCNet pipeline (Wenzek et al., 2020) is a commonly used tool for downloading and cleaning Common Crawl data." but does not release code specific to this survey paper. |
| Open Datasets | Yes | Among these sources is Common Crawl, a collection of around 250 billion webpages scraped from the internet. These webpages amount to around 11 petabytes of data, collected from internet scraping efforts since 2008, with an additional 3-5 billion new web pages being crawled monthly. Due to the massive size of pretraining corpora, a common goal of data selection during pretraining is to remove significant quantities of data through a series of filters (Conneau & Lample, 2019; Raffel et al., 2020; Wenzek et al., 2020; Gao et al., 2020; Rae et al., 2021; Lee et al., 2022a) that aim to retain only data that is deemed high-quality. (...) The most popular datasets used for data selection until now have been RedPajama (Skill-it (Chen et al., 2023c), SlimPajama (Shen et al., 2024), data mixing laws (Ye et al., 2024)), The Pile (ODM (Albalak et al., 2023a), DSIR (Xie et al., 2023b), DoReMi (Xie et al., 2023a), selection for encoders (Feng et al., 2022)), and C4 (DsDm (Engstrom et al., 2024)), each with pros and cons. C4 (Raffel et al., 2020) (750GB of data) is the oldest and contains only web-crawled data. The Pile (Gao et al., 2020) (800GB of data) contains slightly more data than C4, but only ~28% of the total dataset comes from web scrapes; the remaining 72% comes from a wide variety of domains including books, scientific articles, and GitHub. Lastly, RedPajama (1T tokens) is the newest and largest of the commonly used datasets, and was originally intended to be an open-source re-creation of the dataset used to train LLaMA. Recently, RedPajama-2 (30T tokens) was released, which may be an even better resource for testing data selection. RedPajama-2 comes with precomputed quality signals, including those implemented for C4 and RefinedWeb, allowing for experimentation on selection mechanisms under controlled settings. Another beneficial aspect of RedPajama-2 is that it contains a huge amount of data, allowing researchers to filter the data down to various scales, ranging from billions to trillions of tokens. One such effort is DataComp (Gadre et al., 2023), a benchmark for multimodal dataset design. DataComp provides two tracks for participation: the filtering track and the bring-your-own-data track. The filtering track is of particular interest, as it gives participants a common dataset (12.8B image-text pairs from Common Crawl), and the participants' goal is to determine the best subset to train a model on. Additionally, Gadre et al. (2023) provide over 300 baseline experiments and find that, similar to the language-only setting, a smaller, more selective dataset can lead to models that generalize better than those trained on larger datasets. Of course, this still comes with the same limitations as all previously discussed sections, where the exact evaluation setting plays a large role in what data is best. Specifically, DataComp evaluates models' abilities to perform image classification and retrieval tasks, but does not evaluate generative capabilities. This suggests that distribution matching methods may fare well on DataComp, while an evaluation of generative capabilities may prefer data selection methods that favor distribution diversification. DataPerf (Mazumder et al., 2023) provides another entry point for data selection research. The DataPerf benchmark contains 4 data selection settings: vision, speech, debugging, and language. The various settings allow researchers to test their algorithms for selection, augmentation, quality assessment, and cleaning. |
| Dataset Splits | No | This paper is a survey and review of existing literature on data selection methods. It does not conduct its own experiments or define dataset splits for its own work. While it mentions other papers that use splits (e.g., "Lee et al. (2022a) remove only the exact substring..." and "use a temporal split when dividing data into training and evaluation"), this information pertains to external research and not the methodology of this specific paper. |
| Hardware Specification | No | The paper is a survey of existing literature and does not describe hardware used for its own experimental methodology. While it mentions hardware costs in a general sense, e.g., "For reference, training a 1 billion parameter model on 50 billion tokens takes roughly 5 days on 8 A100 GPUs using GPT-NeoX (Andonian et al., 2023; Albalak et al., 2023a).", this is an illustrative example of compute for *other* models, not hardware used by the authors for this paper's research. |
| Software Dependencies | No | This paper is a survey and does not present its own methodology requiring specific software dependencies with version numbers for replication. It mentions various tools like 'fastText', 'langdetect', 'cld3', the 'CCNet pipeline', the 'Dolma toolkit', 'DataTrove', 'DSIR', the 'ExactSubstr algorithm', and 'SemDeDup', but these are tools and frameworks used or described in *other* works, not dependencies for replicating the specific research presented in this survey paper. |
| Experiment Setup | No | This paper is a survey and review of existing literature, and as such, it does not detail an experimental setup, hyperparameters, or training configurations for its own research. |
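The Open Datasets row above describes pretraining corpora being cleaned through "a series of filters" that retain only data deemed high-quality. A minimal sketch of that style of pipeline is shown below, loosely modeled on the C4 heuristics (Raffel et al., 2020: terminal punctuation, minimum line length, boilerplate markers) combined with naive exact-match line deduplication; the specific thresholds and marker list here are illustrative assumptions, not the exact published C4 values.

```python
# Illustrative heuristic quality-filtering pipeline for web-crawled text,
# loosely in the spirit of C4 (Raffel et al., 2020). Thresholds are assumptions.

BAD_MARKERS = ("lorem ipsum", "{")  # boilerplate / code markers that reject a page


def keep_line(line: str) -> bool:
    """Keep a line only if it looks like natural prose."""
    line = line.strip()
    if len(line.split()) < 5:                    # too short to be a sentence
        return False
    if not line.endswith((".", "!", "?", '"')):  # require terminal punctuation
        return False
    return True


def keep_page(text: str) -> bool:
    """Page-level filter: drop pages with boilerplate markers or too little prose."""
    lowered = text.lower()
    if any(marker in lowered for marker in BAD_MARKERS):
        return False
    kept = [ln for ln in text.splitlines() if keep_line(ln)]
    return len(kept) >= 3                        # require a minimum of retained lines


def filter_corpus(pages):
    """Yield cleaned pages, removing exact-duplicate lines across the corpus."""
    seen = set()  # naive exact-match dedup; at scale, pipelines use
    for page in pages:  # suffix arrays (ExactSubstr) or MinHash instead
        if not keep_page(page):
            continue
        lines = []
        for line in page.splitlines():
            if keep_line(line) and line not in seen:
                seen.add(line)
                lines.append(line)
        if lines:
            yield "\n".join(lines)
```

Real pipelines such as CCNet, Dolma, or DataTrove layer many more stages (language identification, model-based quality scoring, fuzzy deduplication) on top of heuristics like these; this sketch only shows the basic page- and line-level structure.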