Predictive Data Selection: The Data That Predicts Is the Data That Teaches

Authors: Kashun Shum, Yuzhen Huang, Hongjian Zou, Qi Ding, Yixuan Liao, Xiaoxin Chen, Qian Liu, Junxian He

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through comprehensive experiments with 1B and 3B parameter models, we demonstrate that models trained on 30B tokens selected with PRESELECT surpass the performance of the vanilla baseline trained on 300B tokens, achieving a 10x reduction in compute requirements. Furthermore, PRESELECT significantly outperforms other competitive data selection baselines..." (Section 3, Experiments)
Researcher Affiliation | Collaboration | "1 HKUST, 2 Vivo AI Lab. Correspondence to: Kashun Shum <EMAIL>, Junxian He <EMAIL>."
Pseudocode | No | The paper describes the method conceptually and with equations, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "We open-source our trained data selection scorer along with the curated datasets at https://github.com/hkust-nlp/PreSelect."
Open Datasets | Yes | "We open-source our trained data selection scorer along with the curated datasets at https://github.com/hkust-nlp/PreSelect. For a fair comparison and ease of processing, we follow and directly use the large data pool created in Li et al. (2024a). Concretely, this data pool utilizes a version of RefinedWeb that undergoes resiliparse text extraction (Bevendorff et al., 2018), RefinedWeb's heuristic filtering (Penedo et al., 2024d), and deduplication using bloom filters (Soldaini et al., 2024), filtered solely from Common Crawl. We also apply PRESELECT on the C4 dataset to show the effectiveness of our method on different corpora and to compare with more data selection baselines. For RefinedWeb, we use a version (https://data.commoncrawl.org/contrib/datacomp/DCLM-refinedweb/index.html) that has undergone heuristic filtering and deduplication following DCLM, and randomly sample a subset of the needed number of tokens, sampled evenly from each global index and local index. For C4, we use the whole dataset, which is about 198B tokens."
Dataset Splits | Yes | "Following the suggested setting in Li et al. (2024a), we randomly sample 80 billion, 300 billion, and 1 trillion tokens as the data selection pool for 400M, 1B, and 3B model training, respectively, and select 10% of the data for training unless otherwise specified, which corresponds to the Chinchilla-optimal training data size (Hoffmann et al., 2024)."
Hardware Specification | Yes | "Resource: For our pre-training, we mainly use 8 H800 GPUs (1 node) for training 1B models, while for some relatively large experiments, such as 3B models, we use 4 nodes × 8 H800 GPUs with distributed training."
Software Dependencies | No | "Framework: For our main experiments on RefinedWeb, we follow MAP-NEO (Zhang et al., 2024a) and adapt a Megatron-based (Shoeybi et al., 2020) training framework, which allows us to efficiently train models of different sizes on a single node or multiple nodes. Since the largest model size is 3B, we do not have to use any tensor parallelism or pipeline parallelism. For the experiments on C4, following MATES, we use litgpt (AI, 2023), the training framework used by TinyLlama (Zhang et al., 2024b)." The paper mentions software frameworks but does not provide specific version numbers for these or other ancillary software components.
Experiment Setup | Yes | "For pretraining hyperparameters, consistent with many open-sourced models, we use a batch size of 1,048,576 tokens (4,096 context length × 256 global batch size). The widely used AdamW optimizer and a cosine-decay learning rate scheduler are also used. For Pythia models, we try our best to keep the same training setting as MATES (Yu et al., 2024). The detailed training hyperparameters are listed in Table 10 below."
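The 10x compute-reduction claim quoted in the Research Type row follows from the token counts alone; below is a back-of-envelope check using the common C ≈ 6·N·D training-FLOPs approximation. The approximation and the helper function are this sketch's assumptions, not claims from the paper.

```python
# Rough training-compute comparison using the standard C ~ 6 * N * D
# approximation (N = parameters, D = training tokens). The approximation
# itself is an assumption of this sketch, not stated in the paper.
def train_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

baseline = train_flops(1e9, 300e9)   # vanilla baseline: 1B model, 300B tokens
preselect = train_flops(1e9, 30e9)   # PreSelect run:    1B model, 30B tokens
print(baseline / preselect)          # -> 10.0
```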
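The pool sizes and flat 10% selection rate quoted in the Dataset Splits row determine the selected-token budgets directly; a minimal sketch of that arithmetic:

```python
# Selected training tokens implied by the quoted 10% selection rate.
pool_tokens = {"400M": 80e9, "1B": 300e9, "3B": 1e12}
selection_rate = 0.10
selected = {model: pool * selection_rate for model, pool in pool_tokens.items()}

for model, toks in selected.items():
    print(f"{model}: {toks / 1e9:.0f}B tokens")  # 8B, 30B, and 100B tokens
```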
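The batch-size arithmetic in the Experiment Setup row can be verified directly, and the cosine-decay schedule with warmup can be sketched concretely. The peak learning rate, minimum learning rate, and warmup length below are illustrative placeholders; the paper's actual values are in its Table 10.

```python
import math

# Tokens per optimizer step: context length x global batch size.
context_len, global_batch = 4096, 256
tokens_per_step = context_len * global_batch
assert tokens_per_step == 1_048_576  # matches the quoted batch size

# Cosine-decay LR schedule with linear warmup. The specific peak/min LR
# and warmup length are hypothetical, not the paper's Table 10 values.
def cosine_lr(step, total_steps, peak=3e-4, minimum=3e-5, warmup=2000):
    if step < warmup:
        return peak * step / warmup  # linear warmup to the peak LR
    progress = (step - warmup) / max(1, total_steps - warmup)
    return minimum + 0.5 * (peak - minimum) * (1 + math.cos(math.pi * progress))
```

At the end of warmup the schedule sits at the peak LR and decays smoothly to the minimum at the final step.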