Small-to-Large Generalization: Training Data Influences Models Consistently Across Scale

Authors: Alaa Khaddaj, Logan Engstrom, Aleksander Madry

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type | Experimental | After training language models (LMs) on a diverse set of training data distributions at different scales, we find that the answer is nuanced. On one hand, the choice of training data distribution generally affects model predictions (very) similarly across compute scale (down to proxy models 175× smaller than the large-scale reference model, cf. Figure 1). Indeed, this relationship holds even when proxy models are so small that their predictions are no more accurate than random guessing. To measure this, we compare how changes in training data distribution affect large-scale model predictions versus those of small-scale proxy models trained on the same data distributions. Correlating these differences across a diverse set of training data distributions, we find that training data generally influences model predictions similarly across scale, but that the degree of correlation depends on both the exact choice of test distribution and the proxy model scale. In what follows, we first describe our experimental setup, then detail results (see Appendix B for additional details).
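The comparison described above — correlating how loss changes on proxy models track loss changes on the reference model across training distributions — can be sketched as follows. All data here is synthetic and purely illustrative (the variable names and the toy noise model are our assumptions, not the paper's):

```python
import numpy as np

# Hypothetical test losses for a large reference model and a small proxy
# model, each trained on the same 10 training data distributions.
rng = np.random.default_rng(0)
reference_losses = rng.normal(3.0, 0.2, size=10)
# In this toy example, proxy losses track the reference losses up to noise.
proxy_losses = reference_losses + rng.normal(0.0, 0.05, size=10)

# Influence of each training distribution, measured as the loss change
# relative to training on a default (here: the first) distribution.
ref_deltas = reference_losses - reference_losses[0]
proxy_deltas = proxy_losses - proxy_losses[0]

# Correlate the two sets of loss changes across distributions; a high
# correlation means training data influences both scales similarly.
corr = np.corrcoef(ref_deltas, proxy_deltas)[0, 1]
print(f"proxy/reference correlation: {corr:.2f}")
```

The paper's finding is that this correlation is generally high, but its strength varies with the test distribution and the proxy model scale.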
Researcher Affiliation | Academia | Alaa Khaddaj (EMAIL, MIT); Logan Engstrom (EMAIL, MIT); Aleksander Madry (EMAIL, MIT)
Pseudocode | Yes | Algorithm 1: Computing the datamodel vector w_DM. Algorithm 2: Approximating the datamodel vector using TRAK for multi-class classification. Algorithm 3: Dataset selection using datamodels (DSDM).
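As a rough illustration of the datamodel idea behind these algorithms (a sketch of the general technique, not the authors' implementation): fit a linear surrogate mapping training-subset inclusion masks to observed model loss, then use the learned per-example weights to select training data. The sizes, noise model, and selection threshold below are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_subsets = 50, 200

# Random inclusion masks: which training examples each retrained model saw.
masks = rng.integers(0, 2, size=(n_subsets, n_train)).astype(float)

# Hypothetical "true" per-example influences, used only to simulate the
# losses that would come from actually retraining on each subset.
true_w = rng.normal(0.0, 1.0, size=n_train)
losses = masks @ true_w + rng.normal(0.0, 0.1, size=n_subsets)

# Datamodel vector: least-squares fit of loss against inclusion masks.
w_dm, *_ = np.linalg.lstsq(masks, losses, rcond=None)

# DSDM-style selection: keep the k examples whose inclusion is predicted
# to lower the target loss the most (most negative weights).
k = 10
selected = np.argsort(w_dm)[:k]
print("selected examples:", selected)
```

In practice, TRAK replaces the expensive retraining-based regression with a gradient-based approximation, which is what makes datamodel estimation tractable at scale.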
Open Source Code | No | The paper mentions using the 'llm-foundry repository (Mosaic ML, 2023b)' and integrating the 'µP GitHub library' as third-party tools. It also points to 'https://github.com/MadryLab/DsDm' for data. However, there is no explicit statement or link indicating that the authors' own code for the methodologies described in this paper is publicly available or released.
Open Datasets | Yes | We measure how model behavior changes across 10 separate training distributions: 6 data-sources (i.e., sampled from a single data source like Wikipedia (Foundation, 2022)) and 4 selection-induced distributions (i.e., data selected with one of three dataset selection methods: DSDM (Engstrom et al., 2024), DSIR (Xie et al., 2023b), and the classifier-based approach (Brown et al., 2020), using various target tasks). After training (separate) models on each of these training datasets, we compare the resulting model behavior (losses) on 6 test datasets: C4 (Raffel et al., 2020), the Pile (Gao et al., 2020), SQuAD (Rajpurkar et al., 2016), LAMBADA (Paperno et al., 2016), HellaSwag (Zellers et al., 2018), and TriviaQA (Joshi et al., 2017). The dataset we consider for the vision setting is the CIFAR-10 dataset (Krizhevsky, 2009). We study how well datamodels computed from smaller proxy models approximate the actual loss of the reference model in two supervised computer vision settings: ImageNet-1k (Russakovsky et al., 2015) and CIFAR-10 (Krizhevsky, 2009).
Dataset Splits | Yes | SQuAD: Similar to (Engstrom et al., 2024), we split the dataset into a holdout set of 10,557 samples (corresponding to the SQuAD validation set) and a target set of 23,107 examples (corresponding to 25% of the SQuAD training set). LAMBADA: Similar to (Engstrom et al., 2024), we split the dataset into a holdout set of 2,570 samples and a target set of 2,577 samples. Random: We remove at random up to 10% of the training examples. Same Class: For each test example, we remove at random up to {25%, 50%, 75%} of the training examples from the same class.
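A deterministic holdout/target split of the kind described above can be sketched as follows. The helper name, seed, and use of LAMBADA's stated sizes are our assumptions for illustration, not the authors' code:

```python
import random

def split_holdout_target(examples, holdout_size, seed=0):
    """Shuffle deterministically, then carve off a holdout set; the
    remaining examples form the target set (a sketch, not the paper's split)."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    return shuffled[:holdout_size], shuffled[holdout_size:]

# Toy stand-in for LAMBADA's 5,147 examples (2,570 holdout / 2,577 target).
examples = list(range(5147))
holdout, target = split_holdout_target(examples, holdout_size=2570)
print(len(holdout), len(target))  # 2570 2577
```

Fixing the seed makes the split reproducible across runs, which is what allows holdout/target sizes like those reported above to be stated exactly.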
Hardware Specification | No | The paper refers to 'available, academic-level compute budget' and 'compute scale' but does not specify any particular GPU models, CPU processors, or detailed hardware configurations used for running the experiments. Table 9 presents compute requirements in FLOPS but does not list specific hardware.
Software Dependencies | No | The paper mentions the 'GPT-NeoX tokenizer (Andonian et al., 2023)', the 'llm-foundry repository (Mosaic ML, 2023b)', and the 'µP framework (Yang et al., 2022)' without providing specific version numbers for these software components or libraries.
Experiment Setup | Yes | Our proxy models range in size from 40M parameters to 760M parameters, with each model training on a number of tokens determined by Chinchilla-optimal token-to-parameter ratios (Kaplan et al., 2020). We use the llm-foundry repository (Mosaic ML, 2023b) for training and evaluating our models. We train all our models using the same set of hyperparameters, presented in Table 16. To ensure that our hyperparameters are compatible with all our models of different sizes, we leverage the µP framework (Yang et al., 2022) in our implementation. Vision hyperparameters (CIFAR (Krizhevsky, 2009) / ImageNet (Krizhevsky et al., 2012)): Optimizer SGD; LR Scheduler One Cycle; Max LR 0.1 / 0.5; Initial LR 0.001 / 0.005; LR Decay Linear / Cosine; Warmup 5%; Epochs 30 / 20; Batch Size 512; Weight Decay 0.0005. When training our models, we pack the tokens from our pre-tokenized dataset into samples of context length 2,048. For the rest of the training hyperparameters, we keep the original values used in the GitHub repository.
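The vision hyperparameters quoted above can be transcribed into a plain config, together with a minimal sketch of a one-cycle learning-rate schedule (the dict layout and the piecewise-linear schedule function are our assumptions; llm-foundry and the vision training code may implement the cycle differently):

```python
# Vision hyperparameters as listed above; structure is ours, values are quoted.
VISION_HPARAMS = {
    "cifar10": {"optimizer": "SGD", "max_lr": 0.1, "initial_lr": 0.001,
                "lr_decay": "linear", "warmup_frac": 0.05, "epochs": 30,
                "batch_size": 512, "weight_decay": 5e-4},
    "imagenet": {"optimizer": "SGD", "max_lr": 0.5, "initial_lr": 0.005,
                 "lr_decay": "cosine", "warmup_frac": 0.05, "epochs": 20,
                 "batch_size": 512, "weight_decay": 5e-4},
}

def one_cycle_lr(step, total_steps, initial_lr, max_lr, warmup_frac):
    """One-cycle schedule sketch: warm up linearly to max_lr over the first
    warmup_frac of steps, then decay linearly back to initial_lr."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return initial_lr + (max_lr - initial_lr) * step / warmup_steps
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr - (max_lr - initial_lr) * frac

hp = VISION_HPARAMS["cifar10"]
print(one_cycle_lr(0, 1000, hp["initial_lr"], hp["max_lr"], hp["warmup_frac"]))
```

The schedule peaks at `max_lr` exactly when warmup ends and returns to `initial_lr` at the final step, matching the Max LR / Initial LR pairing in the table.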