PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs
Authors: Oskar van der Wal, Pietro Lesci, Max Müller-Eberstein, Naomi Saphra, Hailey Schoelkopf, Willem Zuidema, Stella R Biderman
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using these new 45 training runs, in addition to the 5 already available, we study the effects of different initial conditions determined by the seed (i.e., parameters initialisation and data order) on (i) downstream performance, (ii) learned linguistic representations, and (iii) emergence of training phases. In addition to common scaling behaviours, our analyses generally reveal highly consistent training dynamics across both model sizes and initial conditions. Further, the new seeds for each model allow us to identify outlier training runs and delineate their characteristics. Our findings show the potential of using these methods to predict training stability. |
| Researcher Affiliation | Collaboration | Oskar van der Wal, University of Amsterdam; Pietro Lesci, University of Cambridge; Max Müller-Eberstein, IT University of Copenhagen; Naomi Saphra, Harvard University; Hailey Schoelkopf, Anthropic; Willem Zuidema, University of Amsterdam; Stella Biderman, EleutherAI |
| Pseudocode | No | The paper describes the methodology and analyses results but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides links to third-party codebases used for training and evaluation (GPT-NeoX codebase, LM Evaluation Harness) and mentions releasing model checkpoints and pre-shuffled datasets, but it does not explicitly state that the code for the specific analysis methodology described in the paper is released. |
| Open Datasets | Yes | We introduce the PolyPythias: an extension of the Pythia model suite (Biderman et al., 2023b) trained on the Pile dataset (Gao et al., 2021)... We use the standard (i.e., non-deduplicated) version of the Pile and release the tokenised and pre-shuffled datasets corresponding to the different seeds. More training details are in App. A. The indices used to recreate the pre-shuffled datasets are available at huggingface.co/datasets/EleutherAI/pile-preshuffled-seeds. Additionally, the paper references numerous well-known public benchmarks such as ARC, LAMBADA, LogiQA, PIQA, SciQ, WinoGrande, WSC, BLiMP (Gender Agreement), CrowS-Pairs (Gender), and Simple Co-occurrence Bias, all with proper citations. |
| Dataset Splits | No | The paper describes the Pile dataset as a '300B-token curated collection of English documents' that is 'shuffled and packed into sequences of 2,049 tokens' for training. It mentions that models were evaluated using 'validation loss' but does not explicitly provide specific training/validation/test splits for the Pile dataset or for the downstream tasks, nor does it define how a validation set was derived. |
| Hardware Specification | No | The paper mentions that computational resources were provided by Stability AI and that experiments were supported by the IT University of Copenhagen's High-Performance Computing Cluster. However, it does not provide specific hardware details such as GPU models, CPU models, or memory specifications. |
| Software Dependencies | Yes | We used the v1.0 version of the GPT-NeoX codebase for model training. |
| Experiment Setup | Yes | Training was performed using a cosine learning rate schedule with warm-up, and using a batch size of 1,024 sequences, resulting in exactly 143k optimisation steps. |
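The step count in the Experiment Setup row can be sanity-checked from the other figures the review quotes: ~300B Pile tokens, sequences of 2,049 tokens, and a batch of 1,024 sequences. The arithmetic below is my own back-of-the-envelope check, not taken from the paper; the "exactly 143k" in the paper reflects the actual token count of the shuffled Pile, which the round 300B figure only approximates.

```python
# Back-of-the-envelope check of the reported ~143k optimisation steps.
total_tokens = 300_000_000_000   # ~300B tokens (approximate Pile size)
sequence_length = 2_049          # tokens per packed sequence
batch_size = 1_024               # sequences per optimisation step

tokens_per_step = batch_size * sequence_length  # 2,098,176 tokens/step
steps = total_tokens // tokens_per_step
print(f"{steps:,} steps")  # 142,981 steps, i.e. roughly 143k
```

The result lands within a fraction of a percent of the paper's 143k, which supports the internal consistency of the quoted setup.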
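The Open Datasets row notes that the released artifact is a set of indices used to recreate the pre-shuffled data order for each seed. As an illustrative sketch only (this is not the paper's actual pipeline, and the function name is hypothetical), seed-determined data order amounts to a reproducible permutation of document indices:

```python
# Illustrative sketch, NOT the paper's pipeline: a seed-determined
# data order is just a reproducible permutation of document indices.
import numpy as np

def shuffle_indices(num_documents: int, seed: int) -> np.ndarray:
    """Return a reproducible permutation of document indices for a seed."""
    rng = np.random.default_rng(seed)
    return rng.permutation(num_documents)

# The same seed always reproduces the same data order, which is what
# makes releasing per-seed indices sufficient for reproducibility.
a = shuffle_indices(1_000, seed=1)
b = shuffle_indices(1_000, seed=1)
assert (a == b).all()
assert sorted(a) == list(range(1_000))  # a valid permutation
```

Releasing the resulting index arrays, as the paper does, lets others recreate each run's exact data order without rerunning the shuffling code.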