PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs
Authors: Oskar van der Wal, Pietro Lesci, Max Müller-Eberstein, Naomi Saphra, Hailey Schoelkopf, Willem Zuidema, Stella R Biderman
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using these new 45 training runs, in addition to the 5 already available, we study the effects of different initial conditions determined by the seed (i.e., parameters initialisation and data order) on (i) downstream performance, (ii) learned linguistic representations, and (iii) emergence of training phases. In addition to common scaling behaviours, our analyses generally reveal highly consistent training dynamics across both model sizes and initial conditions. Further, the new seeds for each model allow us to identify outlier training runs and delineate their characteristics. Our findings show the potential of using these methods to predict training stability. |
| Researcher Affiliation | Collaboration | Oskar van der Wal, University of Amsterdam; Pietro Lesci, University of Cambridge; Max Müller-Eberstein, IT University of Copenhagen; Naomi Saphra, Harvard University; Hailey Schoelkopf, Anthropic; Willem Zuidema, University of Amsterdam; Stella Biderman, EleutherAI |
| Pseudocode | No | The paper describes the methodology and analyses results but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides links to third-party codebases used for training and evaluation (GPT-NeoX codebase, LM Evaluation Harness) and mentions releasing model checkpoints and pre-shuffled datasets, but it does not explicitly state that the code for the specific analysis methodology described in the paper is released. |
| Open Datasets | Yes | We introduce the PolyPythias: an extension of the Pythia model suite (Biderman et al., 2023b) trained on the Pile dataset (Gao et al., 2021)... We use the standard (i.e., non-deduplicated) version of the Pile and release the tokenised and pre-shuffled datasets corresponding to the different seeds. More training details are in App. A. The indices used to recreate the pre-shuffled datasets are available at huggingface.co/datasets/EleutherAI/pile-preshuffled-seeds. Additionally, the paper references numerous well-known public benchmarks such as ARC, LAMBADA, LogiQA, PIQA, SciQ, WinoGrande, WSC, BLiMP (Gender Agreement), CrowS-Pairs (Gender), and Simple Co-occurrence Bias, all with proper citations. |
| Dataset Splits | No | The paper describes the Pile dataset as a '300B-token curated collection of English documents' that is 'shuffled and packed into sequences of 2,049 tokens' for training. It mentions that models were evaluated using 'validation loss' but does not explicitly provide specific training/validation/test splits for the Pile dataset or for the downstream tasks, nor does it define how a validation set was derived. |
| Hardware Specification | No | The paper mentions that computational resources were provided by Stability AI and that experiments were supported by the IT University of Copenhagen's High-Performance Computing Cluster. However, it does not provide specific hardware details such as GPU models, CPU models, or memory specifications. |
| Software Dependencies | Yes | We used the v1.0 version of the GPT-NeoX codebase for model training. |
| Experiment Setup | Yes | Training was performed using a cosine learning rate schedule with warm-up, and using a batch size of 1,024 sequences, resulting in exactly 143k optimisation steps. |
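The step count in the Experiment Setup row can be sanity-checked from the other figures the review quotes: ~300B Pile tokens, sequences of 2,049 tokens, and a batch of 1,024 sequences. The arithmetic below is my own back-of-the-envelope check, not taken from the paper; the "exactly 143k" in the paper reflects the actual token count of the shuffled Pile, which the round 300B figure only approximates.

```python
# Back-of-the-envelope check of the reported ~143k optimisation steps.
total_tokens = 300_000_000_000   # ~300B tokens (approximate Pile size)
sequence_length = 2_049          # tokens per packed sequence
batch_size = 1_024               # sequences per optimisation step

tokens_per_step = batch_size * sequence_length  # 2,098,176 tokens/step
steps = total_tokens // tokens_per_step
print(f"{steps:,} steps")  # 142,981 steps, i.e. roughly 143k
```

The result lands within a fraction of a percent of the paper's 143k, which supports the internal consistency of the quoted setup.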
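The Open Datasets row notes that the released artifact is a set of indices used to recreate the pre-shuffled data order for each seed. As an illustrative sketch only (this is not the paper's actual pipeline, and the function name is hypothetical), seed-determined data order amounts to a reproducible permutation of document indices:

```python
# Illustrative sketch, NOT the paper's pipeline: a seed-determined
# data order is just a reproducible permutation of document indices.
import numpy as np

def shuffle_indices(num_documents: int, seed: int) -> np.ndarray:
    """Return a reproducible permutation of document indices for a seed."""
    rng = np.random.default_rng(seed)
    return rng.permutation(num_documents)

# The same seed always reproduces the same data order, which is what
# makes releasing per-seed indices sufficient for reproducibility.
a = shuffle_indices(1_000, seed=1)
b = shuffle_indices(1_000, seed=1)
assert (a == b).all()
assert sorted(a) == list(range(1_000))  # a valid permutation
```

Releasing the resulting index arrays, as the paper does, lets others recreate each run's exact data order without rerunning the shuffling code.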