Spaced Scheduling for Large Language Model Training

Authors: Amine El Hattami, Nicolas Chapados, Christopher Pal

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on seven LLMs (0.5B to 32B parameters) in the instruction-finetuning (IFT) setting show that Sst consistently outperforms representative state-of-the-art selection approaches such as Deita and InsTag on the Open LLM Leaderboard.
Researcher Affiliation | Collaboration | Amine El Hattami (ServiceNow Research, Mila, Polytechnique Montréal), Nicolas Chapados (ServiceNow Research), Christopher Pal (ServiceNow Research, Mila, Polytechnique Montréal, Canada CIFAR AI Chair)
Pseudocode | Yes | Algorithm 1 Spaced Scheduled Training (Sst) (full version in Algorithm 2).
Open Source Code | Yes | We release our training code, trained models, and data mixes in our public repository: https://github.com/Am1n3e/sst
Open Datasets | Yes | We use a stratified subsample of 100k examples from the recent Tulu 3 SFT Mix (Lambert et al., 2024) containing 15 datasets across diverse tasks and domains. These include FLAN v2 (Longpre et al., 2023), No Robots (Rajani et al., 2023), OpenAssistant (Köpf et al., 2023), Tulu 3 Persona MATH, Tulu 3 Persona GSM, Tulu 3 Persona Python, Tulu 3 Persona Algebra, Tulu 3 Persona IF (Lambert et al., 2024), NuminaMath-TIR (LI et al., 2024), Aya (Singh et al., 2024), WildChat GPT-4 (Zhao et al., 2024), TableGPT (Li et al., 2023), SciRIFF (Köpf et al., 2023), Evol CodeAlpaca (Luo et al., 2023).
Dataset Splits | Yes | Each method selects 30k examples from a 100k data pool, as our findings in Section 3 indicate that this subset is sufficient to match the performance of using the full dataset.
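The "stratified subsample of 100k examples" mentioned above can be sketched in plain Python. This is an illustrative sketch, not the authors' released code: the `src` field name and the proportional per-stratum allocation are assumptions.

```python
import random
from collections import defaultdict

def stratified_subsample(examples, key, n_total, seed=0):
    """Draw n_total examples, allocating slots to each stratum
    (e.g. source dataset) in proportion to its share of the pool."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for ex in examples:
        by_stratum[ex[key]].append(ex)
    sampled = []
    for items in by_stratum.values():
        # Proportional quota, with at least one example per stratum.
        k = max(1, round(n_total * len(items) / len(examples)))
        sampled.extend(rng.sample(items, min(k, len(items))))
    rng.shuffle(sampled)
    return sampled[:n_total]

# Example: a pool dominated 80/20 by two sources keeps that ratio.
pool = [{"src": "a", "i": i} for i in range(800)] + \
       [{"src": "b", "i": i} for i in range(200)]
subset = stratified_subsample(pool, "src", 100)
```

A per-method selector (Sst, Deita, InsTag, etc.) would then pick its 30k examples from the resulting 100k pool.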
Hardware Specification | Yes | All models were trained on 8 NVIDIA H100 GPUs using FlashAttention-2 (Dao, 2024) and DeepSpeed ZeRO Stage 3 (Rajbhandari et al., 2019).
Software Dependencies | Yes | All models were trained on 8 NVIDIA H100 GPUs using FlashAttention-2 (Dao, 2024) and DeepSpeed ZeRO Stage 3 (Rajbhandari et al., 2019).
Experiment Setup | Yes | We perform full-parameter training for two epochs with an effective batch size of 128 and a learning rate (LR) of 5e-6, using a linear LR scheduler with a 3% warm-up ratio. We set the maximum sequence length to 2048.
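The linear schedule with a 3% warm-up ratio described above can be sketched as a step-to-LR function. This is a minimal sketch of the standard "linear warm-up, then linear decay to zero" scheme; the step counts below are illustrative, and the authors' actual scheduler implementation may differ in details.

```python
def linear_lr_with_warmup(step, total_steps, peak_lr=5e-6, warmup_ratio=0.03):
    """Linear warm-up from 0 to peak_lr over the first warmup_ratio of
    training, then linear decay back to 0 at total_steps."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

For example, with 60k selected training examples, an effective batch size of 128, and two epochs, `total_steps` would be roughly `2 * 60_000 // 128` optimizer steps, of which the first 3% ramp the LR up to 5e-6.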