Spaced Scheduling for Large Language Model Training

Authors: Amine El Hattami, Nicolas Chapados, Christopher Pal

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on seven LLMs (0.5B to 32B parameters) in the instruction-finetuning (IFT) setting show that Sst consistently outperforms representative state-of-the-art selection approaches such as Deita and InsTag on the Open LLM Leaderboard.
Researcher Affiliation | Collaboration | Amine El Hattami (ServiceNow Research, Mila, Polytechnique Montréal), Nicolas Chapados (ServiceNow Research), Christopher Pal (ServiceNow Research, Mila, Polytechnique Montréal, Canada CIFAR AI Chair)
Pseudocode | Yes | Algorithm 1 Spaced Scheduled Training (Sst) (full version in Algorithm 2).
Open Source Code | Yes | We release our training code, trained models, and data mixes in our public repository: https://github.com/Am1n3e/sst
Open Datasets | Yes | We use a stratified subsample of 100k examples from the recent Tulu 3 SFT Mix (Lambert et al., 2024) containing 15 datasets across diverse tasks and domains. These include FLAN v2 (Longpre et al., 2023), No Robots (Rajani et al., 2023), OpenAssistant (Köpf et al., 2023), Tulu 3 Persona MATH, Tulu 3 Persona GSM, Tulu 3 Persona Python, Tulu 3 Persona Algebra, Tulu 3 Persona IF (Lambert et al., 2024), NuminaMath-TIR (LI et al., 2024), Aya (Singh et al., 2024), WildChat GPT-4 (Zhao et al., 2024), TableGPT (Li et al., 2023), SciRIFF (Köpf et al., 2023), Evol CodeAlpaca (Luo et al., 2023).
Dataset Splits | Yes | Each method selects 30k examples from a 100k data pool, as our findings in Section 3 indicate that this subset is sufficient to match the performance of using the full dataset.
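The "stratified subsample of 100k examples" mentioned above can be sketched in plain Python. This is an illustrative sketch, not the authors' released code: the `src` field name and the proportional per-stratum allocation are assumptions.

```python
import random
from collections import defaultdict

def stratified_subsample(examples, key, n_total, seed=0):
    """Draw n_total examples, allocating slots to each stratum
    (e.g. source dataset) in proportion to its share of the pool."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for ex in examples:
        by_stratum[ex[key]].append(ex)
    sampled = []
    for items in by_stratum.values():
        # Proportional quota, with at least one example per stratum.
        k = max(1, round(n_total * len(items) / len(examples)))
        sampled.extend(rng.sample(items, min(k, len(items))))
    rng.shuffle(sampled)
    return sampled[:n_total]

# Example: a pool dominated 80/20 by two sources keeps that ratio.
pool = [{"src": "a", "i": i} for i in range(800)] + \
       [{"src": "b", "i": i} for i in range(200)]
subset = stratified_subsample(pool, "src", 100)
```

A per-method selector (Sst, Deita, InsTag, etc.) would then pick its 30k examples from the resulting 100k pool.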
Hardware Specification | Yes | All models were trained on 8 NVIDIA H100 GPUs using FlashAttention-2 (Dao, 2024) and DeepSpeed ZeRO Stage 3 (Rajbhandari et al., 2019).
Software Dependencies | Yes | All models were trained on 8 NVIDIA H100 GPUs using FlashAttention-2 (Dao, 2024) and DeepSpeed ZeRO Stage 3 (Rajbhandari et al., 2019).
Experiment Setup | Yes | We perform full-parameter training for two epochs with an effective batch size of 128 and a learning rate (LR) of 5e-6, using a linear LR scheduler with a 3% warm-up ratio. We set the maximum sequence length to 2048.
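The linear schedule with a 3% warm-up ratio described above can be sketched as a step-to-LR function. This is a minimal sketch of the standard "linear warm-up, then linear decay to zero" scheme; the step counts below are illustrative, and the authors' actual scheduler implementation may differ in details.

```python
def linear_lr_with_warmup(step, total_steps, peak_lr=5e-6, warmup_ratio=0.03):
    """Linear warm-up from 0 to peak_lr over the first warmup_ratio of
    training, then linear decay back to 0 at total_steps."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

For example, with 60k selected training examples, an effective batch size of 128, and two epochs, `total_steps` would be roughly `2 * 60_000 // 128` optimizer steps, of which the first 3% ramp the LR up to 5e-6.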