Spaced Scheduling for Large Language Model Training
Authors: Amine El Hattami, Nicolas Chapados, Christopher Pal
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on seven LLMs (0.5B to 32B parameters) in the instruction-finetuning (IFT) setting show that Sst consistently outperforms representative state-of-the-art selection approaches like Deita and InsTag on the Open LLM Leaderboard. |
| Researcher Affiliation | Collaboration | Amine El Hattami (ServiceNow Research; Mila; Polytechnique Montréal), Nicolas Chapados (ServiceNow Research), Christopher Pal (ServiceNow Research; Mila; Polytechnique Montréal; Canada CIFAR AI Chair) |
| Pseudocode | Yes | Algorithm 1 Spaced Scheduled Training (Sst) (full version in Algorithm 2). |
| Open Source Code | Yes | We release our training code, trained models, and data mixes in our public repository: https://github.com/Am1n3e/sst |
| Open Datasets | Yes | We use a stratified subsample of 100k examples from the recent Tulu 3 SFT Mix (Lambert et al., 2024) containing 15 datasets across diverse tasks and domains. These include FLAN v2 (Longpre et al., 2023), No Robots (Rajani et al., 2023), OpenAssistant (Köpf et al., 2023), Tulu 3 Persona MATH, Tulu 3 Persona GSM, Tulu 3 Persona Python, Tulu 3 Persona Algebra, Tulu 3 Persona IF (Lambert et al., 2024), NuminaMath-TIR (Li et al., 2024), Aya (Singh et al., 2024), WildChat GPT-4 (Zhao et al., 2024), TableGPT (Li et al., 2023), SciRIFF (Köpf et al., 2023), Evol CodeAlpaca (Luo et al., 2023). |
| Dataset Splits | Yes | Each method selects 30k examples from a 100k data pool, as our findings in Section 3 indicate that this subset is sufficient to match the performance of using the full dataset. |
| Hardware Specification | Yes | All models were trained on 8 NVIDIA H100 GPUs using FlashAttention-2 (Dao, 2024) and DeepSpeed ZeRO Stage 3 (Rajbhandari et al., 2019). |
| Software Dependencies | Yes | All models were trained on 8 NVIDIA H100 GPUs using FlashAttention-2 (Dao, 2024) and DeepSpeed ZeRO Stage 3 (Rajbhandari et al., 2019). |
| Experiment Setup | Yes | We perform full-parameter training for two epochs with an effective batch size of 128 and a learning rate (LR) of 5e-06, using a linear LR scheduler with a 3% warm-up ratio. We set the maximum sequence length to 2048. |
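The hyperparameters in the Experiment Setup row imply a concrete optimizer-step schedule. A minimal sketch of that arithmetic, assuming the 30k selected examples from the Dataset Splits row (this is plain arithmetic on the reported values, not the authors' training code):

```python
def schedule(num_examples=30_000, epochs=2, effective_batch=128, warmup_ratio=0.03):
    """Derive total optimizer steps and linear warm-up steps from the reported setup."""
    steps_per_epoch = -(-num_examples // effective_batch)  # ceiling division
    total_steps = steps_per_epoch * epochs
    warmup_steps = int(total_steps * warmup_ratio)
    return total_steps, warmup_steps

total, warmup = schedule()
print(total, warmup)  # 470 total steps, 14 warm-up steps
```

With 30k examples and an effective batch of 128, each epoch is 235 steps, so two epochs give 470 optimizer steps, of which the first ~14 (3%) warm the LR up to 5e-06 before the linear decay.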
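The Open Datasets row describes a stratified 100k subsample drawn across 15 source datasets. A hedged sketch of proportional stratified sampling under that setup (the dataset names and sizes below are invented placeholders, not the actual Tulu 3 SFT Mix proportions):

```python
import random

def stratified_subsample(examples_by_source, target=100_000, seed=0):
    """Sample from each source dataset in proportion to its share of the pool."""
    rng = random.Random(seed)
    total = sum(len(v) for v in examples_by_source.values())
    sample = []
    for name, examples in examples_by_source.items():
        k = round(target * len(examples) / total)  # proportional quota per source
        sample.extend(rng.sample(examples, min(k, len(examples))))
    return sample

# Placeholder pool: three sources with a 6:3:1 size ratio.
pool = {
    "flan_v2": list(range(600)),
    "aya": list(range(300)),
    "sciriff": list(range(100)),
}
subset = stratified_subsample(pool, target=100)
print(len(subset))  # 100, with 60/30/10 drawn per source
```

Keeping quotas proportional preserves the task and domain mix of the full pool, which is the point of stratifying rather than sampling uniformly from the concatenated datasets.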