Nesterov Method for Asynchronous Pipeline Parallel Optimization

Authors: Thalaiyasingam Ajanthan, Sameera Ramasinghe, Yan Zuo, Gil Avraham, Alexander Long

ICML 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the merits of our approach on large-scale language modelling tasks... Our experiments clearly demonstrate the feasibility of asynchronous PP optimization in the large-scale setting. 5. Experiments: We evaluate our method on the language modelling task using decoder-only architectures. We use three large-scale datasets: WikiText (WT) (Merity et al., 2016), BookCorpus (BC) (Zhu et al., 2015), and OpenWebText (OWT) (Gokaslan et al., 2019). ... Ablation Study |
| Researcher Affiliation | Industry | Pluralis Research. Correspondence to: Thalaiyasingam Ajanthan <EMAIL>. |
| Pseudocode | No | The paper includes mathematical equations for the Nesterov method but does not present any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/PluralisResearch/AsyncPP. |
| Open Datasets | Yes | We use three large-scale datasets: WikiText (WT) (Merity et al., 2016), BookCorpus (BC) (Zhu et al., 2015), and OpenWebText (OWT) (Gokaslan et al., 2019). |
| Dataset Splits | Yes | For WikiText, we utilize the predefined training and validation splits; for the other datasets, we randomly select 10% of the training set as the held-out validation set. |
| Hardware Specification | Yes | All experiments are performed on a system equipped with 8 A10G GPUs. ... These experiments are performed on a system equipped with 8 A100 GPUs. ... Each worker node is assigned an NVIDIA L4 GPU. |
| Software Dependencies | Yes | In the PyTorch implementation of NAdam (PyTorch Contributors, 2025)... NAdam optimizer, PyTorch 2.5.0 documentation. https://pytorch.org/docs/stable/generated/torch.optim.NAdam.html, 2025. Accessed: 2025-01-16. |
| Experiment Setup | Yes | Across all experiments, we maintain a microbatch size of 8, a learning rate η of 3e-4, and a weight decay of 0.01, unless otherwise specified. ... Each experiment is run for 50k iterations, with a linear warmup of 3k iterations starting from a learning rate of 1e-7. Then, it is decayed to 3e-5 following a cosine decay schedule. ... Our proposed method is denoted as Ours, which employs the NAdam optimizer (Dozat, 2016) with decoupled weight decay and a momentum coefficient β1 of 0.99. |
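The dataset-split protocol quoted above (predefined splits for WikiText; a random 10% held-out validation set for BookCorpus and OpenWebText) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function name and the fixed seed are assumptions for reproducibility of the sketch.

```python
import random

def held_out_split(n_examples, val_frac=0.10, seed=0):
    """Randomly hold out a fraction of training examples for validation.

    Mirrors the reported protocol for BookCorpus and OpenWebText (10%
    of the training set held out at random); WikiText instead uses its
    predefined train/validation splits. The seed is illustrative only.
    Returns (train_indices, val_indices).
    """
    rng = random.Random(seed)
    idx = list(range(n_examples))
    rng.shuffle(idx)
    n_val = int(round(val_frac * n_examples))
    return idx[n_val:], idx[:n_val]
```

Index-level splitting like this keeps the full dataset on disk untouched and makes the held-out set recoverable from the seed alone.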
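The training recipe in the Experiment Setup row (peak learning rate 3e-4, 3k-step linear warmup from 1e-7, cosine decay to 3e-5 over 50k iterations, NAdam with decoupled weight decay and β1 = 0.99) can be sketched as a minimal schedule function. This is a reconstruction from the quoted hyperparameters, not the authors' implementation; β2 is not stated in the table and is assumed to be the PyTorch default of 0.999.

```python
import math

# Hyperparameters as reported: peak LR 3e-4, warmup over 3k steps starting
# from 1e-7, then cosine decay to 3e-5 by step 50k.
PEAK_LR, WARMUP_START_LR, FINAL_LR = 3e-4, 1e-7, 3e-5
WARMUP_STEPS, TOTAL_STEPS = 3_000, 50_000

def lr_at(step):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < WARMUP_STEPS:
        return WARMUP_START_LR + (step / WARMUP_STEPS) * (PEAK_LR - WARMUP_START_LR)
    frac = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1.0 + math.cos(math.pi * frac))

# Reported optimizer settings, expressed as keyword arguments in the shape
# torch.optim.NAdam accepts (beta2 = 0.999 is an assumed default; the
# `decoupled_weight_decay` flag requires PyTorch >= 2.1).
NADAM_CONFIG = dict(
    lr=PEAK_LR,
    betas=(0.99, 0.999),
    weight_decay=0.01,
    decoupled_weight_decay=True,
)
```

Such a schedule is typically wired into training via `torch.optim.lr_scheduler.LambdaLR`, passing a multiplier `lr_at(step) / PEAK_LR` so the optimizer's base rate stays at the peak value.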