Nesterov Method for Asynchronous Pipeline Parallel Optimization
Authors: Thalaiyasingam Ajanthan, Sameera Ramasinghe, Yan Zuo, Gil Avraham, Alexander Long
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the merits of our approach on large-scale language modelling tasks... Our experiments clearly demonstrate the feasibility of asynchronous PP optimization in the large-scale setting. 5. Experiments: We evaluate our method on the language modelling task using decoder-only architectures. We use three large-scale datasets: WikiText (WT) (Merity et al., 2016), BookCorpus (BC) (Zhu et al., 2015), and OpenWebText (OWT) (Gokaslan et al., 2019) datasets. ... Ablation Study |
| Researcher Affiliation | Industry | ¹Pluralis Research. Correspondence to: Thalaiyasingam Ajanthan <EMAIL>. |
| Pseudocode | No | The paper includes mathematical equations for the Nesterov method but does not present any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/PluralisResearch/AsyncPP. |
| Open Datasets | Yes | We use three large-scale datasets: WikiText (WT) (Merity et al., 2016), BookCorpus (BC) (Zhu et al., 2015), and OpenWebText (OWT) (Gokaslan et al., 2019) datasets. |
| Dataset Splits | Yes | For Wiki Text, we utilize the predefined training and validation splits, for the other datasets, we randomly select 10% of the training set as the held-out validation set. |
| Hardware Specification | Yes | All experiments are performed on a system equipped with 8 A10G GPUs. ... These experiments are performed on a system equipped with 8 A100 GPUs. ... Each worker node is assigned an NVIDIA L4 GPU. |
| Software Dependencies | Yes | In the PyTorch implementation of NAdam (PyTorch Contributors, 2025)... NAdam optimizer — PyTorch 2.5.0 documentation. https://pytorch.org/docs/stable/generated/torch.optim.NAdam.html, 2025. Accessed: 2025-01-16. |
| Experiment Setup | Yes | Across all experiments, we maintain a microbatch size of 8, a learning rate η of 3e-4, and a weight decay of 0.01, unless otherwise specified. ... Each experiment is run for 50k iterations, with a linear warmup of 3k iterations starting from a learning rate of 1e-7. Then, it is decayed to 3e-5 following a cosine decay schedule. ... Our proposed method is denoted as Ours, which employs the Nadam optimizer (Dozat, 2016) with decoupled weight decay and a momentum coefficient β1 of 0.99. |
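The paper presents its Nesterov method only as equations, with no pseudocode block (per the Pseudocode row above). For orientation, here is a minimal sketch of the classical Nesterov accelerated gradient update that such methods build on; all names and values are illustrative, not taken from the paper.

```python
# Classical Nesterov accelerated gradient (NAG), one update step.
# Illustrative sketch only; the paper adapts this idea to asynchronous
# pipeline-parallel training, which is not reproduced here.
def nesterov_step(w, v, grad_fn, lr=0.1, mu=0.9):
    """Gradient is evaluated at the look-ahead point w + mu * v."""
    g = grad_fn(w + mu * v)   # look-ahead gradient
    v = mu * v - lr * g       # momentum buffer update
    return w + v, v           # parameter update uses the new buffer
```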
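The learning-rate schedule reported in the Experiment Setup row (3k-step linear warmup from 1e-7 to the peak of 3e-4, then cosine decay to 3e-5 over 50k iterations) can be sketched as a step-to-rate function. This is an assumed reconstruction of the schedule, not the authors' code; in their setup it would drive a PyTorch NAdam optimizer with decoupled weight decay 0.01 and β₁ = 0.99.

```python
import math

# Values taken from the Experiment Setup row; the function shape
# (linear warmup, then cosine decay) is an assumed reconstruction.
PEAK_LR, FLOOR_LR, WARMUP_START_LR = 3e-4, 3e-5, 1e-7
WARMUP_STEPS, TOTAL_STEPS = 3_000, 50_000

def lr_at(step: int) -> float:
    """Learning rate at a given iteration: linear warmup, cosine decay."""
    if step < WARMUP_STEPS:
        frac = step / WARMUP_STEPS
        return WARMUP_START_LR + frac * (PEAK_LR - WARMUP_START_LR)
    frac = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return FLOOR_LR + 0.5 * (PEAK_LR - FLOOR_LR) * (1.0 + math.cos(math.pi * frac))
```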