Nesterov Method for Asynchronous Pipeline Parallel Optimization

Authors: Thalaiyasingam Ajanthan, Sameera Ramasinghe, Yan Zuo, Gil Avraham, Alexander Long

ICML 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the merits of our approach on large-scale language modelling tasks... Our experiments clearly demonstrate the feasibility of asynchronous PP optimization in the large-scale setting. 5. Experiments: We evaluate our method on the language modelling task using decoder-only architectures. We use three large-scale datasets: WikiText (WT) (Merity et al., 2016), BookCorpus (BC) (Zhu et al., 2015), and OpenWebText (OWT) (Gokaslan et al., 2019). ... Ablation Study |
| Researcher Affiliation | Industry | Pluralis Research. Correspondence to: Thalaiyasingam Ajanthan <EMAIL>. |
| Pseudocode | No | The paper includes mathematical equations for the Nesterov method but does not present any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/PluralisResearch/AsyncPP. |
| Open Datasets | Yes | We use three large-scale datasets: WikiText (WT) (Merity et al., 2016), BookCorpus (BC) (Zhu et al., 2015), and OpenWebText (OWT) (Gokaslan et al., 2019). |
| Dataset Splits | Yes | For WikiText, we utilize the predefined training and validation splits; for the other datasets, we randomly select 10% of the training set as the held-out validation set. |
| Hardware Specification | Yes | All experiments are performed on a system equipped with 8 A10G GPUs. ... These experiments are performed on a system equipped with 8 A100 GPUs. ... Each worker node is assigned an NVIDIA L4 GPU. |
| Software Dependencies | Yes | In the PyTorch implementation of NAdam (PyTorch Contributors, 2025)... NAdam optimizer, PyTorch 2.5.0 documentation. https://pytorch.org/docs/stable/generated/torch.optim.NAdam.html, 2025. Accessed: 2025-01-16. |
| Experiment Setup | Yes | Across all experiments, we maintain a microbatch size of 8, a learning rate η of 3e-4, and a weight decay of 0.01, unless otherwise specified. ... Each experiment is run for 50k iterations, with a linear warmup of 3k iterations starting from a learning rate of 1e-7. Then, it is decayed to 3e-5 following a cosine decay schedule. ... Our proposed method is denoted as Ours, which employs the NAdam optimizer (Dozat, 2016) with decoupled weight decay and a momentum coefficient β1 of 0.99. |
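The dataset-split protocol quoted above (predefined splits for WikiText; a random 10% held-out validation set for BookCorpus and OpenWebText) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function name and the fixed seed are assumptions for reproducibility of the sketch.

```python
import random

def held_out_split(n_examples, val_frac=0.10, seed=0):
    """Randomly hold out a fraction of training examples for validation.

    Mirrors the reported protocol for BookCorpus and OpenWebText (10%
    of the training set held out at random); WikiText instead uses its
    predefined train/validation splits. The seed is illustrative only.
    Returns (train_indices, val_indices).
    """
    rng = random.Random(seed)
    idx = list(range(n_examples))
    rng.shuffle(idx)
    n_val = int(round(val_frac * n_examples))
    return idx[n_val:], idx[:n_val]
```

Index-level splitting like this keeps the full dataset on disk untouched and makes the held-out set recoverable from the seed alone.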
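The training recipe in the Experiment Setup row (peak learning rate 3e-4, 3k-step linear warmup from 1e-7, cosine decay to 3e-5 over 50k iterations, NAdam with decoupled weight decay and β1 = 0.99) can be sketched as a minimal schedule function. This is a reconstruction from the quoted hyperparameters, not the authors' implementation; β2 is not stated in the table and is assumed to be the PyTorch default of 0.999.

```python
import math

# Hyperparameters as reported: peak LR 3e-4, warmup over 3k steps starting
# from 1e-7, then cosine decay to 3e-5 by step 50k.
PEAK_LR, WARMUP_START_LR, FINAL_LR = 3e-4, 1e-7, 3e-5
WARMUP_STEPS, TOTAL_STEPS = 3_000, 50_000

def lr_at(step):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < WARMUP_STEPS:
        return WARMUP_START_LR + (step / WARMUP_STEPS) * (PEAK_LR - WARMUP_START_LR)
    frac = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1.0 + math.cos(math.pi * frac))

# Reported optimizer settings, expressed as keyword arguments in the shape
# torch.optim.NAdam accepts (beta2 = 0.999 is an assumed default; the
# `decoupled_weight_decay` flag requires PyTorch >= 2.1).
NADAM_CONFIG = dict(
    lr=PEAK_LR,
    betas=(0.99, 0.999),
    weight_decay=0.01,
    decoupled_weight_decay=True,
)
```

Such a schedule is typically wired into training via `torch.optim.lr_scheduler.LambdaLR`, passing a multiplier `lr_at(step) / PEAK_LR` so the optimizer's base rate stays at the peak value.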