LASP: Linear Attention Sequence Parallelism

Authors: Weigao Sun, Zhen Qin, Dong Li, Xuyang Shen, Yu Qiao, Yiran Zhong

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on linear attention-based models are conducted with varying sequence lengths from 2K to 4096K. LASP scales sequence length up to 4096K on 128 GPUs, which is 8× longer than existing SP methods. Code is available at: https://github.com/OpenNLPLab/LASP. ... Ablation Study ... Scalability and Speed Comparison ... Convergence Performance of LASP.
Researcher Affiliation | Collaboration | Weigao Sun (EMAIL), Shanghai AI Laboratory; Zhen Qin (EMAIL), TapTap; Dong Li (EMAIL), Shanghai AI Laboratory; Xuyang Shen (EMAIL), Shanghai AI Laboratory; Yu Qiao (EMAIL), Shanghai AI Laboratory; Yiran Zhong (EMAIL), Shanghai AI Laboratory
Pseudocode | Yes | Algorithm 1: LASP Data Distribution; Algorithm 2: LASP Forward Pass; Algorithm 3: LASP Backward Pass
Open Source Code | Yes | Code is available at: https://github.com/OpenNLPLab/LASP.
Open Datasets | Yes | The experiments were conducted using the same training corpus: the Pile (Gao et al., 2020).
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits. It mentions a batch size for the experiments but gives no details on how the dataset (the Pile) was partitioned into training, validation, and test sets.
Hardware Specification | Yes | All experiments are conducted on a GPU cluster equipped with 128x A100 80G GPUs. The experimental configuration involves a maximum of 16x DGX-A100 servers, each equipped with 8x A100 GPUs.
Software Dependencies | Yes | Experiments are implemented in PyTorch 2.1.1 and Triton 2.0.0 with CUDA 11.7, cuDNN 8.0, and NCCL 2.14.3.
Experiment Setup | Yes | The training configuration uses the following hyperparameters: a learning rate of 0.0005, a cap of 50,000 updates to define the training duration, and a 2,000-update warmup period that stabilizes early training by gradually increasing the learning rate. A weight decay of 0.01 is used for regularization to avoid over-fitting (Sun et al., 2024). The Adam optimizer, with beta values of 0.9 and 0.999, manages the momentum and scaling of the gradients, aiding effective and stable training convergence (Zhou et al., 2020).
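The reported training schedule (learning rate 0.0005, 50,000 total updates, 2,000-update linear warmup) can be sketched as a plain-Python step-to-learning-rate function. This is an illustrative reconstruction, not the authors' code: the report does not state the post-warmup schedule, so a constant rate is assumed here, and the function name `lr_at_step` and its parameters are hypothetical.

```python
def lr_at_step(step, base_lr=0.0005, warmup_steps=2000, total_steps=50000):
    """Learning rate at a given update step.

    Linear warmup from 0 to base_lr over `warmup_steps`, then held
    constant (assumption: the post-warmup schedule is not specified
    in the report). Values match the reported configuration:
    base_lr=0.0005, warmup_steps=2000, total_steps=50000.
    """
    if step >= total_steps:
        raise ValueError("step exceeds the 50,000-update training cap")
    if step < warmup_steps:
        # Warmup ramps the rate up gradually to stabilize early training.
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

With PyTorch (which the paper uses), the same schedule would typically be attached to `torch.optim.Adam(params, lr=0.0005, betas=(0.9, 0.999), weight_decay=0.01)` via a `LambdaLR` scheduler.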