LASP: Linear Attention Sequence Parallelism

Authors: Weigao Sun, Zhen Qin, Dong Li, Xuyang Shen, Yu Qiao, Yiran Zhong

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on linear attention-based models are conducted with varying sequence lengths from 2K to 4096K. LASP scales sequence length up to 4096K on 128 GPUs, which is 8× longer than existing SP methods. Code is available at: https://github.com/OpenNLPLab/LASP. ... Ablation Study ... Scalability and Speed Comparison ... Convergence Performance of LASP.
Researcher Affiliation | Collaboration | Weigao Sun (EMAIL), Shanghai AI Laboratory; Zhen Qin (EMAIL), TapTap; Dong Li (EMAIL), Shanghai AI Laboratory; Xuyang Shen (EMAIL), Shanghai AI Laboratory; Yu Qiao (EMAIL), Shanghai AI Laboratory; Yiran Zhong (EMAIL), Shanghai AI Laboratory
Pseudocode | Yes | Algorithm 1: LASP Data Distribution; Algorithm 2: LASP Forward Pass; Algorithm 3: LASP Backward Pass
Open Source Code | Yes | Code is available at: https://github.com/OpenNLPLab/LASP.
Open Datasets | Yes | The experiments were conducted using the same training corpus: the Pile (Gao et al., 2020).
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits. It mentions a batch size for the experiments but gives no details on how the dataset (the Pile) was partitioned into training, validation, and test sets.
Hardware Specification | Yes | All experiments are conducted on a GPU cluster equipped with 128x A100 80G GPUs. The experimental configuration involves a maximum of 16x DGX-A100 servers, each equipped with 8x A100 GPUs.
Software Dependencies | Yes | Experiments are implemented in PyTorch 2.1.1 and Triton 2.0.0 with CUDA 11.7, cuDNN 8.0, and NCCL 2.14.3.
Experiment Setup | Yes | The training configuration uses the following hyperparameters: a learning rate of 0.0005, a cap of 50,000 updates to define the training duration, and a 2,000-update warmup period that stabilizes early training by gradually increasing the learning rate. A weight decay of 0.01 is used for regularization to avoid over-fitting (Sun et al., 2024). The Adam optimizer, with beta values of 0.9 and 0.999, manages the momentum and scaling of the gradients, aiding effective and stable training convergence (Zhou et al., 2020).
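The reported training schedule (learning rate 0.0005, 50,000 total updates, 2,000-update linear warmup) can be sketched as a plain-Python step-to-learning-rate function. This is an illustrative reconstruction, not the authors' code: the report does not state the post-warmup schedule, so a constant rate is assumed here, and the function name `lr_at_step` and its parameters are hypothetical.

```python
def lr_at_step(step, base_lr=0.0005, warmup_steps=2000, total_steps=50000):
    """Learning rate at a given update step.

    Linear warmup from 0 to base_lr over `warmup_steps`, then held
    constant (assumption: the post-warmup schedule is not specified
    in the report). Values match the reported configuration:
    base_lr=0.0005, warmup_steps=2000, total_steps=50000.
    """
    if step >= total_steps:
        raise ValueError("step exceeds the 50,000-update training cap")
    if step < warmup_steps:
        # Warmup ramps the rate up gradually to stabilize early training.
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

With PyTorch (which the paper uses), the same schedule would typically be attached to `torch.optim.Adam(params, lr=0.0005, betas=(0.9, 0.999), weight_decay=0.01)` via a `LambdaLR` scheduler.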