LASP: Linear Attention Sequence Parallelism
Authors: Weigao Sun, Zhen Qin, Dong Li, Xuyang Shen, Yu Qiao, Yiran Zhong
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on linear attention-based models are conducted with sequence lengths varying from 2K to 4096K. LASP scales the sequence length up to 4096K on 128 GPUs, which is 8× longer than existing SP methods. Code is available at: https://github.com/OpenNLPLab/LASP. ... Ablation Study ... Scalability and Speed Comparison ... Convergence Performance of LASP. |
| Researcher Affiliation | Collaboration | Weigao Sun (Shanghai AI Laboratory), Zhen Qin (TapTap), Dong Li (Shanghai AI Laboratory), Xuyang Shen (Shanghai AI Laboratory), Yu Qiao (Shanghai AI Laboratory), Yiran Zhong (Shanghai AI Laboratory) |
| Pseudocode | Yes | Algorithm 1: LASP Data Distribution; Algorithm 2: LASP Forward Pass; Algorithm 3: LASP Backward Pass |
| Open Source Code | Yes | Code is available at: https://github.com/OpenNLPLab/LASP. |
| Open Datasets | Yes | The experiments were conducted using the same training corpus: the Pile (Gao et al., 2020). |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits. It mentions a batch size for the experiments but gives no details on how the dataset (the Pile) was partitioned into training, validation, or test sets. |
| Hardware Specification | Yes | All experiments are conducted on a GPU cluster equipped with 128x A100 80G GPUs. Our experimental configuration involves a maximum of 16x DGX-A100 servers, each equipped with 8x A100 GPUs. |
| Software Dependencies | Yes | Experiments are implemented in PyTorch 2.1.1 and Triton 2.0.0 with CUDA 11.7, cuDNN 8.0, and NCCL 2.14.3. |
| Experiment Setup | Yes | The training configuration uses specific hyperparameters: a learning rate of 0.0005 to control the optimization step size, a cap of 50,000 updates to define the training duration, and a 2,000-update warmup period to stabilize early training by gradually increasing the learning rate. Additionally, a weight decay rate of 0.01 is used for regularization to avoid overfitting (Sun et al., 2024). The Adam optimizer, with beta values of 0.9 and 0.999, is chosen to manage the momentum and scaling of gradients, aiding effective and stable training convergence (Zhou et al., 2020). |
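The stated training configuration can be sketched as a simple learning-rate schedule. This is a hedged reconstruction, not the authors' code: the paper states the peak learning rate (0.0005), the warmup length (2,000 updates), and the update cap (50,000), but not the post-warmup shape, so a constant rate after warmup is an assumption here.

```python
# Hedged sketch of the reported training configuration (not the authors' code).
# Stated values: peak LR 0.0005, 2,000-update warmup, 50,000-update cap,
# weight decay 0.01, Adam betas (0.9, 0.999).

PEAK_LR = 5e-4             # learning rate (stated)
WARMUP_STEPS = 2_000       # warmup updates (stated)
TOTAL_STEPS = 50_000       # cap on training updates (stated)
WEIGHT_DECAY = 0.01        # regularization strength (stated)
ADAM_BETAS = (0.9, 0.999)  # Adam momentum/scaling coefficients (stated)

def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then constant (an assumption) until TOTAL_STEPS."""
    if step >= TOTAL_STEPS:
        return 0.0  # training has ended
    if step < WARMUP_STEPS:
        # gradually increase the learning rate over the warmup period
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    return PEAK_LR
```

In a PyTorch setup these constants would typically be passed to `torch.optim.Adam(params, lr=PEAK_LR, betas=ADAM_BETAS, weight_decay=WEIGHT_DECAY)` together with a scheduler implementing `lr_at`.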