DSP: Dynamic Sequence Parallelism for Multi-Dimensional Transformers

Authors: Xuanlei Zhao, Shenggan Cheng, Chang Chen, Zangwei Zheng, Ziming Liu, Zheming Yang, Yang You

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments demonstrate DSP's superiority over state-of-the-art sequence parallelism methods by remarkable throughput improvements ranging from 32.2% to 10x, with at least 50% communication volume reduction. ... Experiments are conducted on 128 NVIDIA H100 GPUs...
Researcher Affiliation Academia Xuanlei Zhao, Shenggan Cheng, Chang Chen, Zangwei Zheng, Ziming Liu, Zheming Yang, Yang You (National University of Singapore). Correspondence to: Yang You <EMAIL>.
Pseudocode No The paper describes the Dynamic Sequence Parallelism (DSP) method conceptually and with diagrams, but does not include any structured pseudocode or algorithm blocks.
Open Source Code No The paper mentions that 'The code is implemented using PyTorch' and that 'DSP can be enabled on PyTorch' with a high-level API, but it does not provide an explicit statement about releasing the source code for the methodology described, nor does it provide a link to a repository.
Open Datasets No The paper describes the characteristics of the sequences used in experiments, such as 'The spatial sequence, representing video resolution, was fixed at 1024x1024' and 'The temporal sequence, representing video length, scales linearly in the test', but it does not specify a publicly available dataset by name, citation, or provide any access information.
Dataset Splits No The paper describes various experiment settings related to model size, sequence length, and batch size for weak and strong scaling evaluations, but it does not provide specific details on training, validation, or test dataset splits.
Hardware Specification Yes Experiments are conducted on 128 NVIDIA H100 GPUs, interconnected via NVLink within nodes and InfiniBand across nodes.
Software Dependencies No The paper states, 'The code is implemented using PyTorch (Paszke et al., 2019)', but does not specify the version number of PyTorch or any other software libraries or dependencies used in their implementation.
Experiment Setup Yes In the experiments, we use 720M and 3B sizes for the 2D-Transformer. The specific model settings are shown in Table 5. ... For each method, the minimum sequence parallel size that would not result in out-of-memory errors was employed to reduce communication overhead, with data parallelism employed for the remaining size. ZeRO-2 was used for all methods except Megatron-SP. ... The accumulated sequence length ranged from 0.5M to 4M... The spatial sequence... was fixed at 1024x1024... The final length for the spatial sequence was 4096. The temporal sequence... scales linearly in the test. ... In the weak scaling experiments... the batch size is linearly increased proportional to the number of GPUs, while the sequence length is fixed. In the strong scaling experiments... both batch size and sequence length are fixed. Specific parallel sizes are detailed in Table 4, and scaling experiment settings in Tables 6 and 7.
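The weak- vs. strong-scaling settings quoted above can be sketched as a small helper. This is a minimal illustration, not code from the paper: the function name `scaling_config` and the default values (`base_batch=1`, `seq_len=4096`, matching the quoted final spatial sequence length) are assumptions chosen for the example.

```python
def scaling_config(num_gpus, base_batch=1, seq_len=4096, mode="weak"):
    """Return (global_batch_size, sequence_length) for a scaling run.

    Weak scaling: batch size grows linearly with the GPU count while the
    sequence length stays fixed, so per-GPU work is constant.
    Strong scaling: both batch size and sequence length stay fixed, so the
    total work is constant and is divided across more GPUs.
    """
    if mode == "weak":
        return base_batch * num_gpus, seq_len
    elif mode == "strong":
        return base_batch, seq_len
    raise ValueError(f"unknown mode: {mode}")


# Example: doubling GPUs doubles the global batch under weak scaling only.
for gpus in (8, 16, 32):
    print(gpus, scaling_config(gpus, mode="weak"), scaling_config(gpus, mode="strong"))
```

Under weak scaling, per-GPU throughput should ideally stay flat as GPUs are added; under strong scaling, end-to-end time should ideally shrink linearly. Deviations from those ideals are what the paper's Tables 6 and 7 quantify.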