DSP: Dynamic Sequence Parallelism for Multi-Dimensional Transformers

Authors: Xuanlei Zhao, Shenggan Cheng, Chang Chen, Zangwei Zheng, Ziming Liu, Zheming Yang, Yang You

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments demonstrate DSP's superiority over state-of-the-art sequence parallelism methods by remarkable throughput improvements ranging from 32.2% to 10x, with at least 50% communication volume reduction. ... Experiments are conducted on 128 NVIDIA H100 GPUs...
Researcher Affiliation Academia Xuanlei Zhao, Shenggan Cheng, Chang Chen, Zangwei Zheng, Ziming Liu, Zheming Yang, Yang You (National University of Singapore). Correspondence to: Yang You <EMAIL>.
Pseudocode No The paper describes the Dynamic Sequence Parallelism (DSP) method conceptually and with diagrams, but does not include any structured pseudocode or algorithm blocks.
Open Source Code No The paper mentions that 'The code is implemented using PyTorch' and that 'DSP can be enabled on PyTorch' with a high-level API, but it does not provide an explicit statement about releasing the source code for the methodology described, nor does it provide a link to a repository.
Open Datasets No The paper describes the characteristics of the sequences used in experiments, such as 'The spatial sequence, representing video resolution, was fixed at 1024x1024' and 'The temporal sequence, representing video length, scales linearly in the test', but it does not specify a publicly available dataset by name, citation, or provide any access information.
Dataset Splits No The paper describes various experiment settings related to model size, sequence length, and batch size for weak and strong scaling evaluations, but it does not provide specific details on training, validation, or test dataset splits.
Hardware Specification Yes Experiments are conducted on 128 NVIDIA H100 GPUs, interconnected via NVLink within nodes and InfiniBand across nodes.
Software Dependencies No The paper states, 'The code is implemented using PyTorch (Paszke et al., 2019)', but does not specify the version number of PyTorch or any other software libraries or dependencies used in their implementation.
Experiment Setup Yes In the experiments, we use 720M and 3B sizes for the 2D-Transformer. The specific model settings are shown in Table 5. ... For each method, the minimum sequence parallel size that would not result in out-of-memory errors was employed to reduce communication overhead, with data parallelism employed for the remaining size. ZeRO-2 was used for all methods except Megatron-SP. ... The accumulated sequence length ranged from 0.5M to 4M... The spatial sequence... was fixed at 1024x1024... The final length for the spatial sequence was 4096. The temporal sequence... scales linearly in the test. ... In the weak scaling experiments... the batch size is linearly increased proportional to the number of GPUs, while the sequence length is fixed. In the strong scaling experiments... both batch size and sequence length are fixed. Specific parallel sizes are detailed in Table 4, and scaling experiment settings in Tables 6 and 7.
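The weak- vs. strong-scaling settings quoted above can be sketched as a small helper. This is a minimal illustration, not code from the paper: the function name `scaling_config` and the default values (`base_batch=1`, `seq_len=4096`, matching the quoted final spatial sequence length) are assumptions chosen for the example.

```python
def scaling_config(num_gpus, base_batch=1, seq_len=4096, mode="weak"):
    """Return (global_batch_size, sequence_length) for a scaling run.

    Weak scaling: batch size grows linearly with the GPU count while the
    sequence length stays fixed, so per-GPU work is constant.
    Strong scaling: both batch size and sequence length stay fixed, so the
    total work is constant and is divided across more GPUs.
    """
    if mode == "weak":
        return base_batch * num_gpus, seq_len
    elif mode == "strong":
        return base_batch, seq_len
    raise ValueError(f"unknown mode: {mode}")


# Example: doubling GPUs doubles the global batch under weak scaling only.
for gpus in (8, 16, 32):
    print(gpus, scaling_config(gpus, mode="weak"), scaling_config(gpus, mode="strong"))
```

Under weak scaling, per-GPU throughput should ideally stay flat as GPUs are added; under strong scaling, end-to-end time should ideally shrink linearly. Deviations from those ideals are what the paper's Tables 6 and 7 quantify.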