Poplar: Efficient Scaling of Distributed DNN Training on Heterogeneous GPU Clusters

Authors: WenZheng Zhang, Yang Hu, Jing Shi, Xiaoying Bai

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on three heterogeneous clusters, comprising six different types of GPUs, demonstrate that Poplar achieves a training throughput improvement of 1.02-3.92x over current state-of-the-art heterogeneous training systems."
Researcher Affiliation | Academia | "1 School of Computer Science, Peking University; 2 Center for Information Research, Academy of Military Sciences; 3 Advanced Institute of Big Data. EMAIL, EMAIL, EMAIL, EMAIL"
Pseudocode | Yes | "Algorithm 1: Heterogeneity Aware of each GPU"
Open Source Code | No | "We will publish all source codes of this work on Github for further research explorations."
Open Datasets | Yes | "All experiments are evaluated on wikitext2-v1 dataset (Merity et al. 2016)."
Dataset Splits | No | "All experiments are evaluated on wikitext2-v1 dataset (Merity et al. 2016)."
Hardware Specification | Yes | "Our experiments are conducted on three heterogeneous GPU clusters, each cluster contains two types of GPUs, as shown in Table 1. ... A100 80GB / A100 40GB ... V100 16GB / T4 16GB ... A800 80GB / V100S 32GB"
Software Dependencies | No | "We have implemented our work on PyTorch with around 2000+ lines of code."
Experiment Setup | Yes | "We maintain a global batch size of 2 million tokens throughout our experiments."
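The fixed 2-million-token global batch, combined with the paper's heterogeneity-aware approach (Algorithm 1), implies each GPU receives a share of the batch matched to its speed. A minimal sketch of that idea, assuming throughput-proportional splitting; the GPU names and throughput numbers below are illustrative assumptions, not values from the paper:

```python
# Hypothetical sketch: split a fixed global batch (in tokens) across
# heterogeneous GPUs in proportion to assumed per-GPU throughput.
# Device names and throughput values are made up for illustration.

GLOBAL_BATCH_TOKENS = 2_000_000  # "global batch size of 2 million tokens"

# Assumed relative throughputs (arbitrary units) for a mixed cluster.
throughput = {"A100-80GB": 4.0, "A100-40GB": 3.5, "V100-16GB": 1.5}

def split_batch(global_tokens, tput):
    """Allocate tokens proportionally to each GPU's throughput."""
    total = sum(tput.values())
    shares = {gpu: int(global_tokens * t / total) for gpu, t in tput.items()}
    # Hand any rounding remainder to the fastest GPU so totals match exactly.
    fastest = max(tput, key=tput.get)
    shares[fastest] += global_tokens - sum(shares.values())
    return shares

shares = split_batch(GLOBAL_BATCH_TOKENS, throughput)
print(shares)
```

Keeping the global batch size constant while varying the per-GPU split is what lets throughput comparisons across clusters stay apples-to-apples.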