Optimal Distributed Training With Co-Adaptive Data Parallelism in Heterogeneous Environments

Authors: Lifang Chen, Zhichao Chen, Liqi Yan, Yanyu Cheng, Fangli Guan, Pan Li

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on the ImageNet100 dataset demonstrate that C-ADP achieves fast convergence in heterogeneous distributed training environments. Compared to Distributed Data Parallel (DDP) and DeepSpeed, C-ADP achieves 21.6x and 26.3x improvements in FLOPS, respectively, and reduces training time by about 72% and 47%, respectively.
Researcher Affiliation | Academia | Lifang Chen, Zhichao Chen, Liqi Yan, Yanyu Cheng, Fangli Guan and Pan Li, Hangzhou Dianzi University. EMAIL, chenzhichao EMAIL, EMAIL, EMAIL, EMAIL, EMAIL. Corresponding authors: Pan Li, Liqi Yan.
Pseudocode | Yes | Algorithm 1: Our Data Parallel Scheduling Algorithm
Input: S, n, ρ, µ^(0), µ_decay, λ^(0), ε, η^(0), η_decay, ψ, ζ
Output: x̂
1:  Let x ← (S/n)·1, k ← 0, µ ← µ^(0), λ ← λ^(0), η ← η^(0)
2:  while k < ψ do
3:      i ← 0
4:      while i < ζ do
5:          Compute gradient ∇_x L_ρ(x^(k,i), λ^(k), µ^(k))
6:          Update x^(k,i+1) ← x^(k,i) − η^(k) ∇_x L_ρ(x^(k,i), λ^(k), µ^(k))
7:          if ‖∇L‖_2 ≤ ε then
8:              break
9:          end if
10:         i ← i + 1
11:     end while
12:     Update λ^(k+1) ← λ^(k) + ρ(Σ_j x_j^(k+1,0) − S)
13:     Update µ^(k+1) ← µ_decay · µ^(k)
14:     if ‖∇L‖_2 ≤ ε then
15:         break
16:     end if
17:     k ← k + 1
18: end while
19: x* ← x
20: x̂ ← Round(x*)
21: return x̂
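The pseudocode above follows a standard augmented-Lagrangian pattern: an inner gradient-descent loop on x, an outer loop that updates the multiplier λ against the constraint Σ_j x_j = S and decays the penalty weight µ, and a final rounding step. A minimal runnable sketch of that control flow is below. The objective L_ρ itself is not reproduced in this excerpt, so a placeholder quadratic cost is used; the function name `schedule_data_parallel` and the cost term are illustrative assumptions, not the paper's actual objective.

```python
import numpy as np

def schedule_data_parallel(S, n, rho=1.0, mu0=1.0, mu_decay=0.9,
                           lam0=0.0, eps=1e-4, eta0=0.01, eta_decay=0.99,
                           psi=100, zeta=50):
    """Sketch of Algorithm 1's loop structure with a placeholder
    quadratic objective (assumption): L_rho = mu*||x||^2 + lam*(sum(x)-S)
    + (rho/2)*(sum(x)-S)^2, enforcing the total-workload constraint."""
    x = np.full(n, S / n)            # step 1: uniform initial split
    lam, mu, eta = lam0, mu0, eta0

    def grad_L(x, lam, mu):
        # Gradient of the placeholder augmented Lagrangian above.
        return 2.0 * mu * x + lam + rho * (x.sum() - S)

    for k in range(psi):                       # outer loop (<= psi iters)
        for i in range(zeta):                  # inner descent (<= zeta iters)
            g = grad_L(x, lam, mu)
            if np.linalg.norm(g) <= eps:       # ||grad L||_2 <= eps
                break
            x = x - eta * g                    # gradient step on x
        lam = lam + rho * (x.sum() - S)        # dual (multiplier) update
        mu = mu_decay * mu                     # decay penalty weight
        eta = eta_decay * eta                  # decay step size
        if np.linalg.norm(grad_L(x, lam, mu)) <= eps:
            break
    return np.round(x).astype(int)             # final step: Round(x*)
```

With a symmetric cost like this, the sketch recovers an even split (e.g. S=100 samples over n=4 workers yields 25 each); the paper's real L_ρ would weight workers by their heterogeneous throughput.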
Open Source Code | No | The paper does not contain any explicit statements about releasing source code or links to a code repository for the described methodology.
Open Datasets | Yes | The experiments were carried out using the ImageNet100 dataset, a subset of the ImageNet dataset.
Dataset Splits | Yes | ImageNet100 consists of 100 classes, with 100,000 training samples and 10,000 test samples, making it a balanced and diverse dataset for evaluating model performance.
Hardware Specification | Yes | Ranks 1-4 use RTX 3090 GPUs with 24GB VRAM. Ranks 5-6 are 4-core CPUs with 8GB memory. Rank 7 is a 16-core CPU with 32GB memory, and rank 8 is an 8-core CPU with 16GB memory.
Software Dependencies | No | The paper mentions PyTorch as the framework used by DDP but does not specify a version number for PyTorch or any other software dependencies used in its own implementation.
Experiment Setup | Yes | For all GPUs, the batch size is set to 64 to ensure stable training within the GPU's memory capacity given the model's high computational complexity. For all CPUs, the batch size is set to 4 in light of their limited computing power. All settings are designed to keep operations within each device's memory and computational limits. The maximum number of epochs is 50, and the initial learning rate is 0.01; to improve convergence, the learning rate is halved after every 10 epochs.
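The learning-rate policy described above (initial rate 0.01, halved every 10 epochs) is a plain step schedule and can be sketched as below; the function is illustrative, not taken from the paper. In PyTorch the same policy corresponds to `torch.optim.lr_scheduler.StepLR` with `step_size=10`, `gamma=0.5`.

```python
def learning_rate(epoch, lr0=0.01, drop_every=10, factor=0.5):
    """Step schedule matching the reported setup: start at lr0 and
    multiply by `factor` after every `drop_every` epochs (0-indexed)."""
    return lr0 * factor ** (epoch // drop_every)
```

For example, epochs 0-9 train at 0.01, epochs 10-19 at 0.005, and so on down through the 50-epoch budget.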