Optimal Distributed Training With Co-Adaptive Data Parallelism in Heterogeneous Environments
Authors: Lifang Chen, Zhichao Chen, Liqi Yan, Yanyu Cheng, Fangli Guan, Pan Li
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the ImageNet100 dataset demonstrate that C-ADP achieves fast convergence in heterogeneous distributed training environments. Compared to Distributed Data Parallel (DDP) and DeepSpeed, C-ADP achieves 21.6× and 26.3× improvements in FLOPS, respectively, and reduces training time by about 72% and 47%, respectively. |
| Researcher Affiliation | Academia | Lifang Chen, Zhichao Chen, Liqi Yan, Yanyu Cheng, Fangli Guan and Pan Li, Hangzhou Dianzi University. Corresponding authors: Pan Li, Liqi Yan. |
| Pseudocode | Yes | Algorithm 1: Our Data Parallel Scheduling Algorithm. Input: S, n, ρ, µ^(0), µ_decay, λ^(0), ε, η^(0), η_decay, ψ, ζ. Output: x̂. 1: Let x ← (S/n)·1, k ← 0, µ ← µ^(0), λ ← λ^(0), η ← η^(0). 2: while k < ψ do 3: i ← 0 4: while i < ζ do 5: compute gradient ∇_x L_ρ(x^(k,i), λ^(k), µ^(k)) 6: update x^(k,i+1) ← x^(k,i) − η^(k) ∇_x L_ρ(x^(k,i), λ^(k), µ^(k)) 7: if ‖∇L‖₂ ≤ ε then 8: break 9: end if 10: i ← i + 1 11: end while; update λ^(k+1) ← λ^(k) + ρ(Σ_j x_j^(k+1,0) − S); update µ^(k+1) ← µ_decay · µ^(k) 12: if ‖∇L‖₂ ≤ ε then 13: break 14: end if 15: k ← k + 1 16: end while 17: x* ← x 18: x̂ ← Round(x*) 19: return x̂ |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code or links to a code repository for the described methodology. |
| Open Datasets | Yes | The experiments were carried out using the ImageNet100 dataset, a subset of the ImageNet dataset. |
| Dataset Splits | Yes | ImageNet100 consists of 100 classes, with 100,000 training samples and 10,000 test samples, making it a balanced and diverse dataset for evaluating model performance. |
| Hardware Specification | Yes | Ranks 1–4 use RTX 3090 GPUs with 24GB VRAM. Ranks 5–6 are 4-core CPUs with 8GB memory. Rank 7 is a 16-core CPU with 32GB memory, and rank 8 is an 8-core CPU with 16GB memory. |
| Software Dependencies | No | The paper mentions PyTorch as a framework used by DDP but does not specify a version number for PyTorch or any other software dependencies used in their own implementation. |
| Experiment Setup | Yes | For all GPUs, the batch size is set to 64, chosen to ensure stable training within the GPU's memory capacity given the model's high computational complexity. For all CPUs, the batch size is set to 4, considering their limited computing power. All settings are designed to keep operations within each device's memory and computational limits. The maximum number of epochs is set to 50. The initial learning rate is 0.01. To improve convergence, the learning rate is halved after every 10 epochs. |
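The pseudocode row describes an augmented-Lagrangian loop that splits S training samples across n heterogeneous workers and rounds the result to integer batch allocations. The sketch below mirrors that structure in Python; note that the concrete objective (a quadratic per-worker cost weighted by device speed), the parameter defaults, and the placement of the η_decay update are assumptions for illustration — the paper's actual L_ρ is not reproduced in this report.

```python
import numpy as np

def schedule(S, n, speeds, rho=1.0, mu0=1.0, mu_decay=0.9, lam0=0.0,
             eps=1e-6, eta0=0.1, eta_decay=0.95, psi=100, zeta=100):
    """Sketch of Algorithm 1: allocate S samples across n workers with an
    augmented-Lagrangian gradient loop, then round to integers."""
    x = np.full(n, S / n)              # line 1: uniform initial split
    lam, mu, eta = lam0, mu0, eta0

    def grad(x, lam, mu):
        # ASSUMED objective: per-worker cost mu * x_j^2 / speed_j, so slower
        # devices are pushed toward smaller shares. The multiplier (lam) and
        # penalty (rho) terms enforce the constraint sum(x) = S.
        c = x.sum() - S                # constraint residual
        return mu * 2 * x / speeds + lam + rho * c

    for k in range(psi):               # outer loop (lines 2-16)
        for i in range(zeta):          # inner gradient descent (lines 4-11)
            g = grad(x, lam, mu)
            x = x - eta * g            # line 6: gradient step
            if np.linalg.norm(g) <= eps:
                break                  # lines 7-9: inner stopping test
        lam = lam + rho * (x.sum() - S)   # dual (multiplier) update
        mu = mu_decay * mu                # anneal objective weight
        eta = eta_decay * eta             # eta_decay placement assumed
        if np.linalg.norm(grad(x, lam, mu)) <= eps:
            break                      # lines 12-14: outer stopping test
    return np.round(x).astype(int)     # lines 17-19: x_hat = Round(x*)
```

For example, `schedule(100, 4, np.array([4.0, 4.0, 1.0, 1.0]))` allocates roughly 100 samples in total, with the two faster workers receiving larger shares than the two slower ones.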
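The stated schedule (initial learning rate 0.01, halved every 10 epochs) is a standard step decay; the helper name below is illustrative, not from the paper.

```python
def lr_at(epoch, base_lr=0.01, gamma=0.5, step=10):
    """Step-decay schedule: multiply the learning rate by `gamma`
    every `step` epochs, starting from `base_lr`."""
    return base_lr * gamma ** (epoch // step)
```

In PyTorch this corresponds to `torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)` with the optimizer's initial learning rate set to 0.01.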