EDiT: A Local-SGD-Based Efficient Distributed Training Method for Large Language Models

Authors: Jialiang Cheng, Ning Gao, Yun Yue, Zhiling Ye, Jiadi Jiang, Jian Sha

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results demonstrate the superior performance of EDiT/A-EDiT, establishing them as robust solutions for distributed LLM training in diverse computational ecosystems. We conduct a series of experiments to validate large-scale asynchronous training for LLMs, accompanied by comprehensive analyses.
Researcher Affiliation Industry Jialiang Cheng, Ning Gao, Yun Yue, Zhiling Ye, Jiadi Jiang, Jian Sha (Ant Group)
Pseudocode Yes Here we provide the formal descriptions of the EDiT method and model synchronization with the pseudo gradient penalty strategy in Algorithm 1 and Algorithm 2, respectively, to help readers better understand our work. Algorithm 1 EDiT Algorithm. Algorithm 2 Sync() in Algorithm 1
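The synchronization step the paper formalizes as Algorithm 2 can be sketched in PyTorch, under the assumption that the pseudo gradient is the difference between the last synchronized global parameters and the current local parameters, clipped in norm before the outer optimizer applies it. The function name and structure here are illustrative, not the paper's exact Algorithm 2:

```python
import torch

def sync_with_pseudo_grad_clip(local_model, global_params, outer_opt, clip_norm=10.0):
    # Hedged sketch of a DiLoCo/EDiT-style sync step: the pseudo gradient
    # (global - local) is norm-clipped (phi = 10 in the quoted setup) and fed
    # to the outer optimizer; the result is broadcast back to the local model.
    with torch.no_grad():
        pseudo_grads = [g - p for g, p in zip(global_params, local_model.parameters())]
        # Global norm of the pseudo gradient across all parameter tensors.
        total_norm = torch.norm(torch.stack([pg.norm() for pg in pseudo_grads]))
        scale = min(1.0, (clip_norm / (total_norm + 1e-6)).item())
        for g, pg in zip(global_params, pseudo_grads):
            g.grad = pg * scale
        outer_opt.step()       # e.g. SGD with Nesterov momentum over global params
        outer_opt.zero_grad()
        # Reload the updated global parameters into the local model.
        for p, g in zip(local_model.parameters(), global_params):
            p.copy_(g)
```

In a multi-worker run the pseudo gradients would additionally be all-reduced across nodes before the outer step; that communication is omitted here for clarity.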
Open Source Code Yes The code is available in the Atorch codebase: https://github.com/intelligent-machine-learning/atorch/
Open Datasets Yes We use a new large-scale open-source dataset, FineWeb-Edu (Lozhkov et al., 2024), in our experiments. This dataset comprises 1.3T tokens of premium educational web pages filtered from the extensive FineWeb repository (Penedo et al., 2024).
Dataset Splits No The paper mentions using Fine Web-Edu and an in-house dataset for training and evaluates performance using 'Validation Perplexity' and 'public benchmarks'. However, it does not explicitly state the specific training/test/validation splits (e.g., percentages or sample counts) for these datasets, only implying the use of a validation set.
Hardware Specification Yes The experimental infrastructure comprised eight Nvidia A100 GPU nodes (64 GPUs in total), arranged as an 8 × 8 device mesh.
Software Dependencies No Following DiLoCo (Douillard et al., 2023), we use AdamW (Loshchilov & Hutter, 2019) as the inner optimizer and Nesterov momentum (Nesterov, 1983) as the outer optimizer. This lists the optimizers used but does not provide specific version numbers for software libraries or environments such as Python, PyTorch, or CUDA.
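The quoted optimizer pairing maps directly onto standard PyTorch optimizers. A minimal sketch, with learning rates and momentum taken from the Experiment Setup row (the model and variable names are illustrative):

```python
import torch

# Inner (local) optimizer: AdamW, stepped every mini-batch on each worker.
model = torch.nn.Linear(8, 8)
inner_opt = torch.optim.AdamW(model.parameters(), lr=1.5e-4)

# Outer optimizer: SGD with Nesterov momentum, stepped only at synchronization,
# operating on a detached copy of the global parameters.
global_params = [p.detach().clone() for p in model.parameters()]
outer_opt = torch.optim.SGD(global_params, lr=0.8, momentum=0.85, nesterov=True)
```

Note this only shows the construction; the paper's contribution lies in how and when the outer step is triggered, which PyTorch versions the authors used is not stated.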
Experiment Setup Yes Synchronization intervals τ and τ_time are set to 128 and 600s, respectively. ... ϕ is 10 for pseudo gradient clip. ... a context length of 4,096 tokens and a cosine learning rate decay schedule are consistently applied. For the FineWeb-Edu dataset ..., the total batch size is set to 1,024 and the training step is set to 100,000 (≈ 420B tokens). The learning rate for Baseline, inner learning rate, outer learning rate, and outer momentum are set to 3e-4, 1.5e-4, 0.8, and 0.85, respectively.
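The quoted "≈ 420B tokens" follows directly from the stated batch size, context length, and step count, which can be verified with a one-line calculation:

```python
# Sanity check of the quoted token budget: 100,000 steps with a total batch
# of 1,024 sequences at a 4,096-token context length.
steps, batch_size, context_len = 100_000, 1_024, 4_096
total_tokens = steps * batch_size * context_len
print(f"{total_tokens / 1e9:.1f}B tokens")  # 419.4B, matching the quoted ~420B
```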