EDiT: A Local-SGD-Based Efficient Distributed Training Method for Large Language Models
Authors: Jialiang Cheng, Ning Gao, Yun Yue, Zhiling Ye, Jiadi Jiang, Jian Sha
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate the superior performance of EDiT/A-EDiT, establishing them as robust solutions for distributed LLM training in diverse computational ecosystems. We conduct a series of experiments to validate large-scale asynchronous training for LLMs, accompanied by comprehensive analyses. |
| Researcher Affiliation | Industry | Jialiang Cheng, Ning Gao, Yun Yue, Zhiling Ye, Jiadi Jiang, Jian Sha (Ant Group) |
| Pseudocode | Yes | Here we provide the formal descriptions of the EDiT method and model synchronization with the pseudo-gradient penalty strategy in Algorithm 1 and Algorithm 2, respectively, to help readers better understand our work. Algorithm 1: EDiT Algorithm. Algorithm 2: Sync() in Algorithm 1. |
| Open Source Code | Yes | The code is available at Atorch codebase 1. 1https://github.com/intelligent-machine-learning/atorch/ |
| Open Datasets | Yes | We use a new large-scale open-source dataset, FineWeb-Edu (Lozhkov et al., 2024), in our experiments. This dataset comprises 1.3T tokens of premium educational web pages filtered from the extensive FineWeb repository (Penedo et al., 2024). |
| Dataset Splits | No | The paper mentions using FineWeb-Edu and an in-house dataset for training and evaluates performance using 'Validation Perplexity' and 'public benchmarks'. However, it does not explicitly state the training/test/validation splits (e.g., percentages or sample counts) for these datasets; it only implies the use of a validation set. |
| Hardware Specification | Yes | The experimental infrastructure comprised eight NVIDIA A100 GPU nodes (64 GPUs in total), arranged as an 8 × 8 device mesh. |
| Software Dependencies | No | Following DiLoCo (Douillard et al., 2023), we use AdamW (Loshchilov & Hutter, 2019) as the inner optimizer and Nesterov momentum (Nesterov, 1983) as the outer optimizer. This lists the optimizers used but does not provide version numbers for software libraries or environments such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Synchronization intervals τ and τtime are set to 128 and 600s, respectively. ... ϕ is 10 for pseudo gradient clip. ... a context length of 4,096 tokens and a cosine learning rate decay schedule are consistently applied. For the FineWeb-Edu dataset ..., the total batch size is set to 1,024 and the training step is set to 100,000 (≈ 420B tokens). The learning rate for Baseline, inner learning rate, outer learning rate, and outer momentum are set to 3e-4, 1.5e-4, 0.8, and 0.85, respectively. |
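The Pseudocode, Software Dependencies, and Experiment Setup rows together describe a DiLoCo-style two-level scheme: each worker runs local inner steps, and a periodic synchronization applies a Nesterov outer update to the averaged "pseudo-gradient" (the displacement of local models from the global model), with clipping. The following is a minimal stdlib-Python sketch on a toy quadratic, not the paper's implementation: plain SGD stands in for the AdamW inner optimizer, and the function name, objective, and default values are illustrative (the outer hyperparameters mirror the reported 0.8 / 0.85 settings).

```python
import random

def local_sgd_with_outer_momentum(
    x0=5.0, workers=4, rounds=10, inner_steps=8,
    inner_lr=0.1, outer_lr=0.8, outer_momentum=0.85, clip=10.0,
):
    """Toy sketch of Local SGD with a Nesterov outer update and
    pseudo-gradient clipping, minimizing f(x) = x^2 (optimum at x = 0)."""
    global_x = x0
    buf = 0.0                      # outer momentum buffer
    rng = random.Random(0)
    for _ in range(rounds):
        local_models = []
        for _ in range(workers):
            x = global_x           # each worker starts from the global model
            for _ in range(inner_steps):
                grad = 2.0 * x + rng.gauss(0.0, 0.01)  # noisy gradient of x^2
                x -= inner_lr * grad                   # inner SGD step
            local_models.append(x)
        # Pseudo-gradient: displacement from the global model, averaged over workers.
        pg = global_x - sum(local_models) / workers
        # Clip the pseudo-gradient magnitude (the role of phi in the setup).
        pg = max(-clip, min(clip, pg))
        # Nesterov-style outer update on the pseudo-gradient.
        buf = outer_momentum * buf + pg
        global_x -= outer_lr * (pg + outer_momentum * buf)
    return global_x
```

On this toy objective the global iterate contracts toward the minimum across rounds; the actual method applies the same pattern per-parameter across a device mesh.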
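The Experiment Setup row specifies a dual synchronization trigger: sync every τ = 128 steps or every τtime = 600 s, whichever comes first. A minimal sketch of such a trigger, assuming a step-or-timeout semantics (the paper's actual trigger logic may differ; the class and method names are illustrative):

```python
import time

class SyncTrigger:
    """Fires a synchronization when either `tau` local steps have elapsed
    or `tau_time` seconds have passed since the last sync."""

    def __init__(self, tau=128, tau_time=600.0, clock=time.monotonic):
        self.tau = tau
        self.tau_time = tau_time
        self.clock = clock         # injectable for testing
        self.steps = 0
        self.last_sync = clock()

    def step(self):
        """Record one local step; return True if a sync should fire now."""
        self.steps += 1
        if self.steps >= self.tau or self.clock() - self.last_sync >= self.tau_time:
            self.steps = 0
            self.last_sync = self.clock()
            return True
        return False
```

Injecting the clock keeps the time-based path testable without real waiting.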
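The setup also reports a cosine learning-rate decay over 100,000 steps with a peak of 3e-4 for the baseline. A standard cosine schedule consistent with those numbers is sketched below; the `warmup_steps` and `min_lr` parameters are common additions not stated in the paper.

```python
import math

def cosine_lr(step, total_steps=100_000, peak_lr=3e-4, min_lr=0.0, warmup_steps=0):
    """Cosine learning-rate decay from peak_lr to min_lr over total_steps,
    with optional linear warmup."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The schedule starts at the peak rate, reaches half the peak at the midpoint, and decays to `min_lr` at the final step.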