EDiT: A Local-SGD-Based Efficient Distributed Training Method for Large Language Models

Authors: Jialiang Cheng, Ning Gao, Yun Yue, Zhiling Ye, Jiadi Jiang, Jian Sha

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results demonstrate the superior performance of EDiT/A-EDiT, establishing them as robust solutions for distributed LLM training in diverse computational ecosystems. We conduct a series of experiments to validate large-scale asynchronous training for LLMs, accompanied by comprehensive analyses.
Researcher Affiliation Industry Jialiang Cheng, Ning Gao, Yun Yue, Zhiling Ye, Jiadi Jiang, Jian Sha (Ant Group)
Pseudocode Yes Here we provide the formal descriptions of the EDiT method and model synchronization with the pseudo gradient penalty strategy in Algorithm 1 and Algorithm 2, respectively, to help readers better understand our work. Algorithm 1 EDiT Algorithm. Algorithm 2 Sync() in Algorithm 1
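The synchronization step the paper formalizes as Algorithm 2 can be sketched in PyTorch, under the assumption that the pseudo gradient is the difference between the last synchronized global parameters and the current local parameters, clipped in norm before the outer optimizer applies it. The function name and structure here are illustrative, not the paper's exact Algorithm 2:

```python
import torch

def sync_with_pseudo_grad_clip(local_model, global_params, outer_opt, clip_norm=10.0):
    # Hedged sketch of a DiLoCo/EDiT-style sync step: the pseudo gradient
    # (global - local) is norm-clipped (phi = 10 in the quoted setup) and fed
    # to the outer optimizer; the result is broadcast back to the local model.
    with torch.no_grad():
        pseudo_grads = [g - p for g, p in zip(global_params, local_model.parameters())]
        # Global norm of the pseudo gradient across all parameter tensors.
        total_norm = torch.norm(torch.stack([pg.norm() for pg in pseudo_grads]))
        scale = min(1.0, (clip_norm / (total_norm + 1e-6)).item())
        for g, pg in zip(global_params, pseudo_grads):
            g.grad = pg * scale
        outer_opt.step()       # e.g. SGD with Nesterov momentum over global params
        outer_opt.zero_grad()
        # Reload the updated global parameters into the local model.
        for p, g in zip(local_model.parameters(), global_params):
            p.copy_(g)
```

In a multi-worker run the pseudo gradients would additionally be all-reduced across nodes before the outer step; that communication is omitted here for clarity.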
Open Source Code Yes The code is available in the Atorch codebase: https://github.com/intelligent-machine-learning/atorch/
Open Datasets Yes We use a new large-scale open-source dataset, FineWeb-Edu (Lozhkov et al., 2024), in our experiments. This dataset comprises 1.3T tokens of premium educational web pages filtered from the extensive FineWeb repository (Penedo et al., 2024).
Dataset Splits No The paper mentions using Fine Web-Edu and an in-house dataset for training and evaluates performance using 'Validation Perplexity' and 'public benchmarks'. However, it does not explicitly state the specific training/test/validation splits (e.g., percentages or sample counts) for these datasets, only implying the use of a validation set.
Hardware Specification Yes The experimental infrastructure comprised eight Nvidia A100 GPU nodes (64 GPUs in total), arranged as an 8 × 8 device mesh.
Software Dependencies No Following DiLoCo (Douillard et al., 2023), we use AdamW (Loshchilov & Hutter, 2019) as the inner optimizer and Nesterov momentum (Nesterov, 1983) as the outer optimizer. This lists the optimizers used but does not provide specific version numbers for software libraries or environments such as Python, PyTorch, or CUDA.
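The quoted optimizer pairing maps directly onto standard PyTorch optimizers. A minimal sketch, with learning rates and momentum taken from the Experiment Setup row (the model and variable names are illustrative):

```python
import torch

# Inner (local) optimizer: AdamW, stepped every mini-batch on each worker.
model = torch.nn.Linear(8, 8)
inner_opt = torch.optim.AdamW(model.parameters(), lr=1.5e-4)

# Outer optimizer: SGD with Nesterov momentum, stepped only at synchronization,
# operating on a detached copy of the global parameters.
global_params = [p.detach().clone() for p in model.parameters()]
outer_opt = torch.optim.SGD(global_params, lr=0.8, momentum=0.85, nesterov=True)
```

Note this only shows the construction; the paper's contribution lies in how and when the outer step is triggered, which PyTorch versions the authors used is not stated.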
Experiment Setup Yes Synchronization intervals τ and τ_time are set to 128 and 600s, respectively. ... ϕ is 10 for pseudo gradient clip. ... a context length of 4,096 tokens and a cosine learning rate decay schedule are consistently applied. For the FineWeb-Edu dataset ..., the total batch size is set to 1,024 and the training step is set to 100,000 (≈ 420B tokens). The learning rate for Baseline, inner learning rate, outer learning rate, and outer momentum are set to 3e-4, 1.5e-4, 0.8, and 0.85, respectively.
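The quoted "≈ 420B tokens" follows directly from the stated batch size, context length, and step count, which can be verified with a one-line calculation:

```python
# Sanity check of the quoted token budget: 100,000 steps with a total batch
# of 1,024 sequences at a 4,096-token context length.
steps, batch_size, context_len = 100_000, 1_024, 4_096
total_tokens = steps * batch_size * context_len
print(f"{total_tokens / 1e9:.1f}B tokens")  # 419.4B, matching the quoted ~420B
```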