PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization
Authors: Xinyi Wan, Penghui Qi, Guangxing Huang, Min Lin, Jialin Li
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments prove that per-device activation memory effectively decreases with the total number of stages, making PP a stronger alternative to TP, offering up to a 19% acceleration with even lower memory consumption. The implementation is open-sourced at this url. ... We evaluate our methods on GPT-3-like models based on Megatron-LM (Narayanan et al., 2021). ... Our primary metrics are throughput, measured as model flops utilization (MFU), and activation memory, defined as the difference between peak and iteration-start memory. |
| Researcher Affiliation | Collaboration | 1Sea AI Lab 2National University of Singapore. Correspondence to: Min Lin <EMAIL>, Jialin Li <EMAIL>. |
| Pseudocode | No | The paper describes methods and schedules using diagrams and textual explanations (e.g., Figure 4, Figure 5, Figure 6, Figure 7) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | The implementation is open-sourced at this url. |
| Open Datasets | No | The paper mentions evaluating methods on 'GPT-3-like models' and referring to 'Llama 3 (Dubey et al., 2024)' and 'Deepseek v3 (Liu et al., 2024)', which are large language models. However, it does not provide concrete access information (link, DOI, repository, or formal citation for the specific dataset used for training these models) for a publicly available or open dataset. |
| Dataset Splits | No | The paper does not provide specific details about training/test/validation dataset splits, such as percentages, sample counts, or references to predefined splits, for the models used in the experiments. As no specific public dataset was identified, information on its splits is also absent. |
| Hardware Specification | Yes | Our experiments run on up to 32 NVIDIA A100 80G GPUs on 4 nodes interconnected by RoCE RDMA network. |
| Software Dependencies | No | The paper mentions 'Megatron-LM' as the base for GPT-3-like models and refers to 'pytorch' and 'CUDA events' in Appendix C. However, it does not specify version numbers for these or any other key software components, libraries, or programming languages. |
| Experiment Setup | Yes | We evaluate our methods on GPT-3-like models based on Megatron-LM (Narayanan et al., 2021). In most cases, one transformer layer is removed from both the first and last pipeline stages to address imbalances caused by vocabulary layers, similar to Llama 3 (Dubey et al., 2024) and Deepseek v3 (Liu et al., 2024). The models used are listed in Table 2. ... For all models we turn on GQA (Ainslie et al., 2023) with number of query group set to 8. ... For all schedules except 1F1B, we set the number of stages on each device to the maximum possible value so that each stage has at most 1 transformer layer, unless explicitly specified. |
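The paper's primary throughput metric is model FLOPs utilization (MFU). As a rough illustration of how that metric is typically computed, here is a minimal sketch using the standard ~6N FLOPs-per-token approximation for the forward plus backward pass of a dense N-parameter transformer; the specific numbers (parameter count, tokens/s, peak FLOPs) are hypothetical examples, not values from the paper.

```python
# Illustrative MFU calculation (a sketch, not the paper's implementation).
# MFU = achieved model FLOPs per second / hardware peak FLOPs per second.

def model_flops_utilization(params: float, tokens_per_sec: float,
                            peak_flops_per_sec: float) -> float:
    """Estimate MFU with the common ~6*N FLOPs-per-token approximation
    for a forward+backward pass of a dense transformer with N parameters."""
    achieved_flops_per_sec = 6 * params * tokens_per_sec
    return achieved_flops_per_sec / peak_flops_per_sec

# Hypothetical example: a 13B-parameter model on an A100 80G
# (312 TFLOPs/s BF16 dense peak) at 2,000 tokens/s per GPU.
mfu = model_flops_utilization(
    params=13e9,
    tokens_per_sec=2_000,       # assumed per-GPU throughput
    peak_flops_per_sec=312e12,  # A100 BF16 dense peak
)
print(f"MFU: {mfu:.1%}")  # → MFU: 50.0%
```

The paper's other metric, activation memory, is reported as peak memory minus iteration-start memory, so it isolates memory that grows during a training step from static weights and optimizer state.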