PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization
Authors: Xinyi Wan, Penghui Qi, Guangxing Huang, Min Lin, Jialin Li
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments prove that per-device activation memory effectively decreases with the total number of stages, making PP a stronger alternative to TP, offering up to a 19% acceleration with even lower memory consumption. The implementation is open-sourced at this url. ... We evaluate our methods on GPT-3-like models based on Megatron-LM (Narayanan et al., 2021). ... Our primary metrics are throughput, measured as model flops utilization (MFU), and activation memory, defined as the difference between peak and iteration-start memory. |
| Researcher Affiliation | Collaboration | 1Sea AI Lab 2National University of Singapore. Correspondence to: Min Lin <EMAIL>, Jialin Li <EMAIL>. |
| Pseudocode | No | The paper describes methods and schedules using diagrams and textual explanations (e.g., Figure 4, Figure 5, Figure 6, Figure 7) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | The implementation is open-sourced at this url. |
| Open Datasets | No | The paper mentions evaluating methods on 'GPT-3-like models' and referring to 'Llama 3 (Dubey et al., 2024)' and 'Deepseek v3 (Liu et al., 2024)', which are large language models. However, it does not provide concrete access information (link, DOI, repository, or formal citation for the specific dataset used for training these models) for a publicly available or open dataset. |
| Dataset Splits | No | The paper does not provide specific details about training/test/validation dataset splits, such as percentages, sample counts, or references to predefined splits, for the models used in the experiments. As no specific public dataset was identified, information on its splits is also absent. |
| Hardware Specification | Yes | Our experiments run on up to 32 NVIDIA A100 80G GPUs on 4 nodes interconnected by RoCE RDMA network. |
| Software Dependencies | No | The paper mentions 'Megatron-LM' as the base for GPT-3-like models and refers to 'pytorch' and 'CUDA events' in Appendix C. However, it does not specify version numbers for these or any other key software components, libraries, or programming languages. |
| Experiment Setup | Yes | We evaluate our methods on GPT-3-like models based on Megatron-LM (Narayanan et al., 2021). In most cases, one transformer layer is removed from both the first and last pipeline stages to address imbalances caused by vocabulary layers, similar to Llama 3 (Dubey et al., 2024) and Deepseek v3 (Liu et al., 2024). The models used are listed in Table 2. ... For all models we turn on GQA (Ainslie et al., 2023) with number of query group set to 8. ... For all schedules except 1F1B, we set the number of stages on each device to the maximum possible value so that each stage has at most 1 transformer layer, unless explicitly specified. |
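The paper's primary throughput metric is model FLOPs utilization (MFU). As a rough illustration of how that metric is typically computed, here is a minimal sketch using the standard ~6N FLOPs-per-token approximation for the forward plus backward pass of a dense N-parameter transformer; the specific numbers (parameter count, tokens/s, peak FLOPs) are hypothetical examples, not values from the paper.

```python
# Illustrative MFU calculation (a sketch, not the paper's implementation).
# MFU = achieved model FLOPs per second / hardware peak FLOPs per second.

def model_flops_utilization(params: float, tokens_per_sec: float,
                            peak_flops_per_sec: float) -> float:
    """Estimate MFU with the common ~6*N FLOPs-per-token approximation
    for a forward+backward pass of a dense transformer with N parameters."""
    achieved_flops_per_sec = 6 * params * tokens_per_sec
    return achieved_flops_per_sec / peak_flops_per_sec

# Hypothetical example: a 13B-parameter model on an A100 80G
# (312 TFLOPs/s BF16 dense peak) at 2,000 tokens/s per GPU.
mfu = model_flops_utilization(
    params=13e9,
    tokens_per_sec=2_000,       # assumed per-GPU throughput
    peak_flops_per_sec=312e12,  # A100 BF16 dense peak
)
print(f"MFU: {mfu:.1%}")  # → MFU: 50.0%
```

The paper's other metric, activation memory, is reported as peak memory minus iteration-start memory, so it isolates memory that grows during a training step from static weights and optimizer state.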