Pfeife: Automatic Pipeline Parallelism for PyTorch
Authors: Ho Young Jhoo, Chung-Kil Hur, Nuno P. Lopes
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate Pfeife in three ways: (1) applicability of the approach, (2) accuracy of cost estimations, and (3) end-to-end performance comparison with existing frameworks. ... Table 1: Throughput comparison of pipeline parallelism (item/s). |
| Researcher Affiliation | Collaboration | 1Seoul National University, Republic of Korea. 2INESC-ID / Instituto Superior Técnico, University of Lisbon, Portugal. 3Furiosa AI, Republic of Korea. |
| Pseudocode | Yes | Algorithm 1 shows the pseudo-code. ... Algorithm 1 Graph-schedule co-optimization. |
| Open Source Code | Yes | Pfeife. Available at https://github.com/MerHS/pfeife. |
| Open Datasets | Yes | We used TorchBench (Hao et al., 2023), which is the official PyTorch benchmark suite. It includes a wide range of models. ... Vision Transformer (ViT-g/14) (Zhai et al., 2022) and GPT2-large (Radford et al., 2019) ... Llama2-7B (Touvron et al., 2023), and a diffusion model (Stable Diffusion-XL) (Podell et al., 2023) |
| Dataset Splits | No | The paper refers to benchmarks and models such as TorchBench, ViT-g/14, Llama2-7B, and Stable Diffusion-XL, but it does not explicitly describe how any dataset was split into training, validation, or test sets for the experiments. It mentions "mini-batch size" and "total batch count" but not data partitioning for evaluation. |
| Hardware Specification | Yes | For coverage and correctness, we used a small server with 8x NVIDIA RTX 3090 24 GiB GPUs with 4 NVLink connections. For the end-to-end experiments, we used a larger server with 8x A100 40GB GPUs with NVSwitch. |
| Software Dependencies | Yes | ML models are written in plain PyTorch. They are then compiled using PyTorch 2's torch.compile (Ansel et al., 2024), as is now common. |
| Experiment Setup | Yes | Listing 1 shows an example of the full code required to train a model with Pfeife... optimizer = torch.optim.Adam(main_model.parameters(), lr=1e-5); criterion = torch.nn.CrossEntropyLoss() ... (B) Total batch count: Number of mini-batches. (Nl) Loop count: How many times the forward loop is executed. (Bl) Loop batch count: How many mini-batches go through the forward pass of a single stage. (Bf) Prefetch batch count: A list with the number of forward passes each device runs in addition to Bl before it runs its first backward pass; \|Bf\| = \|D\|. |
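The scheduling parameters quoted in the Experiment Setup row (B, Nl, Bl, Bf) can be related with simple arithmetic. The sketch below is illustrative only: the relation Nl = B / Bl is an assumption inferred from the definitions (the quoted text defines each quantity but does not state this formula), and the helper name, device list, and prefetch values are hypothetical.

```python
# Illustrative sketch of the pipeline scheduling parameters.
# Assumption (not stated in the quoted text): the B mini-batches are
# split evenly across forward loops, so Nl = B / Bl.

def loop_count(total_batch_count: int, loop_batch_count: int) -> int:
    """Hypothetical helper: number of forward-loop iterations Nl,
    assuming B mini-batches are divided evenly into loops of Bl."""
    if total_batch_count % loop_batch_count != 0:
        raise ValueError("B must be divisible by Bl under this assumption")
    return total_batch_count // loop_batch_count

# Example: B = 8 mini-batches with Bl = 2 per loop gives Nl = 4 loops.
B, Bl = 8, 2
Nl = loop_count(B, Bl)
print(Nl)  # -> 4

# Bf is a per-device list of prefetch counts: one entry per device,
# so |Bf| = |D|. Values here are hypothetical.
devices = ["cuda:0", "cuda:1", "cuda:2", "cuda:3"]
Bf = [3, 2, 1, 0]
assert len(Bf) == len(devices)
```

A descending prefetch list like `[3, 2, 1, 0]` mirrors the usual pipeline warm-up pattern, where earlier stages run more forward passes before their first backward pass; the actual values Pfeife chooses come from its co-optimization and are not reproduced here.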