Efficiently Serving Large Multimodal Models Using EPD Disaggregation
Authors: Gursimran Singh, Xinglu Wang, Yifan Hu, Timothy Tin Long Yu, Linzi Xing, Wei Jiang, Zhefeng Wang, Bai Xiaolong, Yi Li, Ying Xiong, Yong Zhang, Zhenan Fan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental evaluations with popular LMMs show substantial gains in memory efficiency (up to 15× lower peak memory utilization), batch sizes (up to 22× larger), 10× more images per request, and 2.2× larger KV caches. Furthermore, it leads to significant improvements in SLO attainment (up to 90–100% improvement) and TTFT (up to 71% reduction), compared to systems that do not disaggregate. In this section, we analyze and compare the performance of the proposed EPD disaggregation method against various baselines. |
| Researcher Affiliation | Collaboration | 1Huawei Technologies Canada, BC, Canada 2Simon Fraser University, BC, Canada 3Huawei Cloud, China. |
| Pseudocode | No | The paper describes the system design and optimization techniques in prose, but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/vbdi/epdserve. |
| Open Datasets | Yes | Datasets: To evaluate performance across diverse scenarios, we use three datasets: synthetic workload, NExT-QA, and Video-MME. The synthetic workload enables configurable parameters such as prompt length, number of images per request, image resolution, output length, and sampling settings. Unless otherwise noted, the input prompt length is set to 22 tokens. NExT-QA (Xiao et al., 2021), a benchmark video question-answering dataset, features human-annotated questions and answers, offering a more realistic reflection of real-world video request distributions compared to the synthetic workload. Video-MME (Fu et al., 2024) is a multimodal evaluation dataset designed for assessing LMMs on video understanding tasks. |
| Dataset Splits | Yes | In this experiment, we evaluate the goodput of EPD and the baselines in an online setting where 100 multimodal requests arrive following a Poisson process with rate λ. Next, we repeat the experiment using the non-synthetic video question-answering dataset, NExT-QA (Xiao et al., 2021). To do so, we randomly sampled 100 examples, with input text token lengths ranging from 4 to 21 (average: 11.42) and output token lengths ranging from 1 to 7 (average: 2.75). Finally, we conduct the same experiment using the Video-MME (Fu et al., 2024) dataset. We evaluate SLO attainment, defined as TTFT ≤ 3.1s and TPOT ≤ 0.025s, on 100 randomly sampled Video-MME examples using MiniCPM-V 2.6. |
| Hardware Specification | Yes | We conducted our experiments using a cluster of 8 NVIDIA A100 GPUs (80GB). Each server was equipped with 128 CPUs and 1TB of RAM. |
| Software Dependencies | Yes | The CUDA version was 12.2. FlashAttention-2 was used for the attention implementation. Finally, for the vLLM inference engine, we used version 0.6.1.post1, which represents a stable version for multimodal inference. EPD-NPU also used a version of Ascend-vLLM (version 0.6.3.post1) with CANN version 7.6. |
| Experiment Setup | Yes | Specifically, these include a block size of 16; a maximum of 2048 blocks per request; context tokens capped at 49,152, and decoding tokens at 81,920 per batch. The scheduling policy for all stages was set to First-Come-First-Served (FCFS). Further, to allow enough resources for memory-heavy multimodal requests to execute, KV cache GPU utilization was set to 50%, and a limit of 32 multimedia items per prompt was imposed. The size of the multimodal cache was fixed to 3000 across all models, and the vLLM inference engine was run in eager mode. In our online experiments, requests were sent to the inference engine using a Poisson arrival process with a fixed λ, representing the number of requests per second. |
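The online-serving protocol quoted above (Poisson arrivals at rate λ, and SLO attainment defined as TTFT ≤ 3.1s and TPOT ≤ 0.025s) can be sketched in a few lines. This is a minimal illustration of the standard definitions, not the paper's actual benchmarking code; the function names, the default seed, and the λ value in the usage example are assumptions for demonstration only.

```python
import random


def poisson_arrival_times(num_requests, lam, seed=0):
    """Generate arrival timestamps (seconds) for a Poisson process.

    Inter-arrival gaps of a Poisson process with rate lam (requests/sec)
    are i.i.d. exponential with mean 1/lam, so we accumulate exponential
    draws to get absolute send times for each request.
    """
    rng = random.Random(seed)  # fixed seed for reproducible workloads
    t, times = 0.0, []
    for _ in range(num_requests):
        t += rng.expovariate(lam)
        times.append(t)
    return times


def slo_attainment(ttfts, tpots, ttft_slo=3.1, tpot_slo=0.025):
    """Fraction of requests meeting BOTH the TTFT and TPOT SLOs.

    ttfts: per-request time-to-first-token (seconds)
    tpots: per-request time-per-output-token (seconds)
    """
    met = sum(
        1 for f, p in zip(ttfts, tpots) if f <= ttft_slo and p <= tpot_slo
    )
    return met / len(ttfts)
```

For example, `poisson_arrival_times(100, 2.0)` yields 100 monotonically increasing send times averaging two requests per second, and `slo_attainment([1.0, 4.0], [0.02, 0.02])` returns `0.5`, since only the first request meets the 3.1s TTFT bound.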