QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration

Authors: Hamidreza Imani, Jiaxin Peng, Peiman Mohseni, Abdolah Amirany, Tarek El-Ghazawi

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on a server with a single NVIDIA A100 GPU (80GB) using Mixtral-8x7B models demonstrate an 85% average reduction in turnaround time compared to NVIDIA's multi-instance GPU (MIG). Furthermore, experiments on Google's Switch Transformer Base-8 model with up to four variants demonstrate the scalability and resilience of our approach in maintaining output quality compared to other model merging baselines, highlighting its effectiveness.
Researcher Affiliation | Academia | (1) Department of Electrical and Computer Engineering, The George Washington University; (2) Computer Science and Engineering Department, Texas A&M University.
Pseudocode | Yes | Algorithm 1: Initial Expert Loader; Algorithm 2: Inference Orchestrator.
Open Source Code | Yes | https://github.com/hamid-74/Multi-MoE
Open Datasets | Yes | We evaluate our serving system using the Mixtral-8x7B-v0.1 and Google Switch Transformer Base-8 family of models. ... For the base model, we evaluate the quality of text generation using the WikiText2 (Merity et al., 2016), PTB (Marcus et al., 1994), and C4 (Raffel et al., 2020) datasets. ... On MT-Bench, the proposed model outperforms others with ... MT-Bench (Zheng et al., 2023) benchmarks. ... HellaSwag (Zellers et al., 2019), MMLU (Hendrycks et al., 2020), and TruthfulQA (Lin et al., 2021b) benchmarks. ... we report the ROUGE-1 scores (an N-gram-based summarization evaluation metric) for a summarization task on the SAMSum dataset (Gliwa et al., 2019)
Dataset Splits | No | The paper reports evaluation metrics and sample sizes for specific tasks (e.g., "perplexity using 128 samples, with each sample consisting of 2048 tokens", "evaluate each model with 1000 random samples" for MMLU and HellaSwag, "812 samples" for TruthfulQA, and "80 instructions" for MT-Bench), but it does not specify how the datasets were partitioned into training, validation, or test sets for reproduction, nor does it point to predefined splits beyond the initial dataset citations.
Hardware Specification | Yes | Our experiments are conducted on a server with an AMD 16-Core MILAN CPU and a single 80GB NVIDIA A100 GPU, connected via a PCIe Gen4 host-to-device interconnect, as outlined in Section 2.
Software Dependencies | No | We implement our loader and orchestrator using the PyTorch (Paszke et al., 2019) framework. Our experiments are conducted on a server with an AMD 16-Core MILAN CPU and a single 80GB NVIDIA A100 GPU, connected via a PCIe Gen4 host-to-device interconnect, as outlined in Section 2. QoS metrics are measured from PyTorch's perspective using Python's time package.
Experiment Setup | Yes | Each request to the system is a prompt consisting of 20 tokens, with the number of requested output tokens set to 25. To compare the QoS of each approach, we measure the average TTFT, average turnaround time, and the total number of processed requests (throughput). ... To demonstrate the system's responsiveness and performance under varying loads, we evaluate the system at different arrival rates (λ). ... We focus on single-batch requests, providing a controlled setting for our experiments. ... For each dataset, we assess the models' generated output by measuring perplexity using 128 samples, with each sample consisting of 2048 tokens. ... For MMLU and TruthfulQA, we use 5-shot prompts, and for HellaSwag, we use 0-shot prompts. For MMLU and HellaSwag, we evaluate each model with 1000 random samples, and for TruthfulQA, we evaluate them with 812 samples.
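The QoS metrics quoted above (average TTFT, average turnaround time, and throughput, timed "from PyTorch's perspective using Python's time package") can be sketched as a minimal measurement harness. This is a hypothetical sketch, not the authors' orchestrator: the `serve` function, its parameters, and the stand-in token generator are all illustrative assumptions.

```python
import time

def serve(requests, generate_token, n_out=25):
    """Hypothetical harness: measure average TTFT, average turnaround
    time, and throughput for a stream of single-batch requests."""
    ttfts, turnarounds = [], []
    for prompt in requests:
        start = time.perf_counter()
        generate_token(prompt)            # first output token -> TTFT
        ttfts.append(time.perf_counter() - start)
        for _ in range(n_out - 1):        # remaining output tokens
            generate_token(prompt)
        turnarounds.append(time.perf_counter() - start)
    return {
        "avg_ttft": sum(ttfts) / len(ttfts),
        "avg_turnaround": sum(turnarounds) / len(turnarounds),
        "throughput": len(requests),      # total processed requests
    }

# Usage with a stand-in generator; a real system would run the MoE model.
stats = serve(["prompt"] * 4, lambda p: time.sleep(0.001))
```

In the paper's setup, each request would be a 20-token prompt with 25 requested output tokens, and requests would arrive at a configurable rate λ rather than back-to-back as in this sketch.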