Dynamic Operator Optimization for Efficient Multi-Tenant LoRA Model Serving

Authors: Changhai Zhou, Yuhua Zhou, Shiyang Zhang, Yibin Wang, Zekai Liu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that Dop can improve throughput by 1.30-1.46 times in a SOTA multi-tenant LoRA serving. ... The experiments were conducted with batch sizes ranging from 1 to 64. Each configuration was tested 1,000 times, and the average latency was recorded. ... Our results consistently show that Dop outperforms the existing solutions in both the LoRA operator microbenchmark and text generation throughput.
Researcher Affiliation | Academia | 1 School of Computer Science, Fudan University; 2 College of Computer Science and Technology, Zhejiang University; 3 Columbia University; 4 Zhejiang Lab. EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the Dynamic Operator Optimization (Dop) method and its components (Search Space Constructor, Optimization Engine) in detail, but it does so using descriptive text and flowcharts (Figure 2), not formal pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions using the 'Hugging Face Transformers (Wolf et al. 2020) library' and 'Hugging Face PEFT library (Mangrulkar et al. 2022)' for Llama-2 models but does not provide any statement or link for the release of the authors' own implementation code for the methodology described.
Open Datasets | Yes | In this study, we evaluate our method using the Llama-2 models (Touvron et al. 2023) with 7B and 13B parameters.
Dataset Splits | No | The paper discusses 'workload types' such as Distinct, Uniform, Skewed, and Identical request distributions, and mentions testing with '1,000 requests' for throughput evaluation. However, it does not provide specific details on training/test/validation dataset splits, as its focus is on model serving performance rather than training.
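The four workload types named above describe how incoming requests are spread across LoRA adapters. Since the paper's exact generators are not released, the sketch below is one plausible interpretation: `generate_workload`, the round-robin reading of "Distinct", and the 1/(i+1) skew weights are all assumptions for illustration, not the authors' specification.

```python
import random

def generate_workload(kind: str, n_requests: int, n_adapters: int, seed: int = 0):
    """Assign each request to a LoRA adapter index under one of the four
    workload types. Definitions here are assumed, not taken from the paper."""
    rng = random.Random(seed)
    if kind == "identical":
        # Every request targets the same single adapter.
        return [0] * n_requests
    if kind == "distinct":
        # Each request targets a different adapter, cycling round-robin
        # when there are more requests than adapters.
        return [i % n_adapters for i in range(n_requests)]
    if kind == "uniform":
        # Adapters drawn uniformly at random.
        return [rng.randrange(n_adapters) for _ in range(n_requests)]
    if kind == "skewed":
        # Zipf-like skew: adapter i drawn with weight 1 / (i + 1),
        # so a few adapters receive most of the traffic.
        weights = [1.0 / (i + 1) for i in range(n_adapters)]
        return rng.choices(range(n_adapters), weights=weights, k=n_requests)
    raise ValueError(f"unknown workload type: {kind}")
```

A serving benchmark would then replay, say, 1,000 such requests against the system and measure end-to-end throughput for each workload type.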
Hardware Specification | Yes | The hardware used includes NVIDIA A100 40GB and NVIDIA RTX 3090 GPUs.
Software Dependencies | Yes | All experiments are conducted on Ubuntu with PyTorch 2.1.2 and CUDA 12.4. The Llama-2 models are implemented using the Hugging Face Transformers (Wolf et al. 2020) library, with LoRA weights integrated via the Hugging Face PEFT library (Mangrulkar et al. 2022).
Experiment Setup | Yes | For each scenario, Dop was executed with 300 mutation iterations, with the entire Dop execution taking approximately 1.5 hours. ... The maximum batch size was set to 32, and all systems processed requests in a first-come, first-served manner. ... The experiments were conducted with batch sizes ranging from 1 to 64.
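The measurement protocol quoted above (batch sizes 1 to 64, 1,000 trials per configuration, average latency recorded) can be sketched as a small harness. Since the authors' operator implementations are not released, `dummy_lora_op` below is a hypothetical stand-in; the harness itself only reflects the reported protocol, not the paper's actual benchmarking code.

```python
import time
import statistics

def benchmark_operator(op, batch_sizes=(1, 2, 4, 8, 16, 32, 64), n_runs=1000):
    """Measure the mean latency of op(batch_size) over n_runs repetitions,
    mirroring the reported protocol: sweep batch sizes 1-64, run each
    configuration 1,000 times, and record the average latency."""
    results = {}
    for bs in batch_sizes:
        op(bs)  # warm-up run, excluded from timing
        samples = []
        for _ in range(n_runs):
            start = time.perf_counter()
            op(bs)
            samples.append(time.perf_counter() - start)
        results[bs] = statistics.mean(samples)
    return results

def dummy_lora_op(batch_size):
    """Hypothetical stand-in for a LoRA operator: a toy computation whose
    cost grows with batch size, used only to exercise the harness."""
    return sum(i * i for i in range(batch_size * 64))
```

In a real reproduction attempt, `dummy_lora_op` would be replaced by the fused LoRA kernel under test, and the resulting per-batch-size means compared against a baseline implementation to recover throughput ratios.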