Dynamic Operator Optimization for Efficient Multi-Tenant LoRA Model Serving
Authors: Changhai Zhou, Yuhua Zhou, Shiyang Zhang, Yibin Wang, Zekai Liu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that Dop can improve throughput by 1.30-1.46 times in a SOTA multi-tenant LoRA serving. ... The experiments were conducted with batch sizes ranging from 1 to 64. Each configuration was tested 1,000 times, and the average latency was recorded. ... Our results consistently show that Dop outperforms the existing solutions in both the LoRA operator microbenchmark and text generation throughput. |
| Researcher Affiliation | Academia | 1 School of Computer Science, Fudan University 2 College of Computer Science and Technology, Zhejiang University 3 Columbia University 4 Zhejiang Lab EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the Dynamic Operator Optimization (Dop) method and its components (Search Space Constructor, Optimization Engine) in detail, but it does so using descriptive text and flowcharts (Figure 2), not formal pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions using the 'Hugging Face Transformers (Wolf et al. 2020) library' and 'Hugging Face PEFT library (Mangrulkar et al. 2022)' for the Llama-2 models, but it neither states that the authors' own implementation of the described methodology will be released nor provides a link to it. |
| Open Datasets | Yes | In this study, we evaluate our method using the Llama-2 models (Touvron et al. 2023) with 7B and 13B parameters. |
| Dataset Splits | No | The paper discusses 'workload types' such as Distinct, Uniform, Skewed, and Identical request distributions, and mentions testing with '1,000 requests' for throughput evaluation. However, it does not provide specific details on training/test/validation dataset splits, as its focus is on model serving performance rather than training. |
| Hardware Specification | Yes | The hardware used includes NVIDIA A100 40GB and NVIDIA RTX 3090 GPUs. |
| Software Dependencies | Yes | All experiments are conducted on Ubuntu with PyTorch 2.1.2 and CUDA 12.4. The Llama-2 models are implemented using the Hugging Face Transformers (Wolf et al. 2020) library, with LoRA weights integrated via the Hugging Face PEFT library (Mangrulkar et al. 2022). |
| Experiment Setup | Yes | For each scenario, Dop was executed with 300 mutation iterations, with the entire Dop execution taking approximately 1.5 hours. ... The maximum batch size was set to 32, and all systems processed requests in a first-come, first-served manner. ... The experiments were conducted with batch sizes ranging from 1 to 64. |
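The microbenchmark protocol quoted above (sweep batch sizes, repeat each configuration many times, record the average latency) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the LoRA forward pass `y = xW + (xA)B`, the dimensions, and the helper names are all assumptions for demonstration, and the trial count is reduced from the paper's 1,000 for brevity.

```python
import time
import numpy as np

def lora_forward(x, W, A, B):
    # Hypothetical LoRA-style operator: base projection plus low-rank update.
    return x @ W + (x @ A) @ B

def bench_latency(batch_sizes, d_in=256, d_out=256, rank=8, trials=50, seed=0):
    """Return average forward latency (seconds) for each batch size."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((d_in, d_out))
    A = rng.standard_normal((d_in, rank))
    B = rng.standard_normal((rank, d_out))
    results = {}
    for bs in batch_sizes:
        x = rng.standard_normal((bs, d_in))
        lora_forward(x, W, A, B)  # warm-up run, excluded from timing
        start = time.perf_counter()
        for _ in range(trials):
            lora_forward(x, W, A, B)
        results[bs] = (time.perf_counter() - start) / trials
    return results

# Sweep the batch-size range reported in the paper (1 to 64).
latencies = bench_latency([1, 2, 4, 8, 16, 32, 64])
for bs, t in latencies.items():
    print(f"batch={bs:>2}  avg latency={t * 1e6:.1f} us")
```

A real serving benchmark would time GPU kernels (with proper synchronization) rather than NumPy matmuls, but the structure — warm-up, repeated trials, averaged wall-clock time per configuration — is the same.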