Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead
Authors: Rickard BrĂ¼el Gabrielsson, Jiacheng Zhu, Onkar Bhardwaj, Leshem Choshen, Kristjan Greenewald, Mikhail Yurochkin, Justin Solomon
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments with up to 1000 LoRAs demonstrate that compressed LoRAs preserve performance while offering major throughput gains in realistic serving scenarios with over a thousand LoRAs, maintaining 80% of the throughput of serving a single LoRA. We train a collection of more than 1000 high-quality LoRAs for Mistral-7B-Instruct-v0.2 (Jiang et al., 2023a) on 1000 natural instruction tasks (Wang et al., 2022) and demonstrate that our compression techniques preserve the performance of the original LoRAs. We incorporate LoRA compression into a state-of-the-art LLM serving system and demonstrate that it is possible to serve over 1000 LoRAs across thousands of asynchronous requests with throughput comparable to serving a single LoRA. |
| Researcher Affiliation | Collaboration | 1MIT CSAIL 2MIT-IBM Watson AI Lab. |
| Pseudocode | Yes | Listing 1: Pseudocode for add_lora_slice_with_sigma |
| Open Source Code | No | We will release over 1000 LoRAs to facilitate future work, as well as the code for our method. |
| Open Datasets | Yes | We trained LoRA adapters on 1000 natural instruction tasks (Wang et al., 2022) using Mistral-7B-Instruct-v0.2 (Jiang et al., 2023a) as the base. |
| Dataset Splits | Yes | Each task dataset was divided into training, validation, and test sets (80-10-10). |
| Hardware Specification | Yes | Experiments were conducted on an H100 80GB GPU capped at 40% memory consumption, to reflect situations where a service provider might want to serve many LoRAs from cheaper hardware with less memory than higher-end GPUs. |
| Software Dependencies | No | We use Huggingface (Wolf et al., 2020) in our implementation. LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "k_proj", "v_proj"], lora_dropout=0.05, bias="none", task_type="CAUSAL_LM", init_lora_weights=init_lora_weights) BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16) |
| Experiment Setup | Yes | We set all LoRA adapter ranks to 16 (i.e., r_i = 16 for all i), except for those in our ablation study (Appendix H.1), where we vary the LoRA rank. We selected 10 diverse tasks (Table 2 in Appendix C) manually for consistent evaluation across experiments and randomly sampled an additional 990 tasks, resulting in a total of 1000 tasks (Table 3). The tasks went through a robust reviewing protocol to ensure high quality and diversity. Each task dataset was divided into training, validation, and test sets. Hyperparameters, such as early stopping, were tuned using the validation sets. |
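The flattened configuration strings quoted in the Software Dependencies row match the Hugging Face `peft` and `transformers` APIs. A minimal sketch of how those settings might be instantiated follows; the paper's `init_lora_weights` value is not specified in the quoted excerpt, so it is left at the library default (`True`) here as an assumption.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# LoRA adapter configuration as quoted in the report:
# rank 16 on the attention projection matrices of Mistral-7B-Instruct-v0.2.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    init_lora_weights=True,  # assumption: excerpt leaves this variable unbound
)

# 4-bit NF4 quantization of the base model via bitsandbytes,
# with bfloat16 compute, as quoted in the report.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```

These objects would typically be passed to `AutoModelForCausalLM.from_pretrained(..., quantization_config=bnb_config)` and `peft.get_peft_model(model, lora_config)` respectively when reproducing the training setup.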