Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead

Authors: Rickard Brüel Gabrielsson, Jiacheng Zhu, Onkar Bhardwaj, Leshem Choshen, Kristjan Greenewald, Mikhail Yurochkin, Justin Solomon

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments with up to 1000 LoRAs demonstrate that compressed LoRAs preserve performance while offering major throughput gains in realistic serving scenarios with over a thousand LoRAs, maintaining 80% of the throughput of serving a single LoRA. We train a collection of more than 1000 high-quality LoRAs for Mistral-7B-Instruct-v0.2 (Jiang et al., 2023a) on 1000 natural instruction tasks (Wang et al., 2022) and demonstrate that our compression techniques preserve the performance of the original LoRAs. We incorporate LoRA compression into a state-of-the-art LLM serving system and demonstrate that it is possible to serve over 1000 LoRAs across thousands of asynchronous requests with throughput comparable to serving a single LoRA."
Researcher Affiliation | Collaboration | "1 MIT CSAIL, 2 MIT-IBM Watson AI Lab."
Pseudocode | Yes | "Listing 1: Pseudocode for add_lora_slice_with_sigma"
Open Source Code | No | "We will release over 1000 LoRAs to facilitate future work, as well as the code for our method."
Open Datasets | Yes | "We trained LoRA adapters on 1000 natural instruction tasks (Wang et al., 2022) using Mistral-7B-Instruct-v0.2 (Jiang et al., 2023a) as the base."
Dataset Splits | Yes | "Each task dataset was divided into training, validation, and test sets (80-10-10)."
Hardware Specification | Yes | "Experiments were conducted on an H100 80GB GPU capped at 40% memory consumption to reflect situations where a service provider might want to serve many LoRAs from cheaper hardware with lower memory than higher-end GPUs."
Software Dependencies | No | "We use Hugging Face (Wolf et al., 2020) in our implementation. LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "k_proj", "v_proj"], lora_dropout=0.05, bias="none", task_type="CAUSAL_LM", init_lora_weights=init_lora_weights); BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16)"
Experiment Setup | Yes | "We set all LoRA adapter ranks to 16 (i.e., ∀i, r_i = 16), except for those in our ablation study (Appendix H.1), where we vary the LoRA rank. We selected 10 diverse tasks (Table 2 in Appendix C) manually for consistent evaluation across experiments and randomly sampled an additional 990 tasks, resulting in a total of 1000 tasks (Table 3). The tasks went through a robust reviewing protocol to ensure high quality and diversity. Each task's data was divided into training, validation, and test sets. Hyperparameters, such as early stopping, were tuned using the validation sets."
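The configuration fragments quoted in the Software Dependencies row correspond to the Hugging Face PEFT and transformers APIs. A minimal sketch of how they would be assembled is below; variable names are illustrative, and the excerpt's init_lora_weights argument is omitted since its value is set elsewhere in the authors' code.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# LoRA adapter configuration as quoted: rank 16, alpha 32,
# applied to the attention projections of the base model.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# 4-bit NF4 quantization of the base model weights,
# with double quantization and bfloat16 compute, as quoted.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```

These two objects would typically be passed to the base-model loader and to PEFT's adapter wrapper, respectively, when fine-tuning each per-task adapter.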
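The 80-10-10 per-task split reported in the Dataset Splits row can be sketched as follows. This is a generic illustration, not the authors' code; the seeded shuffle and function name are assumptions.

```python
import random

def split_task_data(examples, seed=0):
    """Split one task's examples into train/val/test (80/10/10).

    The seeded shuffle is an assumption; the paper only states the ratios.
    """
    rng = random.Random(seed)
    examples = list(examples)
    rng.shuffle(examples)
    n_train = int(0.8 * len(examples))
    n_val = int(0.1 * len(examples))
    return (
        examples[:n_train],                 # training set
        examples[n_train:n_train + n_val],  # validation set (e.g., early stopping)
        examples[n_train + n_val:],         # held-out test set
    )

train, val, test = split_task_data(range(100))
# 100 examples -> 80 train / 10 val / 10 test
```

The validation slice is what the Experiment Setup row describes being used for hyperparameter tuning such as early stopping.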