Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead
Authors: Rickard BrĂ¼el Gabrielsson, Jiacheng Zhu, Onkar Bhardwaj, Leshem Choshen, Kristjan Greenewald, Mikhail Yurochkin, Justin Solomon
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments with up to 1000 LoRAs demonstrate that compressed LoRAs preserve performance while offering major throughput gains in realistic serving scenarios with over a thousand LoRAs, maintaining 80% of the throughput of serving a single LoRA. We train a collection of more than 1000 high-quality LoRAs for Mistral-7B-Instruct-v0.2 (Jiang et al., 2023a) on 1000 natural instruction tasks (Wang et al., 2022) and demonstrate that our compression techniques preserve the performance of the original LoRAs. We incorporate LoRA compression into a state-of-the-art LLM serving system and demonstrate that it is possible to serve over 1000 LoRAs across thousands of asynchronous requests with throughput comparable to serving a single LoRA. |
| Researcher Affiliation | Collaboration | 1MIT CSAIL 2MIT-IBM Watson AI Lab. |
| Pseudocode | Yes | Listing 1: Pseudocode for add_lora_slice_with_sigma |
| Open Source Code | No | We will release over 1000 LoRAs to facilitate future work, as well as the code for our method. |
| Open Datasets | Yes | We trained LoRA adapters on 1000 natural instruction tasks (Wang et al., 2022) using Mistral-7B-Instruct-v0.2 (Jiang et al., 2023a) as the base. |
| Dataset Splits | Yes | Each task dataset was divided into training, validation, and test sets (80-10-10). |
| Hardware Specification | Yes | Experiments were conducted on an H100 80GB GPU capped at 40% memory consumption, to reflect situations where a service provider might want to serve many LoRAs from cheaper hardware with less memory than higher-end GPUs. |
| Software Dependencies | No | We use Huggingface (Wolf et al., 2020) in our implementation. LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "k_proj", "v_proj"], lora_dropout=0.05, bias="none", task_type="CAUSAL_LM", init_lora_weights=init_lora_weights) BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16) |
| Experiment Setup | Yes | We set all LoRA adapter ranks to 16 (i.e., r_i = 16 for all i), except for those in our ablation study (Appendix H.1), where we vary the LoRA rank. We selected 10 diverse tasks (Table 2 in Appendix C) manually for consistent evaluation across experiments and randomly sampled an additional 990 tasks, resulting in a total of 1000 tasks (Table 3). The tasks went through a robust reviewing protocol to ensure high quality and diversity. Each task dataset was divided into training, validation, and test sets. Hyperparameters, such as early stopping, were tuned using the validation sets. |
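The flattened configuration strings quoted in the Software Dependencies row match the Hugging Face `peft` and `transformers` APIs. A minimal sketch of how those settings might be instantiated follows; the paper's `init_lora_weights` value is not specified in the quoted excerpt, so it is left at the library default (`True`) here as an assumption.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# LoRA adapter configuration as quoted in the report:
# rank 16 on the attention projection matrices of Mistral-7B-Instruct-v0.2.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    init_lora_weights=True,  # assumption: excerpt leaves this variable unbound
)

# 4-bit NF4 quantization of the base model via bitsandbytes,
# with bfloat16 compute, as quoted in the report.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```

These objects would typically be passed to `AutoModelForCausalLM.from_pretrained(..., quantization_config=bnb_config)` and `peft.get_peft_model(model, lora_config)` respectively when reproducing the training setup.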