LowRA: Accurate and Efficient LoRA Fine-Tuning of LLMs under 2 Bits

Authors: Zikai Zhou, Qizheng Zhang, Hermann Kumbong, Kunle Olukotun

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations across 4 LLMs and 4 datasets show that LowRA achieves a superior performance-precision trade-off above 2 bits and remains accurate down to 1.15 bits, reducing memory usage by up to 50%. Our results highlight the potential of ultra-low-bit LoRA fine-tuning for resource-constrained environments. We extensively evaluate LowRA across 4 LLMs and 4 tasks, benchmarking against state-of-the-art baselines.
Researcher Affiliation | Academia | Department of Computer Science, Stanford University, Stanford, USA. Correspondence to: Zikai Zhou <EMAIL>.
Pseudocode | Yes | Algorithm 1: Channelwise Precision Assignment
Open Source Code | No | Open-Source Release: We will open-source LowRA upon publication to foster further research in ultra-low-bit LoRA fine-tuning.
Open Datasets | Yes | We use standard datasets across different NLP tasks: WikiText-2 (Merity et al., 2016) (language modeling, perplexity), Open Assistant (Köpf et al., 2024) (multi-turn conversation, perplexity), XSUM (Narayan et al., 2018) (summarization, ROUGE scores), and CNN/Daily Mail (Hermann et al., 2015) (summarization, ROUGE scores).
Dataset Splits | No | Each dataset is evaluated using the standard metrics used in prior work. For fine-tuning, we follow QLoRA's setup of using a batch size of 1 and a sequence length of 512.
Hardware Specification | Yes | Hardware Platform: Experiments are conducted on NVIDIA A100 GPUs (80 GB memory). Each LLaMA experiment runs on a single dedicated GPU. Each BART-large experiment runs two instances concurrently on a single GPU. On LLaMA-7B with an RTX 3080, 1.5-bit LowRA delivers a 3.42× throughput increase over QLoRA (32.16 vs. 9.40 tokens per second). On LLaMA-13B with an RTX A4000, the same 1.5-bit configuration still yields a 1.39× speed-up. We benchmarked LowRA at multiple bit-widths against QLoRA (4-bit) on an RTX A5000 (24 GB) and an A100 (80 GB).
Software Dependencies | No | We build our two-level ILP pipeline using the open-source COIN-OR Branch and Cut (CBC) solver (Saltzman, 2002) via the Python-based modeling library PuLP (Mitchell et al., 2011). We integrate this into the bitsandbytes library for usability.
Experiment Setup | Yes | Hyperparameters: For a fair comparison, we use identical hyperparameters across all methods, consistent with QLoRA (Dettmers et al., 2024) and LoftQ (Li et al., 2023). Details on selected hyperparameters are in Appendix I. Appendix I provides Table 10 (hyperparameters used for all LLaMA experiments on WikiText-2), Table 11 (hyperparameters for fine-tuning BART-large on CNN/Daily Mail), Table 12 (hyperparameters for fine-tuning BART-large on XSUM), and Table 13 (hyperparameters used for all LLaMA experiments on Open Assistant (oasst1)), all listing specific values such as 'lora r 64', 'lora alpha 64', 'learning rate 0.0003', 'per device train batch size 16', 'gradient accumulation steps 4', 'max steps 126', 'warmup ratio 0.03', etc.
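The paper's precision-assignment step (Algorithm 1, solved with CBC via PuLP) can be illustrated with a minimal sketch. This is not the authors' released code: the channel count, candidate bit-widths, per-channel error estimates, and the 2-bit average budget below are illustrative assumptions, chosen only to show the shape of a channelwise precision-assignment ILP in PuLP.

```python
# Hypothetical sketch of a channelwise precision-assignment ILP in PuLP,
# solved with the default CBC backend. All numbers below are illustrative,
# not taken from the LowRA paper.
import pulp

num_channels = 4
bit_choices = [1, 2, 4]  # candidate precisions per channel (assumed)

# Illustrative per-channel quantization-error estimates: fewer bits and
# higher channel index -> larger error.
error = {
    (c, b): (5 - b) * (c + 1) * 0.1
    for c in range(num_channels)
    for b in bit_choices
}
avg_bit_budget = 2.0  # target average precision ("under 2 bits")

prob = pulp.LpProblem("channelwise_precision", pulp.LpMinimize)
# x[c, b] = 1 iff channel c is assigned b bits
x = pulp.LpVariable.dicts("x", error.keys(), cat="Binary")

# Objective: minimize total quantization error across channels.
prob += pulp.lpSum(error[k] * x[k] for k in error)

# Each channel gets exactly one precision.
for c in range(num_channels):
    prob += pulp.lpSum(x[(c, b)] for b in bit_choices) == 1

# Average bit-width across channels must not exceed the budget.
prob += (
    pulp.lpSum(b * x[(c, b)] for c in range(num_channels) for b in bit_choices)
    <= avg_bit_budget * num_channels
)

prob.solve(pulp.PULP_CBC_CMD(msg=False))
assignment = {c: b for (c, b) in error if pulp.value(x[(c, b)]) == 1}
print(assignment)
```

Under these assumed error estimates, the solver spends the bit budget on the highest-error channels and drops the rest to 1 bit, which is the qualitative behavior a channelwise assignment is meant to capture.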