Palu: KV-Cache Compression with Low-Rank Projection

Authors: Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Mohamed Abdelfattah, Kai-Chiang Wu

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments with popular LLMs show that Palu compresses the KV-Cache by 50% while maintaining strong accuracy and delivering up to 1.89× speedup on the RoPE-based attention module. When combined with quantization, Palu's inherent quantization-friendly design yields small to negligible extra accuracy degradation, while saving more memory than quantization-only methods and achieving up to 2.91× speedup for the RoPE-based attention. Moreover, it maintains comparable or even better accuracy (up to 1.19 lower perplexity) than quantization-only methods. These results demonstrate Palu's superior capability to effectively address the efficiency and memory challenges of LLM inference posed by the KV-Cache. Our code is publicly available at: https://github.com/shadowpa0327/Palu.
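To make the core idea behind the claimed 50% KV-Cache compression concrete, here is a minimal sketch of low-rank projection via truncated SVD. The shapes, rank, and variable names are illustrative assumptions, not taken from the paper: a projection matrix W is factored into A and B so that only the rank-r latent (x @ A) is cached, and keys are reconstructed on demand.

```python
import numpy as np

# Hypothetical sizes for illustration only: hidden size d, kept rank r.
# With r/d = 16/64 the cached latent is 4x smaller than the full keys.
rng = np.random.default_rng(0)
d, r = 64, 16
W = rng.standard_normal((d, d)) / np.sqrt(d)  # stand-in key projection

# Truncated SVD: W ≈ A @ B with A (d x r) and B (r x d).
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r]   # fused into the input projection
B = Vt[:r]             # applied only when keys are reconstructed

x = rng.standard_normal((8, d))  # 8 tokens of hidden states
latent = x @ A                   # cached: 8 x r instead of 8 x d
k_approx = latent @ B            # on-the-fly key reconstruction
```

The reconstruction error depends on how fast the singular values of W decay; the paper's truncation-aware SVD with whitening (Appendix K.1) refines this plain SVD step.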
Researcher Affiliation Academia Chi-Chih Chang1,3 Wei-Cheng Lin1 Chien-Yu Lin2 Chong-Yan Chen1 Yu-Fang Hu1 Pei-Shuo Wang1 Ning-Chi Huang1 Luis Ceze2 Mohamed S. Abdelfattah3 Kai-Chiang Wu1 1National Yang Ming Chiao Tung University 2University of Washington 3Cornell University
Pseudocode No The paper includes detailed descriptions of algorithms and kernels, for example in Appendix B "KERNEL IMPLEMENTATION DETAILS" and Appendix K.1 "TRUNCATION-AWARE SVD WITH WHITENING TRANSFORMATION", but none of these sections is explicitly labeled "Pseudocode" or "Algorithm"; they are presented in narrative or step-by-step prose rather than structured pseudocode blocks.
Open Source Code Yes Our code is publicly available at: https://github.com/shadowpa0327/Palu.
Open Datasets Yes We evaluate Palu on four LLM families, Llama-2 (Touvron et al., 2023), Llama-3 (Dubey et al., 2024), Mistral (Jiang et al., 2023) and LongChat (Li et al., 2023). For accuracy evaluation, we measure perplexity on the WikiText-2 (Merity et al., 2016) and C4 (Raffel et al., 2020) datasets and use LM-Evaluation-Harness (Gao et al., 2023) to measure zero-shot accuracy on six common sense tasks. We also evaluate long-context accuracy on 16 tasks in LongBench (Bai et al., 2023).
Dataset Splits No The paper mentions using specific datasets like WikiText-2, C4, and LongBench, and specifies sequence lengths for evaluation (e.g., "For perplexity, sequence length is 4096"; "2048 random samples from Wikitext-2, each with a sequence length of 1024" for Fisher information calculation; "maximum sequence length to 31500" for LongBench). However, it does not explicitly provide training/test/validation percentages or absolute sample counts for its own experimental setup, relying instead on standard benchmarks and evaluation harnesses.
Hardware Specification Yes We measure decode latency on a single RTX 4090 GPU and compare Palu to the FP16 and KIVI-4-bit baselines.
Software Dependencies No We implemented Palu based on the Huggingface library (Wolf et al., 2020). Decomposition of the Key and Value projection layers was performed using the truncation-aware SVD method proposed by SVD-LLM (Wang et al., 2024). We implemented a customized kernel for attention score with reconstruction in Triton (Tillet et al., 2019) (See Appendix B). For quantization integration, we implemented kernels in CUDA for attention output and non-RoPE attention score, where matrix fusion can be applied (refer to Sec. 3.1.1 and Fig. 2). While software components like Huggingface, SVD-LLM, Triton, and CUDA are mentioned, specific version numbers for these libraries or frameworks are not provided.
Experiment Setup Yes Unless otherwise specified, Palu's results are G-LRD with a group size of 4 (gs-4), with equal rank size for each group. To calculate Fisher information in rank searching, we used 2048 random samples from Wikitext-2, each with a sequence length of 1024. For quantization integration in Palu, we use a simple per-token, asymmetric integer quantization. To integrate LoRA with Palu, we introduce additional low-rank matrices A ∈ R^{d×r} and B ∈ R^{r×d} to refine the original low-rank projection as below: ... We sample 8k samples from the Alpaca training dataset as a fine-tuning dataset and apply LoRA with rank r = 32 and α = 32. All other hyper-parameters are aligned with Ashkboos et al. (2024a), except for the learning rate 2e-4, and the use of a cosine learning rate scheduler.
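The "per-token, asymmetric integer quantization" mentioned in the setup can be sketched as follows. This is an assumption-laden illustration, not the paper's kernel: the 4-bit width, function names, and use of a min/max (scale, zero-point) pair per token are all illustrative choices.

```python
import numpy as np

def quantize_per_token(x, bits=4):
    """Asymmetric uniform quantization with one (scale, min) pair per token.

    Each row of x is mapped to integers in [0, 2**bits - 1]; the per-row
    minimum serves as the zero point (asymmetric scheme).
    """
    qmax = 2 ** bits - 1
    xmin = x.min(axis=-1, keepdims=True)
    xmax = x.max(axis=-1, keepdims=True)
    scale = (xmax - xmin) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard constant rows
    q = np.clip(np.round((x - xmin) / scale), 0, qmax)
    return q.astype(np.uint8), scale, xmin

def dequantize(q, scale, xmin):
    return q * scale + xmin

# Example: quantize 8 tokens of (hypothetical) rank-16 cached latents.
rng = np.random.default_rng(0)
h = rng.standard_normal((8, 16))
q, s, z = quantize_per_token(h)
h_hat = dequantize(q, s, z)
```

The round-trip error per element is bounded by half the per-token scale, which is why quantizing the compact low-rank latents (smaller dynamic range per token) tends to be friendly to such a scheme.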