Palu: KV-Cache Compression with Low-Rank Projection
Authors: Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Mohamed Abdelfattah, Kai-Chiang Wu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments with popular LLMs show that Palu compresses KV-Cache by 50%, while maintaining strong accuracy and delivering up to 1.89× speedup on the RoPE-based attention module. When combined with quantization, Palu's inherent quantization-friendly design yields small to negligible extra accuracy degradation, while saving additional memory than quantization-only methods and achieving up to 2.91× speedup for the RoPE-based attention. Moreover, it maintains comparable or even better accuracy (up to 1.19 lower perplexity) compared to quantization-only methods. These results demonstrate Palu's superior capability to effectively address the efficiency and memory challenges of LLM inference posed by KV-Cache. Our code is publicly available at: https://github.com/shadowpa0327/Palu. |
| Researcher Affiliation | Academia | Chi-Chih Chang (1,3), Wei-Cheng Lin (1), Chien-Yu Lin (2), Chong-Yan Chen (1), Yu-Fang Hu (1), Pei-Shuo Wang (1), Ning-Chi Huang (1), Luis Ceze (2), Mohamed S. Abdelfattah (3), Kai-Chiang Wu (1). Affiliations: 1 National Yang Ming Chiao Tung University; 2 University of Washington; 3 Cornell University |
| Pseudocode | No | The paper includes detailed descriptions of algorithms and kernels, for example in Appendix B "KERNEL IMPLEMENTATION DETAILS" and Appendix K.1 "TRUNCATION-AWARE SVD WITH WHITENING TRANSFORMATION", but none of these sections is explicitly labeled "Pseudocode" or "Algorithm"; the material is presented as narrative or step-by-step prose rather than structured pseudocode blocks. |
| Open Source Code | Yes | Our code is publicly available at: https://github.com/shadowpa0327/Palu. |
| Open Datasets | Yes | We evaluate Palu on four LLM families, Llama-2 (Touvron et al., 2023), Llama-3 (Dubey et al., 2024), Mistral (Jiang et al., 2023) and LongChat (Li et al., 2023). For accuracy evaluation, we measure perplexity on the WikiText-2 (Merity et al., 2016) and C4 (Raffel et al., 2020) datasets and use LM-Evaluation-Harness (Gao et al., 2023) to measure zero-shot accuracy on six common sense tasks. We also evaluate long context accuracy on 16 tasks in LongBench (Bai et al., 2023). |
| Dataset Splits | No | The paper mentions using specific datasets like WikiText-2, C4, and LongBench, and specifies sequence lengths for evaluation (e.g., "For perplexity, sequence length is 4096"; "2048 random samples from Wikitext-2, each with a sequence length of 1024" for Fisher information calculation; "maximum sequence length to 31500" for LongBench). However, it does not explicitly provide specific training/test/validation percentages or absolute sample counts for these datasets as applied to its own experimental setup, instead relying on standard benchmarks and evaluation harnesses. |
| Hardware Specification | Yes | We measure decode latency on a single RTX 4090 GPU and compare Palu to the FP16 and KIVI-4-bit baselines. |
| Software Dependencies | No | We implemented Palu based on the Huggingface library (Wolf et al., 2020). Decomposition of the Key and Value projection layers was performed using the truncation-aware SVD method proposed by SVD-LLM (Wang et al., 2024). We implemented a customized kernel for attention score with reconstruction in Triton (Tillet et al., 2019) (See Appendix B). For quantization integration, we implemented kernels in CUDA for attention output and non-RoPE attention score, where matrix fusion can be applied (refer to Sec. 3.1.1 and Fig. 2). While software components like Huggingface, SVD-LLM, Triton, and CUDA are mentioned, specific version numbers for these libraries or frameworks are not provided. |
| Experiment Setup | Yes | Unless otherwise specified, Palu's results are G-LRD with a group size of 4 (gs-4), with equal rank size for each group. To calculate Fisher information in rank searching, we used 2048 random samples from Wikitext-2, each with a sequence length of 1024. For quantization integration in Palu, we use a simple per-token, asymmetric integer quantization. To integrate LoRA with Palu, we introduce additional low-rank matrices A_r ∈ R^{d×r} and B_r ∈ R^{r×d} to refine the original low-rank projection as below: ... We sample 8k samples from the Alpaca training dataset as a fine-tuning dataset and apply LoRA with rank r = 32 and α = 32. All other hyper-parameters are aligned with Ashkboos et al. (2024a), except for the learning rate 2e-4, and the use of a cosine learning rate scheduler. |
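For context on the low-rank decomposition the table references, here is a minimal, hypothetical sketch of the underlying idea: factor a projection weight into two low-rank matrices and cache the small latent per token. This uses plain truncated SVD with illustrative names (`A`, `B`, `r`); the paper itself applies the truncation-aware SVD with a whitening transformation from SVD-LLM, which this sketch does not reproduce.

```python
import numpy as np

# Hedged sketch of the core idea behind Palu: factor a projection weight
# W (d x d) into A (d x r) and B (r x d), so each token caches the r-dim
# latent h = x @ A instead of the full d-dim projection x @ W.
# Plain truncated SVD is used here for illustration only; the paper uses
# a truncation-aware SVD with whitening (SVD-LLM).

def low_rank_decompose(W: np.ndarray, r: int):
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]   # absorb singular values into the left factor
    B = Vt[:r, :]
    return A, B

rng = np.random.default_rng(0)
d, r = 64, 16
W = rng.standard_normal((d, d))
A, B = low_rank_decompose(W, r)

x = rng.standard_normal((8, d))  # activations for 8 tokens
latent = x @ A                   # cached: shape (8, 16) instead of (8, 64)
y_approx = latent @ B            # reconstructed projection output, (8, 64)
```

With `r = d/4` the cached latent is a quarter of the original size; the reconstruction `latent @ B` can be fused into downstream matrix multiplies, which is the kernel-level optimization the paper's Triton/CUDA sections describe.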
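The setup row also mentions "a simple per-token, asymmetric integer quantization." A generic sketch of that scheme is below, assuming 4-bit integers for illustration; this is my own reference implementation, not the paper's CUDA kernel, and the function names are hypothetical.

```python
import numpy as np

# Generic per-token (row-wise) asymmetric integer quantization sketch.
# Each token's row gets its own scale and zero-point. 4 bits assumed
# for illustration; this is not the paper's CUDA kernel.

def quantize_per_token(x: np.ndarray, bits: int = 4):
    qmax = 2**bits - 1
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / qmax, 1.0)  # avoid div-by-zero
    zero = np.round(-lo / scale)                       # per-row zero-point
    q = np.clip(np.round(x / scale + zero), 0, qmax).astype(np.uint8)
    return q, scale, zero

def dequantize_per_token(q: np.ndarray, scale: np.ndarray, zero: np.ndarray):
    return (q.astype(np.float32) - zero) * scale

rng = np.random.default_rng(0)
kv = rng.standard_normal((8, 64)).astype(np.float32)  # e.g. cached latents
q, scale, zero = quantize_per_token(kv)
kv_hat = dequantize_per_token(q, scale, zero)
```

Asymmetric (min-max) quantization with per-token scales keeps the round-trip error within about one quantization step per element, which is one reason low-rank latents, whose dynamic range is compacted, quantize gracefully, consistent with the paper's "quantization-friendly" claim.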