CommVQ: Commutative Vector Quantization for KV Cache Compression

Authors: Junyan Li, Yang Zhang, Muhammad Yusuf Hassan, Talha Chafekar, Tianle Cai, Zhile Ren, Pengsheng Guo, Foroozan Karimzadeh, Colorado Reed, Chong Wang, Chuang Gan

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments on long-context benchmarks and GSM8K show that our method reduces FP16 KV cache size by 87.5% with 2-bit quantization, while outperforming state-of-the-art KV cache quantization methods.
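The quoted 87.5% figure follows directly from the bit widths: replacing 16-bit FP16 cache entries with 2-bit codes keeps only 2 of every 16 bits. A one-line sanity check (illustrative arithmetic only, not the paper's code):

```python
# 2-bit quantization of an FP16 (16-bit) KV cache removes 14/16 of the bits.
fp16_bits = 16
quant_bits = 2
saving = 1 - quant_bits / fp16_bits
print(f"{saving:.1%}")  # 87.5%
```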
Researcher Affiliation Collaboration 1University of Massachusetts Amherst 2Massachusetts Institute of Technology 3Princeton University 4Apple Inc. Correspondence to: Junyan Li <EMAIL>.
Pseudocode Yes Algorithm 1 EM Algorithm for Learning RoPE Commutative Codebook.
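For context on what an EM codebook-learning loop looks like, here is a plain k-means-style EM sketch. The paper's Algorithm 1 additionally constrains codewords to commute with RoPE; that constraint is omitted here, so this is a generic reference point, not the paper's method.

```python
import numpy as np

def learn_codebook(x, num_codes=256, iters=20, seed=0):
    """Generic EM (k-means) codebook learning over vectors x of shape (n, dim)."""
    rng = np.random.default_rng(seed)
    # Initialize codewords from random data points.
    codebook = x[rng.choice(len(x), num_codes, replace=False)]
    for _ in range(iters):
        # E-step: assign each vector to its nearest codeword.
        d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        # M-step: move each codeword to the mean of its assigned vectors.
        for c in range(num_codes):
            members = x[assign == c]
            if len(members):
                codebook[c] = members.mean(0)
    return codebook, assign

vecs = np.random.default_rng(1).normal(size=(1000, 8)).astype(np.float32)
codebook, assign = learn_codebook(vecs, num_codes=16)
```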
Open Source Code Yes The source code is available at: https://github. com/UMass-Embodied-AGI/Comm VQ.
Open Datasets Yes A subset of the FineWeb-Edu dataset (Lozhkov et al., 2024) is used to learn the encoder and codebooks. The paper reports extensive evaluation on two long-context benchmarks, LongBench (Bai et al., 2023) and InfiniteBench (Zhang et al., 2024b), as well as GSM8K (Cobbe et al., 2021).
Dataset Splits No The paper mentions using a "subset of the FineWeb-Edu dataset" for calibration and evaluates on established benchmarks (LongBench, InfiniteBench, GSM8K). While these benchmarks typically have standard splits, the paper does not explicitly provide specific percentages, sample counts, or detailed methodology for how the FineWeb-Edu subset was created or partitioned, nor does it explicitly cite the specific splits used for the benchmarks.
Hardware Specification Yes Notably, it enables 1-bit KV cache quantization with minimal accuracy loss, allowing a LLaMA-3.1 8B model to run with a 128K context length on a single RTX 4090 GPU. Figure 3 highlights the real per-token decoding memory savings using the LLaMA-3.1-8B-Instruct model, measured on an H100-80GB GPU.
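The 128K-context-on-a-4090 claim is plausible on a back-of-the-envelope basis. The model shape used below (32 layers, 8 KV heads via GQA, head dim 128) is our assumption about LLaMA-3.1-8B, not taken from the paper, and the estimate ignores codebook overhead:

```python
# Rough KV-cache sizing for LLaMA-3.1-8B (assumed shape: 32 layers,
# 8 KV heads, head dim 128) at a 128K-token context.
layers, kv_heads, head_dim = 32, 8, 128
ctx = 128 * 1024
# K and V caches, 2 bytes per FP16 value.
bytes_per_token_fp16 = 2 * layers * kv_heads * head_dim * 2
fp16_gib = bytes_per_token_fp16 * ctx / 2**30
one_bit_gib = fp16_gib / 16  # 16 bits per value -> 1 bit per value
print(f"FP16: {fp16_gib:.0f} GiB, 1-bit: {one_bit_gib:.0f} GiB")
```

At FP16 the cache alone (~16 GiB) crowds a 24 GB RTX 4090 once weights are loaded; at 1 bit it shrinks to roughly 1 GiB, which is consistent with the claim.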
Software Dependencies No The paper mentions using "Gumbel-softmax" and states that a "Triton implementation" and "Triton kernels" were used. However, it does not provide specific version numbers for any software, libraries, or programming languages used in the experiments.
Experiment Setup Yes We evaluate CommVQ using the latest LLaMA-3.1-8B-Instruct model (Dubey et al., 2024). For VQLLM, we set C = 256, K = 8 for 2-bit and C = 256, K = 4 for 1-bit quantization. A larger g and larger R lead to better quantization accuracy, but also to higher computation cost and a lower compression rate. To achieve a good trade-off between quantization accuracy, computation cost, and compression rate, we set g = 64, Nc = 64 for all our main experiments, and R = 11 for 1-bit quantization and R = 21 for 2-bit quantization, respectively.
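To make the C/K trade-off mentioned in the setup concrete, here is a generic residual vector quantization sketch: each additional stage K spends log2(C) more index bits per vector but only has to approximate the residual left by earlier stages. This is illustrative only; CommVQ's actual encoder and codebook structure may differ.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Encode vectors x (n, dim) through K residual stages, each a (C, dim) codebook."""
    residual, codes = x.copy(), []
    for cb in codebooks:
        # Pick the nearest codeword for the current residual, then subtract it.
        d = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(1)
        codes.append(idx)
        residual -= cb[idx]
    return np.stack(codes, 1)  # (n, K) integer codes

def rvq_decode(codes, codebooks):
    # Reconstruction is the sum of the selected codeword at every stage.
    return sum(cb[codes[:, k]] for k, cb in enumerate(codebooks))

rng = np.random.default_rng(0)
C, K, dim = 256, 8, 16
codebooks = [rng.normal(scale=0.5 ** k, size=(C, dim)).astype(np.float32)
             for k in range(K)]
x = rng.normal(size=(100, dim)).astype(np.float32)
codes = rvq_encode(x, codebooks)
bits_per_vector = K * np.log2(C)  # index cost: K * log2(C) = 64 bits per vector
```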