CommVQ: Commutative Vector Quantization for KV Cache Compression

Authors: Junyan Li, Yang Zhang, Muhammad Yusuf Hassan, Talha Chafekar, Tianle Cai, Zhile Ren, Pengsheng Guo, Foroozan Karimzadeh, Colorado Reed, Chong Wang, Chuang Gan

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments on long-context benchmarks and GSM8K show that our method reduces FP16 KV cache size by 87.5% with 2-bit quantization, while outperforming state-of-the-art KV cache quantization methods.
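The quoted 87.5% figure follows directly from the bit widths: replacing 16-bit FP16 cache entries with 2-bit codes keeps only 2 of every 16 bits. A one-line sanity check (illustrative arithmetic only, not the paper's code):

```python
# 2-bit quantization of an FP16 (16-bit) KV cache removes 14/16 of the bits.
fp16_bits = 16
quant_bits = 2
saving = 1 - quant_bits / fp16_bits
print(f"{saving:.1%}")  # 87.5%
```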
Researcher Affiliation Collaboration 1University of Massachusetts Amherst 2Massachusetts Institute of Technology 3Princeton University 4Apple Inc. Correspondence to: Junyan Li <EMAIL>.
Pseudocode Yes Algorithm 1 EM Algorithm for Learning RoPE Commutative Codebook.
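For context on what an EM codebook-learning loop looks like, here is a plain k-means-style EM sketch. The paper's Algorithm 1 additionally constrains codewords to commute with RoPE; that constraint is omitted here, so this is a generic reference point, not the paper's method.

```python
import numpy as np

def learn_codebook(x, num_codes=256, iters=20, seed=0):
    """Generic EM (k-means) codebook learning over vectors x of shape (n, dim)."""
    rng = np.random.default_rng(seed)
    # Initialize codewords from random data points.
    codebook = x[rng.choice(len(x), num_codes, replace=False)]
    for _ in range(iters):
        # E-step: assign each vector to its nearest codeword.
        d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        # M-step: move each codeword to the mean of its assigned vectors.
        for c in range(num_codes):
            members = x[assign == c]
            if len(members):
                codebook[c] = members.mean(0)
    return codebook, assign

vecs = np.random.default_rng(1).normal(size=(1000, 8)).astype(np.float32)
codebook, assign = learn_codebook(vecs, num_codes=16)
```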
Open Source Code Yes The source code is available at: https://github. com/UMass-Embodied-AGI/Comm VQ.
Open Datasets Yes A subset of the FineWeb-Edu dataset (Lozhkov et al., 2024) is used to learn the encoder and codebooks. The paper reports extensive evaluation on two long-context benchmarks, LongBench (Bai et al., 2023) and InfiniteBench (Zhang et al., 2024b), as well as GSM8K (Cobbe et al., 2021).
Dataset Splits No The paper mentions using a "subset of the FineWeb-Edu dataset" for calibration and evaluates on established benchmarks (LongBench, InfiniteBench, GSM8K). While these benchmarks typically have standard splits, the paper does not explicitly provide specific percentages, sample counts, or detailed methodology for how the FineWeb-Edu subset was created or partitioned, nor does it explicitly cite the specific splits used for the benchmarks.
Hardware Specification Yes Notably, it enables 1-bit KV cache quantization with minimal accuracy loss, allowing a LLaMA-3.1 8B model to run with a 128K context length on a single RTX 4090 GPU. Figure 3 highlights the real per-token decoding memory savings using the LLaMA-3.1-8B-Instruct model, measured on an H100-80GB GPU.
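The 128K-context-on-a-4090 claim is plausible on a back-of-the-envelope basis. The model shape used below (32 layers, 8 KV heads via GQA, head dim 128) is our assumption about LLaMA-3.1-8B, not taken from the paper, and the estimate ignores codebook overhead:

```python
# Rough KV-cache sizing for LLaMA-3.1-8B (assumed shape: 32 layers,
# 8 KV heads, head dim 128) at a 128K-token context.
layers, kv_heads, head_dim = 32, 8, 128
ctx = 128 * 1024
# K and V caches, 2 bytes per FP16 value.
bytes_per_token_fp16 = 2 * layers * kv_heads * head_dim * 2
fp16_gib = bytes_per_token_fp16 * ctx / 2**30
one_bit_gib = fp16_gib / 16  # 16 bits per value -> 1 bit per value
print(f"FP16: {fp16_gib:.0f} GiB, 1-bit: {one_bit_gib:.0f} GiB")
```

At FP16 the cache alone (~16 GiB) crowds a 24 GB RTX 4090 once weights are loaded; at 1 bit it shrinks to roughly 1 GiB, which is consistent with the claim.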
Software Dependencies No The paper mentions using "Gumbel-softmax" and states that a "Triton implementation" and "Triton kernels" were used. However, it does not provide specific version numbers for any software, libraries, or programming languages used in the experiments.
Experiment Setup Yes We evaluate CommVQ using the latest LLaMA-3.1-8B-Instruct model (Dubey et al., 2024). For VQLLM, we set C = 256, K = 8 for 2-bit and C = 256, K = 4 for 1-bit quantization. A larger g and larger R lead to better quantization accuracy, but also to higher computation cost and a lower compression rate. To achieve a good trade-off between quantization accuracy, computation cost, and compression rate, we set g = 64, Nc = 64 for all our main experiments, and R = 11 for 1-bit quantization and R = 21 for 2-bit quantization, respectively.
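To make the C/K trade-off mentioned in the setup concrete, here is a generic residual vector quantization sketch: each additional stage K spends log2(C) more index bits per vector but only has to approximate the residual left by earlier stages. This is illustrative only; CommVQ's actual encoder and codebook structure may differ.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Encode vectors x (n, dim) through K residual stages, each a (C, dim) codebook."""
    residual, codes = x.copy(), []
    for cb in codebooks:
        # Pick the nearest codeword for the current residual, then subtract it.
        d = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(1)
        codes.append(idx)
        residual -= cb[idx]
    return np.stack(codes, 1)  # (n, K) integer codes

def rvq_decode(codes, codebooks):
    # Reconstruction is the sum of the selected codeword at every stage.
    return sum(cb[codes[:, k]] for k, cb in enumerate(codebooks))

rng = np.random.default_rng(0)
C, K, dim = 256, 8, 16
codebooks = [rng.normal(scale=0.5 ** k, size=(C, dim)).astype(np.float32)
             for k in range(K)]
x = rng.normal(size=(100, dim)).astype(np.float32)
codes = rvq_encode(x, codebooks)
bits_per_vector = K * np.log2(C)  # index cost: K * log2(C) = 64 bits per vector
```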