SpinQuant: LLM Quantization with Learned Rotations

Authors: Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, Tijmen Blankevoort

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. To rigorously assess the effectiveness of SpinQuant, we executed comprehensive experiments across seven leading large language models (LLMs), including the LLaMA-2 (Touvron et al., 2023b) models (7B/13B/70B), the LLaMA-3 (AI@Meta, 2024) models (1B/3B/8B), and the Mistral 7B (Jiang et al., 2023) model. The key contributions of this study are summarized as follows: We introduce SpinQuant, the first method that employs learned rotations to mitigate outliers in weight and activation distributions, boosting the performance of quantized LLMs. We reveal that random rotations introduce substantial variance in quantized network performance. We propose optimizing the rotation matrices on the Stiefel manifold, directly minimizing the final loss of the rotated quantized network. Ablation studies validate that our learned rotations consistently outperform random rotations, with improvements of up to 16.2 points.
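The Stiefel-manifold optimization mentioned above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (SpinQuant's released code should be consulted for the real optimizer); it only shows the standard trick of projecting a Euclidean gradient onto the tangent space of the orthogonal group and retracting with a Cayley transform, so each update keeps the rotation matrix exactly orthogonal. The function name `cayley_update` and the step size are illustrative assumptions.

```python
import numpy as np

def cayley_update(R, G, lr):
    """One constrained gradient step that keeps R orthogonal.

    R  : current rotation matrix (orthogonal, n x n)
    G  : Euclidean gradient of the loss w.r.t. R
    lr : step size
    """
    # Skew-symmetric projection of the gradient onto the tangent space at R.
    A = G @ R.T - R @ G.T
    I = np.eye(R.shape[0])
    # Cayley retraction: (I + lr/2 * A)^{-1} (I - lr/2 * A) is orthogonal
    # whenever A is skew-symmetric, so orthogonality of R is preserved.
    return np.linalg.solve(I + (lr / 2) * A, I - (lr / 2) * A) @ R
```

After any number of such updates, `R.T @ R` remains the identity up to floating-point error, which is the property the learned-rotation approach relies on (an orthogonal rotation leaves the full-precision network function unchanged).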
Researcher Affiliation: Industry. Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, Tijmen Blankevoort. Correspondence to: Zechun Liu <EMAIL>.
Pseudocode: No. The paper describes mathematical formulations for the optimization process (Eq. 2 and Eq. 3) and textual descriptions of the rotation strategies, but it does not present any structured pseudocode or algorithm blocks.
Open Source Code: Yes. Code is available at github.com/facebookresearch/SpinQuant.
Open Datasets: Yes. Our evaluation of the proposed SpinQuant was carried out on eight zero-shot commonsense reasoning tasks: BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021), ARC-easy and ARC-challenge (Clark et al., 2018), and OBQA (Mihaylov et al., 2018). Additionally, we report the perplexity score on the WikiText2 test set (Merity et al., 2016).
Dataset Splits: Yes. We utilize 800 samples from WikiText-2 to optimize rotation for 100 iterations. It takes only 13 / 18 / 30 minutes for LLaMA-3 1B / 3B / 8B, respectively, and 25 / 30 minutes for LLaMA-2 7B / 13B, respectively. For LLaMA-2 70B it takes 3.5 hours, and for Mistral-7B it takes 16 minutes. After the rotation is learned, we apply GPTQ to the rotated weights (Frantar et al., 2022), adhering to the standard GPTQ settings by using 128 samples from WikiText2 with a sequence length of 2048 as the calibration set for GPTQ quantization.
Hardware Specification: Yes. We conduct an end-to-end speed measurement of the LLaMA-3 8B model with W16A16 and W4A8 configurations on a MacBook M1 Pro CPU (OS version 14.5). In light of the available Tensor Cores in NVIDIA's Hopper (H100) architecture, we provide a whole-network end-to-end speed test result for W-fp8-A-fp8 quantization on an H100 GPU, both with and without Hadamard transformations.
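Why Hadamard transformations matter for quantization can be shown with a small self-contained NumPy sketch (an illustration of the general principle, not the paper's Tensor Core kernel): an orthonormal Hadamard rotation spreads a single outlier channel's energy evenly across all channels, shrinking the dynamic range that the quantizer must cover.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via the Sylvester construction.

    n must be a power of two.
    """
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(H.shape[0])  # scale so that H @ H.T == I

# A channel vector with one large outlier.
x = np.zeros(8)
x[0] = 8.0

# After rotation the outlier's magnitude is spread across all channels,
# so max|y| = 8 / sqrt(8) is far smaller than max|x| = 8.
y = hadamard(8) @ x
```

Because the transform is orthogonal, it can be folded into the adjacent weight matrices without changing the full-precision network's output, which is what makes it attractive as a quantization pre-processing step.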
Software Dependencies: No. The paper mentions using "FP8 GEMM from the FBGEMM repo" and a "Tensor Core-based Hadamard transform kernel", with links to their respective repositories. However, it does not specify concrete version numbers for these or for other likely dependencies (e.g., PyTorch, CUDA), which a fully reproducible description would require.
Experiment Setup: Yes. The learning rate starts at 1.5 and linearly decays to 0. We utilize 800 samples from WikiText-2 to optimize rotation for 100 iterations. It takes only 13 / 18 / 30 minutes for LLaMA-3 1B / 3B / 8B, respectively, and 25 / 30 minutes for LLaMA-2 7B / 13B, respectively. In the main results, we optimize the rotation with respect to the activation-quantized network, where the weights remain 16-bit. After the rotation is learned, we apply GPTQ to the rotated weights (Frantar et al., 2022), adhering to the standard GPTQ settings by using 128 samples from WikiText2 with a sequence length of 2048 as the calibration set for GPTQ quantization.
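The calibration-set construction described above (128 samples of 2048 tokens from WikiText2) can be sketched in a few lines. This is a hedged illustration of the standard GPTQ-style slicing, not SpinQuant's actual data pipeline; the function name `make_calibration_set` and the use of a plain token-id list (rather than a tokenizer and the real dataset) are assumptions for the sake of a self-contained example.

```python
import random

def make_calibration_set(token_ids, n_samples=128, seq_len=2048, seed=0):
    """Sample fixed-length windows from a token stream for quantization calibration.

    token_ids : flat list of token ids (e.g., a tokenized WikiText2 train split)
    n_samples : number of calibration sequences (GPTQ default here: 128)
    seq_len   : tokens per sequence (here: 2048)
    """
    rng = random.Random(seed)  # fixed seed so the calibration set is reproducible
    # Non-overlapping window start positions covering the stream.
    starts = list(range(0, len(token_ids) - seq_len + 1, seq_len))
    chosen = rng.sample(starts, k=n_samples)
    return [token_ids[s:s + seq_len] for s in chosen]
```

In a real run the `token_ids` would come from tokenizing the WikiText2 training text, and each returned window would be fed through the rotated model to collect the per-layer statistics GPTQ needs.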