FlatQuant: Flatness Matters for LLM Quantization
Authors: Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, Xin Jiang, Wulong Liu, Jun Yao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that FLATQUANT establishes a new state-of-the-art benchmark for quantization. For example, it achieves less than 1% accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by 7.5%. Additionally, it provides up to 2.3x prefill speedup and 1.7x decoding speedup compared to the FP16 model. Code is available at: https://github.com/ruikangliu/FlatQuant. |
| Researcher Affiliation | Collaboration | Yuxuan Sun*1, Ruikang Liu*2, Haoli Bai1, Han Bao1, Kang Zhao1, Yuening Li3, Jiaxin Hu1, Xianzhi Yu1, Lu Hou1, Chun Yuan1, Xin Jiang1, Wulong Liu1, Jun Yao1. 1Huawei Noah's Ark Lab, 2Shenzhen International Graduate School, Tsinghua University, 3The Chinese University of Hong Kong. |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are provided. The methodology is described in prose, for example, in Section 3.3, 'Efficient Kernel Design'. |
| Open Source Code | Yes | Code is available at: https://github.com/ruikangliu/FlatQuant. |
| Open Datasets | Yes | We report the perplexity (PPL) of language generation tasks on the WikiText-2 (Merity et al., 2016) and C4 (Raffel et al., 2020) datasets. For commonsense reasoning tasks, we use six zero-shot evaluation tasks, including ARC-Challenge, ARC-Easy (Clark et al., 2018), HellaSwag (Zellers et al., 2019), LAMBADA (Paperno et al., 2016), PIQA (Bisk et al., 2020), and WinoGrande (Sakaguchi et al., 2021). |
| Dataset Splits | Yes | FLATQUANT is trained for 15 epochs on a calibration set comprising 128 sentences from WikiText-2, each sampled with 2048 tokens. The batch size is set to 4. The default calibration procedure costs approximately 26GB of GPU memory and about 0.9 hours for LLaMA-3-8B on a single GPU. |
| Hardware Specification | Yes | All experiments of inference latency below are conducted on an RTX 3090 GPU. |
| Software Dependencies | No | We implement FLATQUANT based on Huggingface (Wolf, 2019) and PyTorch (Paszke et al., 2019). We adopt the AdamW optimizer with an initial learning rate of 5e-3 and employ a cosine annealing learning rate decay schedule. The learning rate for clipping thresholds is 5e-2. FLATQUANT is trained for 15 epochs on a calibration set comprising 128 sentences from WikiText-2, each sampled with 2048 tokens. The batch size is set to 4. The default calibration procedure costs approximately 26GB of GPU memory and about 0.9 hours for LLaMA-3-8B on a single GPU. FLATQUANT is robust to initialization, and we employ random affine transformation matrices as the starting point. Further details about implementation and calibration time are provided in Appendix B.1. |
| Experiment Setup | Yes | We adopt the AdamW optimizer with an initial learning rate of 5e-3 and employ a cosine annealing learning rate decay schedule. The learning rate for clipping thresholds is 5e-2. FLATQUANT is trained for 15 epochs on a calibration set comprising 128 sentences from WikiText-2, each sampled with 2048 tokens. The batch size is set to 4. |
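The experiment-setup row above specifies the calibration hyperparameters but not their wiring. The following PyTorch sketch shows one plausible way to assemble them: two parameter groups (5e-3 for the learnable affine transforms, 5e-2 for clipping thresholds) under AdamW with cosine annealing over the full run. The parameter shapes and the placeholder loss are assumptions for illustration; the paper's actual objective is the layer-wise quantization error, which is not reproduced here.

```python
import torch

# Hypothetical stand-ins for FLATQUANT's learnable quantities:
# a randomly initialized affine transformation matrix (the paper notes
# random affine matrices as the starting point) and clipping thresholds.
transform = torch.nn.Parameter(torch.randn(8, 8))
clip_thresholds = torch.nn.Parameter(torch.ones(8))

EPOCHS = 15          # 15 calibration epochs
CALIB_SAMPLES = 128  # 128 sequences of 2048 tokens from WikiText-2
BATCH_SIZE = 4
steps_per_epoch = CALIB_SAMPLES // BATCH_SIZE

# Two parameter groups: 5e-3 for transforms, 5e-2 for clipping thresholds.
optimizer = torch.optim.AdamW([
    {"params": [transform], "lr": 5e-3},
    {"params": [clip_thresholds], "lr": 5e-2},
])
# Cosine annealing decays both learning rates over the whole calibration run.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=EPOCHS * steps_per_epoch)

for epoch in range(EPOCHS):
    for _ in range(steps_per_epoch):
        # Placeholder loss standing in for the calibration objective
        # (layer-wise quantization error on the calibration batch).
        loss = (transform @ torch.randn(8, BATCH_SIZE)).pow(2).mean() \
               + clip_thresholds.pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()
```

With `T_max` set to the total number of optimizer steps, both learning rates anneal from their initial values down to (near) zero by the end of calibration.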