FlatQuant: Flatness Matters for LLM Quantization
Authors: Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, Xin Jiang, Wulong Liu, Jun Yao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that FLATQUANT establishes a new state-of-the-art benchmark for quantization. For example, it achieves less than 1% accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by 7.5%. Additionally, it provides up to 2.3x prefill speedup and 1.7x decoding speedup compared to the FP16 model. Code is available at: https://github.com/ruikangliu/FlatQuant. |
| Researcher Affiliation | Collaboration | Yuxuan Sun*1, Ruikang Liu*2, Haoli Bai1, Han Bao1, Kang Zhao1, Yuening Li3, Jiaxin Hu1, Xianzhi Yu1, Lu Hou1, Chun Yuan1, Xin Jiang1, Wulong Liu1, Jun Yao1. 1Huawei Noah's Ark Lab, 2Shenzhen International Graduate School, Tsinghua University, 3The Chinese University of Hong Kong. |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are provided. The methodology is described in prose, for example, in Section 3.3, 'Efficient Kernel Design'. |
| Open Source Code | Yes | Code is available at: https://github.com/ruikangliu/FlatQuant. |
| Open Datasets | Yes | We report the perplexity (PPL) of language generation tasks on the WikiText-2 (Merity et al., 2016) and C4 (Raffel et al., 2020) datasets. For commonsense reasoning tasks, we use six zero-shot evaluation tasks, including ARC-Challenge, ARC-Easy (Clark et al., 2018), HellaSwag (Zellers et al., 2019), LAMBADA (Paperno et al., 2016), PIQA (Bisk et al., 2020), and WinoGrande (Sakaguchi et al., 2021). |
| Dataset Splits | Yes | FLATQUANT is trained for 15 epochs on a calibration set comprising 128 sentences from WikiText-2, each sampled with 2048 tokens. The batch size is set to 4. The default calibration procedure costs approximately 26GB of GPU memory and about 0.9 hours for LLaMA-3-8B on a single GPU. |
| Hardware Specification | Yes | All experiments of inference latency below are conducted on an RTX 3090 GPU. |
| Software Dependencies | No | We implement FLATQUANT based on Huggingface (Wolf, 2019) and PyTorch (Paszke et al., 2019). We adopt the AdamW optimizer with an initial learning rate of 5e-3 and employ a cosine annealing learning rate decay schedule. The learning rate for clipping thresholds is 5e-2. FLATQUANT is trained for 15 epochs on a calibration set comprising 128 sentences from WikiText-2, each sampled with 2048 tokens. The batch size is set to 4. The default calibration procedure costs approximately 26GB of GPU memory and about 0.9 hours for LLaMA-3-8B on a single GPU. FLATQUANT is robust to initialization, and we employ random affine transformation matrices as the starting point. Further details about implementation and calibration time are provided in Appendix B.1. |
| Experiment Setup | Yes | We adopt the AdamW optimizer with an initial learning rate of 5e-3 and employ a cosine annealing learning rate decay schedule. The learning rate for clipping thresholds is 5e-2. FLATQUANT is trained for 15 epochs on a calibration set comprising 128 sentences from WikiText-2, each sampled with 2048 tokens. The batch size is set to 4. |
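The experiment-setup row above specifies the calibration hyperparameters but not their wiring. The following PyTorch sketch shows one plausible way to assemble them: two parameter groups (5e-3 for the learnable affine transforms, 5e-2 for clipping thresholds) under AdamW with cosine annealing over the full run. The parameter shapes and the placeholder loss are assumptions for illustration; the paper's actual objective is the layer-wise quantization error, which is not reproduced here.

```python
import torch

# Hypothetical stand-ins for FLATQUANT's learnable quantities:
# a randomly initialized affine transformation matrix (the paper notes
# random affine matrices as the starting point) and clipping thresholds.
transform = torch.nn.Parameter(torch.randn(8, 8))
clip_thresholds = torch.nn.Parameter(torch.ones(8))

EPOCHS = 15          # 15 calibration epochs
CALIB_SAMPLES = 128  # 128 sequences of 2048 tokens from WikiText-2
BATCH_SIZE = 4
steps_per_epoch = CALIB_SAMPLES // BATCH_SIZE

# Two parameter groups: 5e-3 for transforms, 5e-2 for clipping thresholds.
optimizer = torch.optim.AdamW([
    {"params": [transform], "lr": 5e-3},
    {"params": [clip_thresholds], "lr": 5e-2},
])
# Cosine annealing decays both learning rates over the whole calibration run.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=EPOCHS * steps_per_epoch)

for epoch in range(EPOCHS):
    for _ in range(steps_per_epoch):
        # Placeholder loss standing in for the calibration objective
        # (layer-wise quantization error on the calibration batch).
        loss = (transform @ torch.randn(8, BATCH_SIZE)).pow(2).mean() \
               + clip_thresholds.pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()
```

With `T_max` set to the total number of optimizer steps, both learning rates anneal from their initial values down to (near) zero by the end of calibration.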