QERA: an Analytical Framework for Quantization Error Reconstruction
Authors: Cheng Zhang, Jeffrey T. H. Wong, Can Xiao, George Constantinides, Yiren Zhao
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we revisit this problem and formulate an analytical framework, named Quantization Error Reconstruction Analysis (QERA), and offer a closed-form solution to the problem. We show QERA benefits both existing low-precision fine-tuning and inference methods: QERA achieves a fine-tuned accuracy gain of Δacc = 6.05% for 2-bit RoBERTa-base on GLUE compared to LoftQ, and obtains Δacc = 2.97% higher post-training quantization accuracy for 4-bit Llama-3.1-70B compared to ZeroQuant-V2 and Δppl = 0.28 lower perplexity on WikiText2 compared to LQER. We empirically demonstrate the effectiveness of our solutions by applying them to state-of-the-art QPEFT and PTQ methods. Our analytical framework, QERA, significantly improves the performance of these methods. |
| Researcher Affiliation | Academia | Cheng Zhang, Jeffrey T. H. Wong, Can Xiao, George A. Constantinides & Yiren Zhao, Department of Electrical and Electronic Engineering, Imperial College London, London, UK |
| Pseudocode | Yes | A.1 ALGORITHMS IN RELATED WORK. Here we summarize the algorithm of LoftQ (Li et al., 2023) in Algorithm 1 and LQER (Zhang et al., 2024a) in Algorithm 2, respectively. LQ-LoRA (Guo et al., 2023) adopts a variant of Algorithm 1. ZeroQuant-V2 (Yao et al., 2023) can be considered as Algorithm 1 with one iteration, or a special case of Algorithm 2 where the scale matrix S is an identity matrix. Algorithm 1 (LoftQ). Require: pretrained weight W, target rank k, quantization function q(·), dequantization function dq(·), number of iterations T. 1: A_k ← 0, B_k ← 0; 2: for i = 1 to T do; 3: W_q ← q(W − A_k B_k) (update quantized weight matrix); 4: W̃ ← dq(W_q); 5: U, Σ, V^T ← SVD(W − W̃) (SVD-based rank-k approximation); 6: A_k ← U_{:,:k} √Σ_{:k,:k}, B_k ← √Σ_{:k,:k} V^T_{:k,:}; 7: end for. Algorithm 2 (LQER). Require: pretrained weight W, target rank k, quantization function q(·), dequantization function dq(·), calibration dataset X = {x_i ∈ ℝ^m \| i = 1, …, N}. 1: initialize vector s ← 0; 2: for sample x in X do (calibration); 3: s ← s + abs(x) (accumulate activation magnitude on each dimension); 4: end for; 5: S ← (1/N) diag(s) (construct a diagonal matrix S); 6: W_q ← q(W); 7: W̃ ← dq(W_q); 8: U, Σ, V^T ← SVD(S(W − W̃)) (SVD on the scaled weight error); 9: A_k ← S^{−1} U_{:,:k}, B_k ← Σ_{:k,:k} V^T_{:k,:} (rank-k approximation with un-scaling). |
| Open Source Code | Yes | We open-source our code and models at github.com/ChengZhang-98/QERA. |
| Open Datasets | Yes | We include both encoder-only model experiments (fine-tuning RoBERTa-base (Liu, 2019) on GLUE (Ye et al., 2019)) and decoder-only LLM experiments (fine-tuning LLaMA-2 (Touvron et al., 2023) and LLaMA-3.1 (Dubey et al., 2024) on the continued pretraining task SlimPajama (Soboleva et al., 2023) and the supervised fine-tuning task GSM8K (Cobbe et al., 2021)). We use lm-evaluation-harness to report results on WikiText2 (Merity et al., 2016), ARC (challenge) (Clark et al., 2018), BoolQ (Clark et al., 2019), CommonSenseQA (Talmor et al., 2019), Winogrande (Sakaguchi et al., 2019), MMLU (Hendrycks et al., 2021), and BigBench-Hard (Suzgun et al., 2022). |
| Dataset Splits | No | For SlimPajama, we fine-tune the model on a subset for 1000 steps with rank = 8, total batch size = 64, sequence length = 1024, learning rate = 3e-5. For GSM8K, we fine-tune the model for 10 epochs with rank = 64, total batch size = 128, sequence length = 384, and learning rate = 3e-5. For GLUE experiments, the total batch size is 64 and we train the models for 5 epochs. The learning rate ranges and batch sizes are listed in Appendix A.4.1. While standard benchmark datasets like GLUE and WikiText2 are used (which inherently have defined splits), the paper does not explicitly state the specific train/test/validation split ratios or sample counts used for these datasets within its text. |
| Hardware Specification | Yes | We perform fine-tuning experiments on four NVIDIA A100 80GB GPUs with AMD EPYC 64-Core Processor with 1024GB RAM. We perform PTQ experiments on eight NVIDIA A6000 48GB GPUs with AMD EPYC 256-Core Processor with 1024GB RAM. |
| Software Dependencies | No | We mainly use PyTorch, Transformers, PEFT, and Accelerate to implement QERA. We use SciPy's implementation of the blocked Schur algorithm (Deadman et al., 2012) to calculate the matrix square root, which runs on CPUs. The evaluation is performed with lm-evaluation-harness, Evaluate, and AlpacaEval 2.0 (Dubois et al., 2024). We use the Hugging Face Transformers implementation of HQQ, and reimplement ZeroQuant-V2 and LQER as baselines. The paper lists several software components used (PyTorch, Transformers, PEFT, Accelerate, SciPy, lm-evaluation-harness, Evaluate, AlpacaEval 2.0, Hugging Face Transformers) but does not specify version numbers for any of them. |
| Experiment Setup | Yes | For QPEFT experiments, we use Theorem 2, noted as QERA-approx, to initialize low-rank terms... For each method/baseline, we sweep the learning rate and record the best result. The final results are averaged over three random seeds. The learning rate ranges and batch sizes are listed in Appendix A.4.1. The total batch size is 64 for all GLUE experiments and we train the models for 5 epochs. For 4-bit experiments, we use 4-bit floating point from the QLoRA implementation in PEFT. For 3-bit experiments, we use emulated MXINT (Darvish Rouhani et al., 2023) with block size = 32, and for 2-bit experiments we use MXINT with block size = 16. Table 6 lists the learning rates for each experiment. For SlimPajama, we fine-tune the model on a subset for 1000 steps with rank = 8, total batch size = 64, sequence length = 1024, learning rate = 3e-5. For GSM8K, we fine-tune the model for 10 epochs with rank = 64, total batch size = 128, sequence length = 384, and learning rate = 3e-5. |
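The LoftQ procedure quoted in the Pseudocode row (Algorithm 1) alternates between quantizing the low-rank-corrected weight and refitting the rank-k error term via SVD. A minimal NumPy sketch follows; `quantize`/`dequantize` here are toy round-trip stand-ins for a real low-bit quantizer, not the paper's implementation.

```python
import numpy as np

def loftq_init(W, k, quantize, dequantize, T=1):
    """Sketch of Algorithm 1 (LoftQ): iteratively quantize W - A_k B_k,
    then refit A_k, B_k as the rank-k SVD of the remaining error."""
    A = np.zeros((W.shape[0], k))
    B = np.zeros((k, W.shape[1]))
    for _ in range(T):
        W_q = quantize(W - A @ B)           # quantize the residual-corrected weight
        W_hat = dequantize(W_q)
        U, S, Vt = np.linalg.svd(W - W_hat, full_matrices=False)
        root = np.sqrt(S[:k])
        A = U[:, :k] * root                 # A_k = U_{:,:k} sqrt(Sigma_{:k,:k})
        B = root[:, None] * Vt[:k, :]       # B_k = sqrt(Sigma_{:k,:k}) V^T_{:k,:}
    return W_hat, A, B

# Toy stand-in quantizer: coarse uniform rounding (not a real 2/3/4-bit format).
q = lambda w: np.round(w * 4)
dq = lambda w_q: w_q / 4
```

With `T = 1` this reduces to the ZeroQuant-V2 special case noted in the table: the rank-k term absorbs the dominant directions of the one-shot quantization error, so the corrected reconstruction error never exceeds the plain quantization error in Frobenius norm.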
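LQER (Algorithm 2 in the Pseudocode row) instead scales the quantization error by per-dimension mean activation magnitudes before the SVD, then un-scales the left factor. A compact sketch under the same assumptions (placeholder quantizer, small calibration matrix `X` whose rows are activation samples):

```python
import numpy as np

def lqer_init(W, k, quantize, dequantize, X):
    """Sketch of Algorithm 2 (LQER): SVD of the activation-scaled
    quantization error S(W - W_hat), with S = (1/N) diag(sum_i |x_i|)."""
    s = np.abs(X).sum(axis=0) / len(X)      # diagonal of S, one entry per input dim
    W_hat = dequantize(quantize(W))
    U, Sig, Vt = np.linalg.svd(np.diag(s) @ (W - W_hat), full_matrices=False)
    A = U[:, :k] / s[:, None]               # un-scale: A_k = S^{-1} U_{:,:k}
    B = Sig[:k, None] * Vt[:k, :]           # B_k = Sigma_{:k,:k} V^T_{:k,:}
    return W_hat, A, B
```

When k equals the full rank of the error, A_k B_k reconstructs W − W̃ exactly (a convenient sanity check); at small k, the diagonal scaling biases the retained singular directions toward input dimensions with large average activation magnitude.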