Accumulator-Aware Post-Training Quantization for Large Language Models

Authors: Ian Colbert, Giuseppe Franco, Fabian Grob, Jinjie Zhang, Rayan Saab

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate AXE using recent language generation models; when quantizing Llama3 8B for a 16-bit multi-stage accumulation datapath, AXE maintains up to 98% of the FP16 perplexity, surpassing naïve bit width manipulation by up to 15%." (Section 5: Experiments)
Researcher Affiliation | Collaboration | Ian Colbert (AMD); Giuseppe Franco (AMD); Fabian Grob (TUM); Jinjie Zhang (Amazon); Rayan Saab (University of California San Diego)
Pseudocode | Yes | Algorithm 1, Accumulator-Aware GPFQ: "Our accumulator-aware GPFQ variant quantizes W to M bits given input activations X and their N-bit quantized counterparts X̃." (page 5); Algorithm 2, Accumulator-Aware OPTQ: "Our accumulator-aware OPTQ variant quantizes W to M bits given H⁻¹ = Cholesky((2XXᵀ + ηI)⁻¹), where η is a small dampening factor to avoid numerical issues." (page 11)
Open Source Code | Yes | "Our open-source implementations are made available as part of the Brevitas quantization library v0.12.0 (Pappalardo et al., 2025)." https://github.com/Xilinx/brevitas/tree/v0.12.0
Open Datasets | Yes | "We conduct experiments on GPT2 (Radford et al., 2019), OPT (Zhang et al., 2022a), SmolLM2 (Allal et al., 2024), Pythia (Biderman et al., 2023), and Llama3 (Dubey et al., 2024) models using WikiText2 (Merity et al., 2016) for calibration."
Dataset Splits | No | The paper uses WikiText2 for calibration and perplexity evaluation, plus zero-shot accuracy on evaluation tasks, but it does not explicitly provide the training/validation/test splits (percentages or counts) needed to reproduce all evaluations beyond the calibration set. It states only: "We build our calibration set using 128 samples randomly selected from the WikiText2 dataset (Merity et al., 2016) without replacement using a fixed sequence length of 2048 tokens."
Hardware Specification | Yes | "All models are quantized via the Brevitas (Franco et al., 2025) quantization library using a single AMD MI210 GPU with 64 GB of memory."
Software Dependencies | Yes | "Our open-source implementations are made available as part of the Brevitas quantization library v0.12.0 (Pappalardo et al., 2025)."
Experiment Setup | Yes | "We build our calibration set using 128 samples randomly selected from the WikiText2 dataset (Merity et al., 2016) without replacement using a fixed sequence length of 2048 tokens for all models except GPT2 (Radford et al., 2019), which is restricted to a maximum sequence length of 1024 by the library. When inverting H in both OPTQ and GPFQ, we use the standard dampening factor of 1% of the average of its diagonal. When applying SmoothQuant, we perform a light grid search over its α parameter and find α = 0.4 to generally perform the best on average for Llama3, so we use this for all models."
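The OPTQ-style precomputation quoted above (H⁻¹ = Cholesky((2XXᵀ + ηI)⁻¹), with η set to 1% of the average diagonal of H) can be illustrated in a few lines of numpy. This is a minimal sketch under assumed toy dimensions, not the Brevitas implementation; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 64                       # toy sizes: weight-row dimension, calibration tokens
X = rng.standard_normal((d, n))    # calibration activations, shape (d, n)

H = 2.0 * X @ X.T                  # proxy Hessian, as in the quoted precomputation
eta = 0.01 * np.mean(np.diag(H))   # dampening: 1% of the average diagonal (per the setup)
H_inv = np.linalg.inv(H + eta * np.eye(d))
L = np.linalg.cholesky(H_inv)      # lower-triangular factor the quantization pass walks
```

The dampening term ηI keeps H well-conditioned before inversion, which is the "numerical issues" Algorithm 2's caption refers to.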
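The calibration-set construction described in the setup (128 samples drawn without replacement, fixed sequence length of 2048 tokens) can be sketched as sampling fixed-length windows from a tokenized stream. The function name and the synthetic token stream below are illustrative assumptions, not the paper's code; in practice the stream would come from tokenized WikiText2.

```python
import numpy as np

def build_calibration_set(token_stream, n_samples=128, seq_len=2048, seed=0):
    """Sample fixed-length windows without replacement from a 1-D token-id stream."""
    rng = np.random.default_rng(seed)
    n_windows = len(token_stream) - seq_len + 1
    starts = rng.choice(n_windows, size=n_samples, replace=False)
    return np.stack([token_stream[s:s + seq_len] for s in starts])

tokens = np.arange(1_000_000)          # stand-in for a tokenized corpus
calib = build_calibration_set(tokens)  # (128, 2048) array of token ids
```

For GPT2 the paper caps `seq_len` at 1024 instead, per the library's maximum context length.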