any4: Learned 4-bit Numeric Representation for LLMs

Authors: Mostafa Elhoushi, Jeff Johnson

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "any4 yields higher accuracy than other related 4-bit numeric representation types (int4, fp4, and nf4), as evaluated on a range of model sizes, generations, and families (Llama 2, Llama 3, Mistral, and Mixtral)."
Researcher Affiliation | Industry | "FAIR at Meta. Correspondence to: Mostafa Elhoushi <EMAIL>, Jeff Johnson <EMAIL>."
Pseudocode | Yes | "We summarize our any4 quantization algorithm in Alg. 1."

    Algorithm 1: any4 quantization algorithm.
    module2input = calibrate(model, sample_data)
    for module in model:
        w = module.weight()
        w_Q = torch.zeros_like(w)
        alpha = []
        beta = []
        for i in range(w.shape[0]):
            w_S_i, alpha_i, beta_i = scale(w[i, :])
            x_i = module2input[module][i]
            w_Q[i, :] = kmeans(
                samples=w_S_i,
                sample_weight=alpha_i * abs(x_i.mean()),
            )
            alpha.append(alpha_i)
            beta.append(beta_i)
        module.weight.data = w_Q
        module.alpha = alpha
        module.beta = beta
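The per-row step of Alg. 1 can be sketched in runnable form. This is a minimal NumPy sketch, not the paper's implementation: `scale_row`, `weighted_kmeans_1d`, and the per-element weighting `alpha * |x|` (instead of the scalar `alpha_i * abs(x_i.mean())` in Alg. 1) are illustrative assumptions.

```python
import numpy as np

def scale_row(w_row):
    # Asymmetric scaling (assumed form): beta is the offset, alpha the
    # scale mapping the row onto roughly [0, 15] (16 levels for 4 bits).
    beta = w_row.min()
    alpha = (w_row.max() - beta) / 15.0
    return (w_row - beta) / alpha, alpha, beta

def weighted_kmeans_1d(samples, sample_weight, k=16, iters=25, seed=0):
    # Minimal weighted 1-D k-means: the k=16 centroids become the learned
    # any4 codebook for this row; the assignments are the 4-bit codes.
    rng = np.random.default_rng(seed)
    centroids = rng.choice(samples, size=k, replace=False)
    for _ in range(iters):
        codes = np.abs(samples[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            mask = codes == j
            if mask.any():
                centroids[j] = np.average(samples[mask],
                                          weights=sample_weight[mask])
    return centroids, codes

def any4_quantize_row(w_row, x_mean):
    # x_mean: mean calibration activation per input channel (assumption:
    # each weight is weighted by its channel's activation magnitude).
    w_scaled, alpha, beta = scale_row(w_row)
    sample_weight = alpha * (np.abs(x_mean) + 1e-8)
    centroids, codes = weighted_kmeans_1d(w_scaled, sample_weight)
    return codes.astype(np.uint8), centroids, alpha, beta
```

A row is then dequantized as `centroids[codes] * alpha + beta`.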
Open Source Code | Yes | "We also open source tinygemm, a latency-optimized GPU matrix multiplication library for LLMs, which implements any4 using a GPU-efficient lookup-table strategy along with other common quantization methods. We open source our code at https://github.com/facebookresearch/any4."
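tinygemm's GPU kernels are not reproduced here, but the lookup-table idea can be illustrated on the CPU: each row stores a 16-entry table of learned values, and every 4-bit code indexes that table at dequantization time. The function names and the low-nibble-first packing layout below are assumptions for illustration, not tinygemm's actual layout.

```python
import numpy as np

def pack_any4(codes):
    # Pack two 4-bit codes per byte, low nibble first (layout is an
    # assumption; the real GPU layout is kernel-specific).
    return (codes[0::2] | (codes[1::2] << 4)).astype(np.uint8)

def dequant_any4(packed, lut, alpha, beta):
    # Dequantize: each 4-bit code indexes the row's 16-entry learned
    # table, then the row's affine scaling is undone.
    codes = np.empty(packed.size * 2, dtype=np.uint8)
    codes[0::2] = packed & 0x0F
    codes[1::2] = packed >> 4
    return lut[codes] * alpha + beta
```

Because the table has only 16 entries per row, it fits easily in GPU registers or shared memory, which is what makes the lookup strategy matmul-friendly.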
Open Datasets | Yes | "For perplexity, we ported the GPTQ implementation for WikiText-2 (Merity et al., 2017), C4 (Raffel et al., 2019), and Penn Treebank (Marcus et al., 1993) that is used by the codebases of other quantization papers. To add a coding domain, we added perplexity on Code Parrot (Code Parrot)."
Dataset Splits | Yes | "Our evaluation sequence length is 2048 (following Lin et al., 2024; Frantar et al., 2023); calibration uses the training split of each dataset, and evaluation uses the validation or test split."
Hardware Specification | Yes | "We benchmark matrix multiplication of vector activations and square weight tensors from 1K to 16K on an A100 80GB GPU using PyTorch 2.3.0."
Software Dependencies | Yes | "We benchmark matrix multiplication of vector activations and square weight tensors from 1K to 16K on an A100 80GB GPU using PyTorch 2.3.0."
Experiment Setup | Yes | "We use group-wise scaling with group size 128 and asymmetric scaling for all models, except Llama 3 70B, where we found symmetric scaling leads to better results. Our evaluation sequence length is 2048."
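The group-wise asymmetric scaling mentioned above can be sketched as follows; this is a minimal NumPy sketch under the stated setup (group size 128, 4-bit range of 16 levels), with hypothetical function and variable names, not the paper's code.

```python
import numpy as np

def groupwise_scale(w_row, group_size=128):
    # Group-wise asymmetric scaling: every group of 128 weights gets its
    # own offset (beta) and scale (alpha), so an outlier in one group
    # does not inflate the quantization range of its neighbors.
    g = w_row.reshape(-1, group_size)
    beta = g.min(axis=1, keepdims=True)
    alpha = (g.max(axis=1, keepdims=True) - beta) / 15.0
    return (g - beta) / alpha, alpha, beta
```

Symmetric scaling (as used for Llama 3 70B) would instead drop `beta` and scale by the per-group maximum absolute value.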