any4: Learned 4-bit Numeric Representation for LLMs
Authors: Mostafa Elhoushi, Jeff Johnson
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | any4 yields higher accuracy compared to other related 4-bit numeric representation types: int4, fp4 and nf4, as evaluated on a range of model sizes, generations and families (Llama 2, Llama 3, Mistral and Mixtral). |
| Researcher Affiliation | Industry | FAIR at Meta. Correspondence to: Mostafa Elhoushi <EMAIL>, Jeff Johnson <EMAIL>. |
| Pseudocode | Yes | We summarize our any4 quantization algorithm in Alg. 1. Algorithm 1: any4 quantization algorithm. module2input = calibrate(model, sample_data); for module in model: w = module.weight(); w_Q = torch.zeros_like(w); alpha = []; beta = []; for i in range(w.shape[0]): w_S_i, alpha_i, beta_i = scale(w[i,:]); x_i = module2input[module][i]; w_Q[i, :] = kmeans(samples=w_S_i, sample_weight=alpha_i*abs(x_i.mean())); alpha.append(alpha_i); beta.append(beta_i); module.weight.data = w_Q; module.alpha = alpha; module.beta = beta. A runnable sketch of this loop follows the table. |
| Open Source Code | Yes | We also open source tinygemm, a latency optimized GPU matrix multiplication library for LLMs, that implements any4 using a GPU-efficient lookup table strategy along with other common quantization methods. We open source our code at https://github.com/facebookresearch/any4. |
| Open Datasets | Yes | For perplexity, we ported the implementation of GPTQ for WikiText-2 (Merity et al., 2017), C4 (Raffel et al., 2019), and Penn Treebank (Marcus et al., 1993) that is used by codebases of other quantization papers. To add a coding domain, we added perplexity on Code Parrot (Code Parrot). |
| Dataset Splits | Yes | Our evaluation sequence length is 2048 (following (Lin et al., 2024; Frantar et al., 2023)), calibration is on the training split of each dataset, and evaluation is on the validation or test split. |
| Hardware Specification | Yes | We benchmark matrix multiplication of vector activation and square weight tensors from 1K to 16K on an A100 80GB GPU using PyTorch 2.3.0. |
| Software Dependencies | Yes | We benchmark matrix multiplication of vector activation and square weight tensors from 1K to 16K on an A100 80GB GPU using PyTorch 2.3.0. |
| Experiment Setup | Yes | We use group-wise scaling with group size 128, and asymmetric scaling for all models, except for Llama3 70B where we found symmetric scaling leads to better results. Our evaluation sequence length is 2048. |
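
The Pseudocode row above condenses the paper's Algorithm 1 into a single line; below is a minimal runnable sketch of that per-row quantization loop. The calibration-activation handling, the asymmetric per-row scaling helper, and the use of scikit-learn's `KMeans` for the weighted clustering step are assumptions made for illustration; the authors' actual implementation lives in the linked any4 repository.

```python
# Minimal sketch of the any4 per-row quantization loop (Algorithm 1).
# Assumption: `activation` holds calibration inputs of shape (num_tokens, in_features)
# for the module being quantized; scikit-learn's KMeans stands in for the weighted
# k-means step used in the paper.
import torch
from sklearn.cluster import KMeans

def scale_row(w_row):
    # Asymmetric per-row scaling onto [0, 15]; the paper's exact scaling may differ.
    beta = w_row.min()
    alpha = (w_row.max() - w_row.min()).clamp(min=1e-8) / 15.0
    return (w_row - beta) / alpha, alpha, beta

def any4_quantize_weight(weight, activation, n_values=16):
    w_q = torch.zeros_like(weight)
    alphas, betas = [], []
    # Mean absolute calibration activation per input channel, used as sample weights.
    act_weight = activation.abs().mean(dim=0).clamp(min=1e-8)
    for i in range(weight.shape[0]):
        w_s, alpha, beta = scale_row(weight[i, :])
        km = KMeans(n_clusters=n_values, n_init=4, random_state=0)
        labels = km.fit_predict(
            w_s.reshape(-1, 1).float().numpy(),
            sample_weight=(alpha * act_weight).float().numpy(),
        )
        # Store the scaled centroid values; on disk this would be 4-bit indices
        # plus a 16-entry lookup table per row.
        w_q[i, :] = torch.from_numpy(km.cluster_centers_[labels, 0]).to(weight.dtype)
        alphas.append(alpha)
        betas.append(beta)
    return w_q, torch.stack(alphas), torch.stack(betas)
```
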
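For the "GPU-efficient lookup table strategy" mentioned in the Open Source Code row, the sketch below shows only the dequantization idea: each weight is stored as a 4-bit index into a per-row table of 16 learned values. The tensor layout and function name here are illustrative assumptions, not tinygemm's actual kernel interface.

```python
import torch

def dequantize_any4(codes, luts, alpha, beta):
    # codes: (rows, cols) integer tensor of 4-bit indices
    # luts:  (rows, 16) learned values per row (in the scaled domain)
    # alpha, beta: (rows,) per-row scale and offset from calibration
    # Gather each element's value from its row's 16-entry table, then undo the row scaling.
    scaled = torch.gather(luts, 1, codes.long())
    return scaled * alpha.unsqueeze(1) + beta.unsqueeze(1)

# Toy example: an 8x16 weight matrix reconstructed from 4-bit codes.
codes = torch.randint(0, 16, (8, 16))
luts = torch.randn(8, 16)
alpha, beta = torch.rand(8), torch.randn(8)
w = dequantize_any4(codes, luts, alpha, beta)
```
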
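The Experiment Setup row mentions group-wise scaling with group size 128 and asymmetric scaling. The sketch below illustrates that scaling scheme for a plain int4 baseline (any4 replaces the fixed 0..15 grid with the learned per-row values); the function and rounding details are illustrative rather than the paper's exact code.

```python
import torch

def groupwise_asymmetric_int4(w, group_size=128):
    # Quantize each contiguous group of `group_size` weights in a row to 4 bits,
    # with its own scale and zero-point (asymmetric scaling).
    # Assumes the number of columns is a multiple of group_size.
    rows, cols = w.shape
    g = w.reshape(rows, cols // group_size, group_size)
    w_min = g.min(dim=-1, keepdim=True).values
    w_max = g.max(dim=-1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0   # 16 levels: 0..15
    q = ((g - w_min) / scale).round().clamp(0, 15)   # 4-bit codes
    deq = q * scale + w_min                          # dequantized approximation
    return q.reshape(rows, cols), deq.reshape(rows, cols)
```
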
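The Open Datasets and Dataset Splits rows describe GPTQ-style perplexity evaluation at sequence length 2048 on the validation or test split of each dataset. A minimal sketch of such an evaluation loop is below, assuming a Hugging Face-style causal language model that returns a cross-entropy loss when given labels; dataset loading and tokenization are omitted.

```python
import torch

@torch.no_grad()
def perplexity(model, token_ids, seq_len=2048, device="cuda"):
    # token_ids: 1-D LongTensor holding the tokenized evaluation split.
    model.eval()
    nlls = []
    n_chunks = token_ids.numel() // seq_len
    for i in range(n_chunks):
        chunk = token_ids[i * seq_len:(i + 1) * seq_len].unsqueeze(0).to(device)
        out = model(input_ids=chunk, labels=chunk)   # causal LM loss over the chunk
        nlls.append(out.loss.float() * seq_len)      # approximate total NLL for the chunk
    return torch.exp(torch.stack(nlls).sum() / (n_chunks * seq_len))
```
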