any4: Learned 4-bit Numeric Representation for LLMs

Authors: Mostafa Elhoushi, Jeff Johnson

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "any4 yields higher accuracy than other related 4-bit numeric representation types (int4, fp4, and nf4), as evaluated on a range of model sizes, generations, and families (Llama 2, Llama 3, Mistral, and Mixtral)."
Researcher Affiliation | Industry | "FAIR at Meta. Correspondence to: Mostafa Elhoushi <EMAIL>, Jeff Johnson <EMAIL>."
Pseudocode | Yes | "We summarize our any4 quantization algorithm in Alg. 1."

    Algorithm 1: any4 quantization algorithm.
    module2input = calibrate(model, sample_data)
    for module in model:
        w = module.weight()
        w_Q = torch.zeros_like(w)
        alpha = []
        beta = []
        for i in range(w.shape[0]):
            w_S_i, alpha_i, beta_i = scale(w[i, :])
            x_i = module2input[module][i]
            w_Q[i, :] = kmeans(
                samples=w_S_i,
                sample_weight=alpha_i * abs(x_i.mean()),
            )
            alpha.append(alpha_i)
            beta.append(beta_i)
        module.weight.data = w_Q
        module.alpha = alpha
        module.beta = beta
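The per-row step of Alg. 1 can be sketched in runnable form. This is a minimal NumPy sketch, not the paper's implementation: `scale_row`, `weighted_kmeans_1d`, and the per-element weighting `alpha * |x|` (instead of the scalar `alpha_i * abs(x_i.mean())` in Alg. 1) are illustrative assumptions.

```python
import numpy as np

def scale_row(w_row):
    # Asymmetric scaling (assumed form): beta is the offset, alpha the
    # scale mapping the row onto roughly [0, 15] (16 levels for 4 bits).
    beta = w_row.min()
    alpha = (w_row.max() - beta) / 15.0
    return (w_row - beta) / alpha, alpha, beta

def weighted_kmeans_1d(samples, sample_weight, k=16, iters=25, seed=0):
    # Minimal weighted 1-D k-means: the k=16 centroids become the learned
    # any4 codebook for this row; the assignments are the 4-bit codes.
    rng = np.random.default_rng(seed)
    centroids = rng.choice(samples, size=k, replace=False)
    for _ in range(iters):
        codes = np.abs(samples[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            mask = codes == j
            if mask.any():
                centroids[j] = np.average(samples[mask],
                                          weights=sample_weight[mask])
    return centroids, codes

def any4_quantize_row(w_row, x_mean):
    # x_mean: mean calibration activation per input channel (assumption:
    # each weight is weighted by its channel's activation magnitude).
    w_scaled, alpha, beta = scale_row(w_row)
    sample_weight = alpha * (np.abs(x_mean) + 1e-8)
    centroids, codes = weighted_kmeans_1d(w_scaled, sample_weight)
    return codes.astype(np.uint8), centroids, alpha, beta
```

A row is then dequantized as `centroids[codes] * alpha + beta`.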
Open Source Code | Yes | "We also open source tinygemm, a latency-optimized GPU matrix multiplication library for LLMs, which implements any4 using a GPU-efficient lookup-table strategy along with other common quantization methods. We open source our code at https://github.com/facebookresearch/any4."
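tinygemm's GPU kernels are not reproduced here, but the lookup-table idea can be illustrated on the CPU: each row stores a 16-entry table of learned values, and every 4-bit code indexes that table at dequantization time. The function names and the low-nibble-first packing layout below are assumptions for illustration, not tinygemm's actual layout.

```python
import numpy as np

def pack_any4(codes):
    # Pack two 4-bit codes per byte, low nibble first (layout is an
    # assumption; the real GPU layout is kernel-specific).
    return (codes[0::2] | (codes[1::2] << 4)).astype(np.uint8)

def dequant_any4(packed, lut, alpha, beta):
    # Dequantize: each 4-bit code indexes the row's 16-entry learned
    # table, then the row's affine scaling is undone.
    codes = np.empty(packed.size * 2, dtype=np.uint8)
    codes[0::2] = packed & 0x0F
    codes[1::2] = packed >> 4
    return lut[codes] * alpha + beta
```

Because the table has only 16 entries per row, it fits easily in GPU registers or shared memory, which is what makes the lookup strategy matmul-friendly.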
Open Datasets | Yes | "For perplexity, we ported the GPTQ implementation for WikiText-2 (Merity et al., 2017), C4 (Raffel et al., 2019), and Penn Treebank (Marcus et al., 1993) that is used by the codebases of other quantization papers. To add a coding domain, we added perplexity on Code Parrot (Code Parrot)."
Dataset Splits | Yes | "Our evaluation sequence length is 2048 (following Lin et al., 2024; Frantar et al., 2023); calibration uses the training split of each dataset, and evaluation uses the validation or test split."
Hardware Specification | Yes | "We benchmark matrix multiplication of vector activations and square weight tensors from 1K to 16K on an A100 80GB GPU using PyTorch 2.3.0."
Software Dependencies | Yes | "We benchmark matrix multiplication of vector activations and square weight tensors from 1K to 16K on an A100 80GB GPU using PyTorch 2.3.0."
Experiment Setup | Yes | "We use group-wise scaling with group size 128 and asymmetric scaling for all models, except Llama 3 70B, where we found symmetric scaling leads to better results. Our evaluation sequence length is 2048."
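The group-wise asymmetric scaling mentioned above can be sketched as follows; this is a minimal NumPy sketch under the stated setup (group size 128, 4-bit range of 16 levels), with hypothetical function and variable names, not the paper's code.

```python
import numpy as np

def groupwise_scale(w_row, group_size=128):
    # Group-wise asymmetric scaling: every group of 128 weights gets its
    # own offset (beta) and scale (alpha), so an outlier in one group
    # does not inflate the quantization range of its neighbors.
    g = w_row.reshape(-1, group_size)
    beta = g.min(axis=1, keepdims=True)
    alpha = (g.max(axis=1, keepdims=True) - beta) / 15.0
    return (g - beta) / alpha, alpha, beta
```

Symmetric scaling (as used for Llama 3 70B) would instead drop `beta` and scale by the per-group maximum absolute value.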