Radio: Rate–Distortion Optimization for Large Language Model Compression

Authors: Sean I. Young

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To study the rate–distortion behavior of a typical quantized LLM, we apply Algorithm 1 to the quantization of the Meta Open Pretrained Transformer (OPT) (S. Zhang et al., 2022) and Llama-2 (Touvron et al., 2023) families of language models (obtained from the Hugging Face Hub), comparing the performance of the proposed quantization method with baselines on next-token prediction and question-answering tasks.
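The rate–distortion behavior studied in the paper (distortion falling as the bit rate rises) can be illustrated with a minimal uniform-quantization sketch. This toy example is ours, not the paper's Algorithm 1: it quantizes a list of synthetic weights at two bit depths and compares the mean-squared distortion.

```python
def quantize_uniform(weights, bits):
    """Uniformly quantize weights to 2**bits levels over their range."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (2 ** bits - 1)
    return [round((w - lo) / scale) * scale + lo for w in weights]

def mse(a, b):
    """Mean-squared distortion between original and quantized weights."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# synthetic stand-in for one row of a weight matrix
weights = [0.013 * i - 0.5 for i in range(80)]

d2 = mse(weights, quantize_uniform(weights, 2))  # 2-bit rate
d4 = mse(weights, quantize_uniform(weights, 4))  # 4-bit rate
# higher rate -> lower distortion, the basic rate-distortion trade-off
assert d4 < d2
```

The same monotone trade-off is what a rate–distortion optimizer exploits: it spends bits where they reduce distortion the most.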
Researcher Affiliation | Academia | 1 Martinos Center, Harvard Medical School, Boston, MA, USA. 2 Computer Science and Artificial Intelligence Lab (CSAIL), MIT, Cambridge, MA, USA. Correspondence to: Sean I. Young <EMAIL>.
Pseudocode | Yes | Algorithm 1. Radio: Rate–Distortion Optimization for LLM Compression
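Algorithm 1 itself is only named here. As a hedged illustration of what a rate–distortion bit allocation can look like, the classical closed-form rule assigns each weight group a bit depth that grows with its variance while keeping the average at the budget. This sketch uses the textbook formula and our own names; it is not the paper's algorithm.

```python
import math

def allocate_bits(variances, avg_bits):
    """Classical R-D allocation: b_i = b_avg + 0.5 * log2(var_i / geometric_mean)."""
    geo_mean = math.exp(sum(math.log(v) for v in variances) / len(variances))
    return [avg_bits + 0.5 * math.log2(v / geo_mean) for v in variances]

# four weight groups with increasing variance, 4-bit average budget
bits = allocate_bits([0.5, 1.0, 2.0, 4.0], avg_bits=4.0)

# higher-variance groups receive more bits...
assert bits == sorted(bits)
# ...while the mean allocation stays at the 4-bit budget
assert abs(sum(bits) / len(bits) - 4.0) < 1e-9
```

In practice, allocations are further clamped to non-negative integers; the continuous formula above is only the starting point.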
Open Source Code | Yes | To ensure the reproducibility of results in this work, we make our PyTorch Radio program available on our GitHub project website, where readers can also ask questions about this work.
Open Datasets | Yes | For calibration data, we source 128 examples from the training split of the C4 dataset (Raffel et al., 2020). We test on the test splits of WikiText2 (Merity et al., 2022) and C4 for next-token prediction and those of GSM8K (Cobbe et al., 2021), ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2019), and WinoGrande (Sakaguchi et al., 2021) for question-answering tasks.
Dataset Splits | Yes | For calibration data, we source 128 examples from the training split of the C4 dataset (Raffel et al., 2020). We test on the test splits of WikiText2 (Merity et al., 2022) and C4 for next-token prediction and those of GSM8K (Cobbe et al., 2021), ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2019), and WinoGrande (Sakaguchi et al., 2021) for question-answering tasks.
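The calibration protocol (128 examples, with a handful of tokens drawn from each 2048-token sequence, per the Experiment Setup row) can be mimicked on synthetic data. Dataset loading is elided; the sequence lengths, counts, and helper names below are a sketch under our own assumptions, not the paper's loader.

```python
import random

SEQ_LEN, NUM_EXAMPLES, TOKENS_PER_SEQ = 2048, 128, 17

random.seed(0)
# stand-in for 128 tokenized training examples of length 2048
examples = [[random.randrange(32000) for _ in range(SEQ_LEN)]
            for _ in range(NUM_EXAMPLES)]

def sample_calibration_tokens(seq, k, rng):
    """Draw k token positions without replacement, in order, from one sequence."""
    return [seq[i] for i in sorted(rng.sample(range(len(seq)), k))]

rng = random.Random(1)
calib = [sample_calibration_tokens(seq, TOKENS_PER_SEQ, rng) for seq in examples]

assert len(calib) == 128
assert all(len(c) == 17 for c in calib)
```

Subsampling tokens per sequence keeps the calibration set small while still exposing the quantizer to varied contexts.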
Hardware Specification | Yes | In terms of execution time, Radio (for 64 iterations) and OWQ/GPTQ require 47 minutes and 18 minutes, respectively (excluding testing), to quantize the 7B model on an Nvidia A100. [...] our custom CUDA kernel leads to a 3.8x speedup over the FP16 matrix–vector multiply performed using the default cuBLAS matmul on an Nvidia A6000.
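Speedup figures like the quoted 3.8x are typically obtained with a best-of-N wall-clock harness. The sketch below shows only the measurement pattern, comparing two stand-in Python matrix–vector routines rather than the paper's CUDA kernel against cuBLAS; all names are ours.

```python
import time

def bench(fn, *args, reps=5):
    """Return the best wall-clock time of fn(*args) over reps runs."""
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

def matvec_baseline(m, v):
    """Index-based matrix-vector multiply (the 'reference' implementation)."""
    return [sum(row[i] * v[i] for i in range(len(v))) for row in m]

def matvec_candidate(m, v):
    """zip-based matrix-vector multiply (the 'optimized' implementation)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

n = 200
m = [[float(i + j) for j in range(n)] for i in range(n)]
v = [1.0] * n

# report baseline time / candidate time, as in the quoted 3.8x figure
speedup = bench(matvec_baseline, m, v) / bench(matvec_candidate, m, v)
assert speedup > 0
```

Taking the best of several runs, as here, reduces noise from scheduler jitter; GPU benchmarks additionally need device synchronization before each timestamp.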
Software Dependencies | No | To ensure the reproducibility of results in this work, we make our PyTorch Radio program available on our GitHub project website, where readers can also ask questions about this work. Appendix A lists our CUDA kernel. Appendices B and C provide derivations for our main theoretical results and Appendix D additionally details the PyTorch code and command-line options used to obtain the results of GPTQ (Frantar et al., 2022), OWQ (Lee et al., 2024), and AWQ (Lin et al., 2024). The paper mentions "PyTorch" and "CUDA kernel" but does not specify their version numbers or the versions of other software dependencies.
Experiment Setup | Yes | We use a combined row–column group size of 512 for OPT (768 for 125M, 66B) and 256 for Llama-2 models, a batch size of 16, and 17 tokens from each token sequence of length 2048, and optimize for a maximum of 64 iterations. The optimal hyperparameter values are batch size: 16, token count: 17, and group size: 512.
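The group-size setting can be made concrete with a small partitioning sketch: flattening a weight matrix and chunking it into fixed-size quantization groups. This simplifies the paper's combined row–column grouping to plain contiguous chunks; the function name and layout are our own.

```python
def partition_into_groups(rows, cols, group_size):
    """Split a rows x cols weight matrix (flattened) into fixed-size index groups."""
    total = rows * cols
    assert total % group_size == 0, "matrix size must divide evenly into groups"
    return [range(i, i + group_size) for i in range(0, total, group_size)]

# e.g. a 768 x 768 projection matrix with the OPT group size of 512
groups = partition_into_groups(768, 768, 512)

assert len(groups) == 768 * 768 // 512   # 1152 groups
assert all(len(g) == 512 for g in groups)
```

Each group then receives its own quantization parameters (and, under rate–distortion optimization, its own bit allocation), which is why the group size trades accuracy against metadata overhead.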