Radio: Rate–Distortion Optimization for Large Language Model Compression
Authors: Sean I. Young
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To study the rate–distortion behavior of a typical quantized LLM, we apply Algorithm 1 to the quantization of the Meta Open Pretrained Transformer (OPT) (S. Zhang et al., 2022) and Llama-2 (Touvron et al., 2023) families of language models (obtained from the Hugging Face Hub), comparing the performance of the proposed quantization method with baselines on next-token prediction and question-answering tasks. |
| Researcher Affiliation | Academia | 1 Martinos Center, Harvard Medical School, Boston, MA, USA. 2 Computer Science and Artificial Intelligence Lab (CSAIL), MIT, Cambridge, MA, USA. Correspondence to: Sean I. Young <EMAIL>. |
| Pseudocode | Yes | Algorithm 1. Radio: Rate Distortion Optimization for LLM Compression |
| Open Source Code | Yes | To ensure the reproducibility of results in this work, we make our PyTorch Radio program available on our GitHub project website, where readers can also ask questions about this work. |
| Open Datasets | Yes | For calibration data, we source 128 examples from the training split of the C4 dataset (Raffel et al., 2020). We test on the test splits of WikiText2 (Merity et al., 2016) and C4 for next-token prediction and those of GSM8K (Cobbe et al., 2021), ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2019), and WinoGrande (Sakaguchi et al., 2021) for question-answering tasks. |
| Dataset Splits | Yes | For calibration data, we source 128 examples from the training split of the C4 dataset (Raffel et al., 2020). We test on the test splits of WikiText2 (Merity et al., 2016) and C4 for next-token prediction and those of GSM8K (Cobbe et al., 2021), ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2019), and WinoGrande (Sakaguchi et al., 2021) for question-answering tasks. |
| Hardware Specification | Yes | In terms of execution time, Radio (for 64 iterations) and OWQ/GPTQ require 47 minutes and 18 minutes, respectively (excluding testing), to quantize the 7B model on an Nvidia A100. [...] our custom CUDA kernel leads to a 3.8× speedup over the FP16 matrix–vector multiply performed using the default cuBLAS matmul on an Nvidia A6000. |
| Software Dependencies | No | To ensure the reproducibility of results in this work, we make our PyTorch Radio program available on our GitHub project website, where readers can also ask questions about this work. Appendix A lists our CUDA kernel. Appendices B and C provide derivations for our main theoretical results, and Appendix D additionally details the PyTorch code and command-line options used to obtain the results of GPTQ (Frantar et al., 2022), OWQ (Lee et al., 2024), and AWQ (Lin et al., 2024). The paper mentions PyTorch and a CUDA kernel but does not specify their version numbers or the versions of other software dependencies. |
| Experiment Setup | Yes | We use a combined row–column group size of 512 for OPT (768 for the 125M and 66B models) and 256 for Llama-2 models, a batch size of 16, and 17 tokens from each token sequence of length 2048, and optimize for a maximum of 64 iterations. The optimal hyperparameter values are a batch size of 16, a token count of 17, and a group size of 512. |
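The hyperparameters quoted in the Experiment Setup row can be collected into a minimal configuration sketch. This is purely illustrative: the names below are assumptions and do not come from the authors' released code; only the numeric values are taken from the paper's quoted setup. The helper also works out the implied calibration token budget (128 C4 examples, 17 sampled tokens each).

```python
# Hypothetical configuration sketch for the Radio quantization setup.
# All identifiers are illustrative; the numeric values are the ones
# quoted in the reproducibility table above.
RADIO_CONFIG = {
    "group_size": {              # combined row-column group size
        "opt_default": 512,      # most OPT models
        "opt_125m_66b": 768,     # OPT-125M and OPT-66B
        "llama2": 256,           # Llama-2 models
    },
    "batch_size": 16,
    "tokens_per_sequence": 17,   # tokens sampled from each sequence
    "sequence_length": 2048,     # length of each calibration sequence
    "max_iterations": 64,        # optimization iteration cap
    "num_calibration_examples": 128,  # sourced from the C4 training split
}

def calibration_token_budget(cfg):
    """Tokens actually used for calibration under these settings:
    one 17-token sample from each of the 128 C4 sequences."""
    return cfg["num_calibration_examples"] * cfg["tokens_per_sequence"]

print(calibration_token_budget(RADIO_CONFIG))  # 128 * 17 = 2176
```

Under these assumed settings, only 2176 of the 128 × 2048 available calibration tokens are consumed per pass, consistent with the small token count (17) the paper reports as optimal.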