Radio: Rate–Distortion Optimization for Large Language Model Compression

Authors: Sean I. Young

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To study the rate–distortion behavior of a typical quantized LLM, we apply Algorithm 1 to the quantization of the Meta Open Pretrained Transformer (OPT) (S. Zhang et al., 2022) and Llama-2 (Touvron et al., 2023) families of language models (obtained from the Hugging Face Hub), comparing the performance of the proposed quantization method with baselines on next-token prediction and question-answering tasks.
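The rate–distortion behavior studied in the paper (distortion falling as the bit rate rises) can be illustrated with a minimal uniform-quantization sketch. This toy example is ours, not the paper's Algorithm 1: it quantizes a list of synthetic weights at two bit depths and compares the mean-squared distortion.

```python
def quantize_uniform(weights, bits):
    """Uniformly quantize weights to 2**bits levels over their range."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (2 ** bits - 1)
    return [round((w - lo) / scale) * scale + lo for w in weights]

def mse(a, b):
    """Mean-squared distortion between original and quantized weights."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# synthetic stand-in for one row of a weight matrix
weights = [0.013 * i - 0.5 for i in range(80)]

d2 = mse(weights, quantize_uniform(weights, 2))  # 2-bit rate
d4 = mse(weights, quantize_uniform(weights, 4))  # 4-bit rate
# higher rate -> lower distortion, the basic rate-distortion trade-off
assert d4 < d2
```

The same monotone trade-off is what a rate–distortion optimizer exploits: it spends bits where they reduce distortion the most.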
Researcher Affiliation | Academia | 1 Martinos Center, Harvard Medical School, Boston, MA, USA. 2 Computer Science and Artificial Intelligence Lab (CSAIL), MIT, Cambridge, MA, USA. Correspondence to: Sean I. Young <EMAIL>.
Pseudocode | Yes | Algorithm 1. Radio: Rate–Distortion Optimization for LLM Compression
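Algorithm 1 itself is only named here. As a hedged illustration of what a rate–distortion bit allocation can look like, the classical closed-form rule assigns each weight group a bit depth that grows with its variance while keeping the average at the budget. This sketch uses the textbook formula and our own names; it is not the paper's algorithm.

```python
import math

def allocate_bits(variances, avg_bits):
    """Classical R-D allocation: b_i = b_avg + 0.5 * log2(var_i / geometric_mean)."""
    geo_mean = math.exp(sum(math.log(v) for v in variances) / len(variances))
    return [avg_bits + 0.5 * math.log2(v / geo_mean) for v in variances]

# four weight groups with increasing variance, 4-bit average budget
bits = allocate_bits([0.5, 1.0, 2.0, 4.0], avg_bits=4.0)

# higher-variance groups receive more bits...
assert bits == sorted(bits)
# ...while the mean allocation stays at the 4-bit budget
assert abs(sum(bits) / len(bits) - 4.0) < 1e-9
```

In practice, allocations are further clamped to non-negative integers; the continuous formula above is only the starting point.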
Open Source Code | Yes | To ensure the reproducibility of results in this work, we make our PyTorch Radio program available on our GitHub project website, where readers can also ask questions about this work.
Open Datasets | Yes | For calibration data, we source 128 examples from the training split of the C4 dataset (Raffel et al., 2020). We test on the test splits of WikiText2 (Merity et al., 2022) and C4 for next-token prediction and those of GSM8K (Cobbe et al., 2021), ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2019), and WinoGrande (Sakaguchi et al., 2021) for question-answering tasks.
Dataset Splits | Yes | For calibration data, we source 128 examples from the training split of the C4 dataset (Raffel et al., 2020). We test on the test splits of WikiText2 (Merity et al., 2022) and C4 for next-token prediction and those of GSM8K (Cobbe et al., 2021), ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2019), and WinoGrande (Sakaguchi et al., 2021) for question-answering tasks.
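The calibration protocol (128 examples, with a handful of tokens drawn from each 2048-token sequence, per the Experiment Setup row) can be mimicked on synthetic data. Dataset loading is elided; the sequence lengths, counts, and helper names below are a sketch under our own assumptions, not the paper's loader.

```python
import random

SEQ_LEN, NUM_EXAMPLES, TOKENS_PER_SEQ = 2048, 128, 17

random.seed(0)
# stand-in for 128 tokenized training examples of length 2048
examples = [[random.randrange(32000) for _ in range(SEQ_LEN)]
            for _ in range(NUM_EXAMPLES)]

def sample_calibration_tokens(seq, k, rng):
    """Draw k token positions without replacement, in order, from one sequence."""
    return [seq[i] for i in sorted(rng.sample(range(len(seq)), k))]

rng = random.Random(1)
calib = [sample_calibration_tokens(seq, TOKENS_PER_SEQ, rng) for seq in examples]

assert len(calib) == 128
assert all(len(c) == 17 for c in calib)
```

Subsampling tokens per sequence keeps the calibration set small while still exposing the quantizer to varied contexts.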
Hardware Specification | Yes | In terms of execution time, Radio (for 64 iterations) and OWQ/GPTQ require 47 minutes and 18 minutes, respectively (excluding testing), to quantize the 7B model on an Nvidia A100. [...] our custom CUDA kernel leads to a 3.8x speedup over the FP16 matrix–vector multiply performed using the default cuBLAS matmul on an Nvidia A6000.
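Speedup figures like the quoted 3.8x are typically obtained with a best-of-N wall-clock harness. The sketch below shows only the measurement pattern, comparing two stand-in Python matrix–vector routines rather than the paper's CUDA kernel against cuBLAS; all names are ours.

```python
import time

def bench(fn, *args, reps=5):
    """Return the best wall-clock time of fn(*args) over reps runs."""
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

def matvec_baseline(m, v):
    """Index-based matrix-vector multiply (the 'reference' implementation)."""
    return [sum(row[i] * v[i] for i in range(len(v))) for row in m]

def matvec_candidate(m, v):
    """zip-based matrix-vector multiply (the 'optimized' implementation)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

n = 200
m = [[float(i + j) for j in range(n)] for i in range(n)]
v = [1.0] * n

# report baseline time / candidate time, as in the quoted 3.8x figure
speedup = bench(matvec_baseline, m, v) / bench(matvec_candidate, m, v)
assert speedup > 0
```

Taking the best of several runs, as here, reduces noise from scheduler jitter; GPU benchmarks additionally need device synchronization before each timestamp.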
Software Dependencies | No | To ensure the reproducibility of results in this work, we make our PyTorch Radio program available on our GitHub project website, where readers can also ask questions about this work. Appendix A lists our CUDA kernel. Appendices B and C provide derivations for our main theoretical results and Appendix D additionally details the PyTorch code and command-line options used to obtain the results of GPTQ (Frantar et al., 2022), OWQ (Lee et al., 2024), and AWQ (Lin et al., 2024). The paper mentions "PyTorch" and "CUDA kernel" but does not specify their version numbers or the versions of other software dependencies.
Experiment Setup | Yes | We use a combined row–column group size of 512 for OPT (768 for 125M, 66B) and 256 for Llama-2 models, a batch size of 16, and 17 tokens from each token sequence of length 2048, and optimize for a maximum of 64 iterations. The optimal hyperparameter values are batch size: 16, token count: 17, and group size: 512.
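The group-size setting can be made concrete with a small partitioning sketch: flattening a weight matrix and chunking it into fixed-size quantization groups. This simplifies the paper's combined row–column grouping to plain contiguous chunks; the function name and layout are our own.

```python
def partition_into_groups(rows, cols, group_size):
    """Split a rows x cols weight matrix (flattened) into fixed-size index groups."""
    total = rows * cols
    assert total % group_size == 0, "matrix size must divide evenly into groups"
    return [range(i, i + group_size) for i in range(0, total, group_size)]

# e.g. a 768 x 768 projection matrix with the OPT group size of 512
groups = partition_into_groups(768, 768, 512)

assert len(groups) == 768 * 768 // 512   # 1152 groups
assert all(len(g) == 512 for g in groups)
```

Each group then receives its own quantization parameters (and, under rate–distortion optimization, its own bit allocation), which is why the group size trades accuracy against metadata overhead.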