QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs
Authors: Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, James Hensman
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We implement QuaRot using Hugging Face [Wolf et al., 2019] on top of the PyTorch framework [Paszke et al., 2019]. To quantize the inputs, we use per-token symmetric quantization (a single scale for every row) with a constant clipping ratio of 0.9 in all our experiments. We quantize the KV caches using asymmetric quantization with a group size of 128 and a constant clipping ratio of 0.95. For weight quantization, we use round-to-nearest (RTN) and GPTQ [Frantar et al., 2022] with per-column (also known as per-channel) symmetric quantization, where we extract the clipping ratio using a linear search over the squared error. We use 128 samples from the WikiText-2 [Merity et al., 2016] training set with 2048 sequence length as the calibration set during GPTQ quantization. |
| Researcher Affiliation | Collaboration | Saleh Ashkboos ETH Zurich EMAIL Amirkeivan Mohtashami EPFL EMAIL Maximilian L. Croci Microsoft Research EMAIL Bo Li ETH Zurich EMAIL Pashmina Cameron Microsoft EMAIL Martin Jaggi EPFL EMAIL Dan Alistarh IST Austria & Neural Magic EMAIL Torsten Hoefler ETH Zurich EMAIL James Hensman Microsoft Research EMAIL |
| Pseudocode | No | The paper contains flow diagrams (Figures 2, 3, 5, 6) but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at github.com/spcl/QuaRot. |
| Open Datasets | Yes | We use 128 samples from the WikiText-2 [Merity et al., 2016] training set with 2048 sequence length as the calibration set during GPTQ quantization. |
| Dataset Splits | No | The paper mentions using the "WikiText-2 [Merity et al., 2016] training set" for calibration, but does not explicitly provide specific training/validation/test dataset splits (percentages, sample counts, or explicit references to standard splits with citations) for reproduction. |
| Hardware Specification | Yes | On a single NVIDIA A100 GPU, modifying LLAMA2-70B with QuaRot takes 5 minutes and quantizing the model with GPTQ takes a further 2 hours. As we target consumer-type GPUs, we evaluate all the performance experiments on NVIDIA RTX 3090 GPUs. |
| Software Dependencies | No | The paper mentions software like Hugging Face, PyTorch, CUTLASS, and FlashInfer, and specifies "CUDA/12.1". However, it does not provide specific version numbers for the other key software libraries used (e.g., PyTorch version, Hugging Face Transformers version). |
| Experiment Setup | Yes | To quantize the inputs, we use per-token symmetric quantization (a single scale for every row) with a constant clipping ratio of 0.9 in all our experiments. We quantize the KV caches using asymmetric quantization with a group size of 128 and a constant clipping ratio of 0.95. For weight quantization, we use round-to-nearest (RTN) and GPTQ [Frantar et al., 2022] with per-column (also known as per-channel) symmetric quantization, where we extract the clipping ratio using a linear search over the squared error. We use 128 samples from the WikiText-2 [Merity et al., 2016] training set with 2048 sequence length as the calibration set during GPTQ quantization. |
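The activation-quantization setup quoted above (per-token symmetric quantization with one scale per row and a fixed clipping ratio) can be sketched as follows. This is a minimal NumPy illustration of the general scheme, not QuaRot's CUDA implementation; the function name and the use of signed 4-bit integer codes are assumptions for the sketch.

```python
import numpy as np

def quantize_per_token_symmetric(x, bits=4, clip_ratio=0.9):
    """Sketch: symmetric quantization with one scale per token (row).

    The clipping ratio shrinks the per-row absolute maximum before the
    scale is computed, matching the "constant clipping ratio of 0.9"
    described for input quantization. Details beyond the quoted text
    (e.g. the exact integer range) are assumptions.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit codes
    # One scale per row: clipped max-abs value mapped to qmax.
    scale = np.abs(x).max(axis=-1, keepdims=True) * clip_ratio / qmax
    scale = np.maximum(scale, 1e-8)  # guard against all-zero rows
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    """Recover an approximate float tensor from codes and scales."""
    return q.astype(np.float32) * scale

# Example: quantize a small batch of token activations.
x = np.random.default_rng(0).standard_normal((4, 128)).astype(np.float32)
q, s = quantize_per_token_symmetric(x)
x_hat = dequantize(q, s)
```

The KV-cache scheme in the paper differs in two ways: it is asymmetric (a zero-point in addition to a scale) and uses groups of 128 elements rather than whole rows, with a clipping ratio of 0.95.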