QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs
Authors: Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, James Hensman
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We implement QuaRot using Hugging Face [Wolf et al., 2019] on top of the PyTorch framework [Paszke et al., 2019]. To quantize the inputs, we use per-token symmetric quantization (a single scale for every row) with a constant clipping ratio of 0.9 in all our experiments. We quantize the KV caches using asymmetric quantization with a group size of 128 and a constant clipping ratio of 0.95. For weight quantization, we use round-to-nearest (RTN) and GPTQ [Frantar et al., 2022] with per-column (also known as per-channel) symmetric quantization, where we extract the clipping ratio using a linear search over the squared error. We use 128 samples from the WikiText-2 [Merity et al., 2016] training set with 2048 sequence length as the calibration set during GPTQ quantization. |
| Researcher Affiliation | Collaboration | Saleh Ashkboos ETH Zurich EMAIL Amirkeivan Mohtashami EPFL EMAIL Maximilian L. Croci Microsoft Research EMAIL Bo Li ETH Zurich EMAIL Pashmina Cameron Microsoft EMAIL Martin Jaggi EPFL EMAIL Dan Alistarh IST Austria & Neural Magic EMAIL Torsten Hoefler ETH Zurich EMAIL James Hensman Microsoft Research EMAIL |
| Pseudocode | No | The paper contains flow diagrams (Figures 2, 3, 5, 6) but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at github.com/spcl/QuaRot. |
| Open Datasets | Yes | We use 128 samples from the WikiText-2 [Merity et al., 2016] training set with 2048 sequence length as the calibration set during GPTQ quantization. |
| Dataset Splits | No | The paper mentions using the "WikiText-2 [Merity et al., 2016] training set" for calibration, but does not explicitly provide specific training/validation/test dataset splits (percentages, sample counts, or explicit references to standard splits with citations) for reproduction. |
| Hardware Specification | Yes | On a single NVIDIA A100 GPU, modifying LLAMA2-70B with QuaRot takes 5 minutes and quantizing the model with GPTQ takes a further 2 hours. As we target consumer-type GPUs, we evaluate all the performance experiments on NVIDIA RTX 3090 GPUs. |
| Software Dependencies | No | The paper mentions software like Hugging Face, PyTorch, CUTLASS, and FlashInfer, and specifies "CUDA/12.1". However, it does not provide specific version numbers for the other key software libraries used (e.g., PyTorch version, Hugging Face Transformers version). |
| Experiment Setup | Yes | To quantize the inputs, we use per-token symmetric quantization (a single scale for every row) with a constant clipping ratio of 0.9 in all our experiments. We quantize the KV caches using asymmetric quantization with a group size of 128 and a constant clipping ratio of 0.95. For weight quantization, we use round-to-nearest (RTN) and GPTQ [Frantar et al., 2022] with per-column (also known as per-channel) symmetric quantization, where we extract the clipping ratio using a linear search over the squared error. We use 128 samples from the WikiText-2 [Merity et al., 2016] training set with 2048 sequence length as the calibration set during GPTQ quantization. |
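The activation-quantization setup quoted above (per-token symmetric quantization with one scale per row and a fixed clipping ratio) can be sketched as follows. This is a minimal NumPy illustration of the general scheme, not QuaRot's CUDA implementation; the function name and the use of signed 4-bit integer codes are assumptions for the sketch.

```python
import numpy as np

def quantize_per_token_symmetric(x, bits=4, clip_ratio=0.9):
    """Sketch: symmetric quantization with one scale per token (row).

    The clipping ratio shrinks the per-row absolute maximum before the
    scale is computed, matching the "constant clipping ratio of 0.9"
    described for input quantization. Details beyond the quoted text
    (e.g. the exact integer range) are assumptions.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit codes
    # One scale per row: clipped max-abs value mapped to qmax.
    scale = np.abs(x).max(axis=-1, keepdims=True) * clip_ratio / qmax
    scale = np.maximum(scale, 1e-8)  # guard against all-zero rows
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    """Recover an approximate float tensor from codes and scales."""
    return q.astype(np.float32) * scale

# Example: quantize a small batch of token activations.
x = np.random.default_rng(0).standard_normal((4, 128)).astype(np.float32)
q, s = quantize_per_token_symmetric(x)
x_hat = dequantize(q, s)
```

The KV-cache scheme in the paper differs in two ways: it is asymmetric (a zero-point in addition to a scale) and uses groups of 128 elements rather than whole rows, with a clipping ratio of 0.95.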