Quamba: A Post-Training Quantization Recipe for Selective State Space Models
Authors: Hung-Yueh Chiang, Chi-Chih Chang, Natalia Frumkin, Kai-Chiang Wu, Diana Marculescu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our 8-bit weight-activation quantized Mamba 2.8B SSM benefits from hardware acceleration and achieves 1.72× lower generation latency on an Nvidia Orin Nano 8G, with only a 0.9% drop in average accuracy on zero-shot tasks. When quantizing Jamba, a 52B-parameter SSM-style language model, we observe only a 1% drop in accuracy, demonstrating that our SSM quantization method is both effective and scalable to large language models, which require appropriate compression techniques for deployment. The experiments demonstrate the effectiveness and practical applicability of our approach for deploying SSM-based models of all sizes on both cloud and edge platforms. |
| Researcher Affiliation | Academia | Hung-Yueh Chiang (1), Chi-Chih Chang (2,3), Natalia Frumkin (1), Kai-Chiang Wu (2), Diana Marculescu (1); (1) Chandra Family Department of Electrical and Computer Engineering, The University of Texas at Austin; (2) Department of Computer Science, National Yang Ming Chiao Tung University; (3) Department of Electrical and Computer Engineering, Cornell University |
| Pseudocode | No | The paper describes methods using mathematical equations and textual descriptions, but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is released at https://github.com/enyac-group/Quamba. |
| Open Datasets | Yes | For zero-shot tasks, we use LM-EVAL (Gao et al., 2023), on which we evaluate baselines and Quamba on LAMBADA (Paperno et al., 2016), HellaSwag (Zellers et al.), PIQA (Bisk et al., 2020), ARC (Clark et al., 2018), and WinoGrande (Sakaguchi et al., 2020). For perplexity, we evaluate the models on the test set of WikiText2 (Merity et al., 2016) and a randomly sampled subset of the validation set of the Pile dataset (Gao et al., 2021). |
| Dataset Splits | Yes | The calibration set is constructed by randomly sampling 512 sentences from the Pile dataset (Gao et al., 2021). ... In panel (a), we profile TTLT (time-to-last-token) in seconds, with 512 input tokens and 512 generated tokens on Nano 8G. On A5000, we increase the length to 2048 for both input and generated tokens, as shown in panel (b). ... For perplexity, we evaluate the models on the test set of WikiText2 (Merity et al., 2016) and a randomly sampled subset of the validation set of the Pile dataset (Gao et al., 2021). |
| Hardware Specification | Yes | Our 8-bit weight-activation quantized Mamba 2.8B SSM benefits from hardware acceleration and achieves 1.72× lower generation latency on an Nvidia Orin Nano 8G... We evaluate all methods on the A5000, a widely used GPU for AI workloads with 24GB of memory, emulating the setting for cloud applications. For the case of edge applications, we profile all methods on the Nvidia Orin Nano 8G. |
| Software Dependencies | No | The paper mentions the "CUTLASS library (Thakkar et al., 2023)" and the "PyTorch cuDNN auto-tuner", but does not provide specific version numbers for these or other software components. |
| Experiment Setup | Yes | We suppress the maximum values in the input activations to SSMs, which are the most sensitive to quantization error, to obtain finer quantization precision. For the extreme outliers in the output activations of SSMs, we use the Hadamard transform to smooth out the activations. ... We collect the static scaling factors for each operator based on the absolute maximum value observed on the calibration set and apply symmetric per-tensor quantization for weights and activations, except for the input to the SSM, where we use the 99.999th percentile (i.e., the p described in Section 4.2) to clip the maximum values. ... We perform a few warm-up iterations and then report the average latency of the following 100 iterations. |
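The experiment-setup row above describes three ingredients of the quantization recipe: static scale calibration from absolute maxima, percentile (99.999th) clipping for the sensitive SSM input, and a Hadamard transform to spread output outliers across channels. The sketch below illustrates these mechanics in NumPy; it is a minimal toy illustration, not the paper's released implementation (function names and the toy calibration data are my own, and the real system fuses these steps into CUTLASS int8 kernels):

```python
import numpy as np

def hadamard(n):
    # Sylvester construction of an orthonormal n x n Hadamard matrix
    # (n must be a power of two).
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def static_scale(calib_acts, percentile=None):
    # Static per-tensor int8 scale from calibration activations.
    # percentile=None   -> absolute-maximum calibration (most operators);
    # percentile=99.999 -> clipped maximum for the SSM input (the paper's p).
    flat = np.abs(np.concatenate([a.ravel() for a in calib_acts]))
    m = np.percentile(flat, percentile) if percentile is not None else flat.max()
    return m / 127.0  # symmetric int8 range

def quantize_int8(x, scale):
    # Symmetric per-tensor int8 quantization.
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

# Toy calibration: Gaussian activations with one injected extreme outlier.
rng = np.random.default_rng(0)
calib = [rng.standard_normal((512, 64)) for _ in range(8)]
calib[0][0, 0] = 100.0

s_max = static_scale(calib)                    # dominated by the outlier
s_p = static_scale(calib, percentile=99.999)   # percentile clipping shrinks it
assert s_p < s_max  # finer precision for the bulk of the distribution

# Hadamard transform spreads a single outlier spike across all channels,
# reducing the per-tensor dynamic range before quantization.
H = hadamard(64)
x = np.zeros(64)
x[0] = 100.0
y = H @ x
assert np.isclose(np.abs(y).max(), 100.0 / np.sqrt(64))
```

The two assertions capture why each step helps: percentile clipping yields a much smaller scale (hence finer resolution) than abs-max when a single outlier is present, and the orthonormal Hadamard rotation shrinks a spike's peak magnitude by a factor of sqrt(n) while preserving the signal exactly (it can be inverted by applying H again).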