SVDQuant: Absorbing Outliers by Low-Rank Component for 4-Bit Diffusion Models
Authors: Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, Song Han
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on SDXL, PixArt-Σ, and FLUX.1 validate the effectiveness of SVDQuant in preserving image quality. We reduce the memory usage for the 12B FLUX.1 models by 3.5×, achieving 3.0× speedup over the 4-bit weight-only quantization (W4A16) baseline on the 16GB laptop 4090 GPU with INT4 precision. On the latest RTX 5090 desktop with Blackwell architecture, we achieve a 3.1× speedup compared to the W4A16 model using NVFP4 precision. Our quantization library* and inference engine are open-sourced. |
| Researcher Affiliation | Collaboration | 1MIT 2NVIDIA 3CMU 4Princeton 5UC Berkeley 6SJTU 7Pika Labs |
| Pseudocode | No | The paper describes the SVDQuant method and Nunchaku inference engine conceptually and with figures (e.g., Figure 3: Overview of SVDQuant), but does not include a formal pseudocode or algorithm block. |
| Open Source Code | Yes | Our quantization library* and inference engine are open-sourced. *Quantization library: github.com/mit-han-lab/deepcompressor Inference Engine: github.com/mit-han-lab/nunchaku |
| Open Datasets | Yes | Datasets. Following previous works (Li et al., 2023a; Zhao et al., 2024c;b), we randomly sample the prompts in COCO Captions 2024 (Chen et al., 2015) for calibration. To evaluate the generalization capability of our method, we sample 5K prompts from the MJHQ-30K (Li et al., 2024a) and the summarized Densely Captioned Images (sDCI) (Urbanek et al., 2024) for benchmarking. |
| Dataset Splits | No | The paper mentions using specific numbers of prompts from datasets for calibration and benchmarking (e.g., "randomly sample the prompts in COCO Captions 2024 for calibration" and "sample 5K prompts from the MJHQ-30K... and the summarized Densely Captioned Images (sDCI)... for benchmarking"). However, it does not provide explicit training, test, and validation splits for the *model training process* itself, nor specific percentages or sample counts for these splits if they were used for developing the SVDQuant method. The models benchmarked are pre-trained diffusion models. |
| Hardware Specification | Yes | By eliminating CPU offloading, it offers 8.7× speedup over the 16-bit model on a 16GB laptop 4090 GPU, 3× faster than the NF4 W4A16 baseline. ...on the 16GB laptop-level RTX 4090 and desktop-level RTX 5090 GPU, respectively. ...on the latest RTX 5090 desktop with Blackwell architecture. |
| Software Dependencies | No | The paper mentions several tools and frameworks such as "TensorRT" and "GPTQ (Frantar et al., 2023)", and references "NVIDIA Corporation. Block Scaling in cuDNN Frontend API, 2025." However, it does not provide specific version numbers for these software components or any other libraries like PyTorch or CUDA versions that would be necessary for full reproducibility. |
| Experiment Setup | Yes | For the 8-bit setting, we use per-token dynamic activation quantization and per-channel weight quantization with a low-rank branch of rank 16. For the 4-bit setting, we adopt per-group symmetric quantization for both activations and weights, along with a low-rank branch of rank 32. INT4 quantization uses a group size of 64 with 16-bit scales. We use NVFP4 for FP4 quantization, which has native hardware support for a group size of 16 with FP8 scales on Blackwell GPUs (NVIDIA Corporation, 2025). We use GPTQ (Frantar et al., 2023) to quantize the residual weights. For FLUX.1 models, the inputs of linear layers in adaptive normalization are kept in 16 bits (i.e., W4A16). For other models, key and value projections in the cross-attention are retained at 16 bits since their latency covers less than 5% of total runtime. The smoothing factor λ ∈ ℝ^m is a per-channel vector whose i-th element is computed as λ_i = max(\|X_{:,i}\|)^α / max(\|W_{i,:}\|)^{1−α}, following SmoothQuant (Xiao et al., 2023). Here, X ∈ ℝ^{b×m} and W ∈ ℝ^{m×n}. The migration strength α is decided offline by searching, for each layer, for the value that minimizes the layer output mean squared error (MSE) after SVD on the calibration dataset. ...In our experiments, we select a rank of 32, which offers a decent quality with minor overhead. |
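The smoothing-plus-SVD step described in the setup row above can be illustrated with a minimal NumPy sketch. This is not the released deepcompressor implementation; the function names, shapes, and the default rank are illustrative. It computes the per-channel smoothing factor λ_i = max(|X_{:,i}|)^α / max(|W_{i,:}|)^{1−α}, migrates outliers into the weight by scaling, and splits off a rank-r component via truncated SVD, leaving a residual that would then be 4-bit quantized (e.g., with GPTQ):

```python
import numpy as np

def smoothing_factor(X, W, alpha=0.5, eps=1e-8):
    """Per-channel smoothing factor, following the SmoothQuant-style rule.

    X: calibration activations, shape (b, m); W: weight, shape (m, n).
    Returns lambda of shape (m,), lambda_i = max|X[:, i]|^alpha / max|W[i, :]|^(1-alpha).
    """
    act_max = np.abs(X).max(axis=0) + eps  # (m,) per-input-channel activation range
    w_max = np.abs(W).max(axis=1) + eps    # (m,) per-input-channel weight range
    return act_max ** alpha / w_max ** (1.0 - alpha)

def low_rank_branch(W, lam, rank=32):
    """Split the smoothed weight into a low-rank branch plus a residual.

    The smoothed weight W_hat = diag(lam) @ W absorbs activation outliers;
    truncated SVD yields L1 @ L2 (the 16-bit low-rank branch), and the
    residual R = W_hat - L1 @ L2 is what gets quantized to 4 bits.
    """
    W_hat = lam[:, None] * W
    U, S, Vt = np.linalg.svd(W_hat, full_matrices=False)
    L1 = U[:, :rank] * S[:rank]  # (m, rank), singular values folded into U
    L2 = Vt[:rank]               # (rank, n)
    residual = W_hat - L1 @ L2
    return L1, L2, residual
```

The decomposition is exact by construction (L1 @ L2 + residual reconstructs the smoothed weight), and the Eckart-Young theorem guarantees the rank-r SVD minimizes the residual's norm, which is why the residual is easier to quantize than the original weight.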