Rotated Runtime Smooth: Training-Free Activation Smoother for accurate INT4 inference
Authors: Ke Yi, Zengke Liu, Jianwei Zhang, Chengyuan Li, Tong Zhang, Junyang Lin, Jingren Zhou
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate the proposed method, we conducted experiments on LLaMA families, Qwen families, Mistral, etc. On LLaMA3-70B, Rotated Runtime Smooth can gain perplexity improvement from 57.33 to 6.66 under A4W4 quantization compared with the state-of-the-art. We conducted experiments on mainstream LLMs, including LLaMA families (Touvron et al., 2023b) (LLaMA2-13B, LLaMA2-70B, LLaMA3-8B, LLaMA3-70B, LLaMA3.1-8B, LLaMA3.1-70B), Qwen families (Yang et al., 2024) (Qwen1.5-7B, Qwen1.5-14B), Mistral (Jiang et al., 2023) and Mixtral (Jiang et al., 2024). We evaluate the performance of the models on WikiText-2 perplexity and zero-shot Common Sense QA benchmarks. |
| Researcher Affiliation | Collaboration | 1 South China University of Technology 2 Alibaba Group 3 Key Laboratory of System Software (CAS) and State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences 4 University of Chinese Academy of Sciences. |
| Pseudocode | No | The paper describes procedures in natural language, such as in Section 3.2, which outlines the pipeline of Runtime Smooth: "As shown in Figure 4, the pipeline of GEMM fused with Runtime Smooth can be described as 1. Reorder the activation and weight according to channel-wise maximums of activation; 2. Group up the activation and set the group-wise maximum as smoothing scales; 3. Calculate the matrix multiplication of the tiled block and multiply the runtime scale on the dequantized result, followed by a reduction operation." However, it does not present any clearly labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | No | Project page: https://coco58323.github.io/rrs2024.github.io/. The provided URL is a project page (github.io domain) and not a direct link to a code repository. The paper does not contain an unambiguous statement of code release or a direct repository link. |
| Open Datasets | Yes | We evaluate the performance of the models on WikiText-2 perplexity and zero-shot Common Sense QA benchmarks. The Common Sense QA benchmarks include ARC-e, ARC-c (Clark et al., 2018), BoolQ (Clark et al., 2019), and OBQA (Mihaylov et al., 2018). |
| Dataset Splits | Yes | We apply standard GPTQ settings by using 128 samples from WikiText-2 with a sequence length of 2048 as the calibration set. |
| Hardware Specification | Yes | We evaluate the GEMM kernel fused with Runtime Smooth on NVBench (NVIDIA, 2024) with RTX 4070 Ti, as shown in Figure 6. |
| Software Dependencies | No | The paper mentions "NVBench (NVIDIA, 2024)" as a benchmarking tool but does not specify any software libraries or frameworks with version numbers used for implementing their methodology. |
| Experiment Setup | Yes | Activation quantization employs a per-channel symmetric scheme with the round-to-nearest (RTN) strategy. KV cache quantization employs a sub-channel symmetric scheme with group size 128 and the round-to-nearest (RTN) strategy. In most cases, weight quantization employs a per-channel symmetric scheme with the GPTQ (Frantar et al., 2022) strategy, except for the RTN baseline. We apply standard GPTQ settings by using 128 samples from WikiText-2 with a sequence length of 2048 as the calibration set. The group size of the smoothing scale is 128, the same as the block size of the GEMM kernel. |
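
The per-channel symmetric RTN scheme described in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function names are hypothetical, and the INT4 clamp range [-8, 7] follows the usual signed 4-bit convention.

```python
# Minimal sketch of round-to-nearest (RTN) symmetric quantization,
# as used for activations in the paper's setup. Illustrative only.

def rtn_quantize_symmetric(values, bits=4):
    """Quantize a list of floats to symmetric signed integers via RTN."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for INT4
    scale = max(abs(v) for v in values) / qmax or 1.0
    # Round to nearest, then clamp to the signed range [-2^(b-1), 2^(b-1)-1].
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale

def rtn_dequantize(q, scale):
    """Recover approximate float values from the quantized integers."""
    return [x * scale for x in q]

channel = [0.1, -2.0, 0.5, 1.5]
q, s = rtn_quantize_symmetric(channel)
approx = rtn_dequantize(q, s)
```

Because the scale is set by the channel-wise maximum, a single large outlier (here -2.0) stretches the scale and coarsens the resolution of every other value in the channel — precisely the problem the paper's smoothing is designed to mitigate.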
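
The three-step Runtime Smooth pipeline quoted in the Pseudocode row (reorder channels by activation maxima, use the group-wise maximum as the smoothing scale, then fold the scale back in during the reduction) can be sketched in plain Python. This is a hedged illustration of the arithmetic only: quantization is elided for clarity, all names are hypothetical, and the authors' actual method is a fused CUDA GEMM kernel.

```python
# Sketch of the Runtime Smooth matvec pipeline from Section 3.2.
# Quantization is omitted; this shows only the reorder/smooth/rescale flow.

def runtime_smooth_matvec(activation, weight_rows, group_size=2):
    """activation: one token's C channel values;
    weight_rows: weight_rows[c] is the weight row for channel c.
    Computes sum_c activation[c] * weight_rows[c]."""
    channels = len(activation)
    # Step 1: reorder channels by channel-wise |max| of the activation,
    # and permute the weight rows to match.
    order = sorted(range(channels), key=lambda c: -abs(activation[c]))
    act = [activation[c] for c in order]
    w = [weight_rows[c] for c in order]

    out = [0.0] * len(weight_rows[0])
    for g in range(0, channels, group_size):
        # Step 2: group-wise maximum is the smoothing scale for this group.
        scale = max(abs(a) for a in act[g:g + group_size]) or 1.0
        for c in range(g, min(g + group_size, channels)):
            smoothed = act[c] / scale          # in [-1, 1] within the group
            # Step 3: multiply the runtime scale back on the partial
            # product, then accumulate (the reduction).
            for j in range(len(out)):
                out[j] += smoothed * w[c][j] * scale
    return out
```

Because the smoothing scale is divided out before quantization and multiplied back during the reduction, the result is mathematically identical to the plain matvec in full precision; the benefit appears only once the smoothed activations are quantized, since each group's values now share a comparable magnitude.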