Rotated Runtime Smooth: Training-Free Activation Smoother for accurate INT4 inference

Authors: Ke Yi, Zengke Liu, Jianwei Zhang, Chengyuan Li, Tong Zhang, Junyang Lin, Jingren Zhou

ICLR 2025

Reproducibility assessment (columns: Variable | Result | LLM Response)
Research Type | Experimental | To evaluate the proposed method, we conducted experiments on LLaMA families, Qwen families, Mistral, etc. On LLaMA3-70B, Rotated Runtime Smooth can gain perplexity improvement from 57.33 to 6.66 under A4W4 quantization compared with the state-of-the-art. We conducted experiments on mainstream LLMs, including LLaMA families (Touvron et al., 2023b) (LLaMA2-13B, LLaMA2-70B, LLaMA3-8B, LLaMA3-70B, LLaMA3.1-8B, LLaMA3.1-70B), Qwen families (Yang et al., 2024) (Qwen1.5-7B, Qwen1.5-14B), Mistral (Jiang et al., 2023) and Mixtral (Jiang et al., 2024). We evaluate the performance of the models on WikiText-2 perplexity and zero-shot Common Sense QA benchmarks.
Researcher Affiliation | Collaboration | 1 South China University of Technology; 2 Alibaba Group; 3 Key Laboratory of System Software (CAS) and State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences; 4 University of Chinese Academy of Sciences.
Pseudocode | No | The paper describes procedures in natural language, such as in Section 3.2, which outlines the pipeline of Runtime Smooth: "As shown in Figure 4, the pipeline of GEMM fused with Runtime Smooth can be described as 1. Reorder the activation and weight according to channel-wise maximums of activation; 2. Group up the activation and set the group-wise maximum as smoothing scales; 3. Calculate the matrix multiplication of the tiled block and multiply the runtime scale on the dequantized result, followed by a reduction operation." However, it does not present any clearly labeled 'Pseudocode' or 'Algorithm' block.
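The three-step pipeline quoted above can be sketched in NumPy. This is a toy illustration of the data flow only, not the paper's fused CUDA kernel; the function name, `group_size`, and the fake-quantization details are my own assumptions:

```python
import numpy as np

def runtime_smooth_matmul(x, w, group_size=4, n_bits=4):
    """Toy sketch of the Runtime Smooth GEMM pipeline:
    1. reorder channels by activation channel-wise maxima,
    2. group channels and take group-wise maxima as smoothing scales,
    3. multiply quantized tiles, rescale the dequantized partial
       results, and reduce (accumulate) them.
    """
    # 1. Reorder activation columns (and matching weight rows) by
    #    descending per-channel absolute maximum.
    ch_max = np.abs(x).max(axis=0)
    order = np.argsort(-ch_max)
    x_r, w_r = x[:, order], w[order, :]

    qmax = 2 ** (n_bits - 1) - 1  # 7 for symmetric INT4
    out = np.zeros((x.shape[0], w.shape[1]))
    for g in range(0, x_r.shape[1], group_size):
        xg = x_r[:, g:g + group_size]
        wg = w_r[g:g + group_size, :]
        # 2. Group-wise maximum as the runtime smoothing scale.
        scale = np.abs(xg).max(axis=1, keepdims=True) / qmax
        scale[scale == 0] = 1.0
        # Fake-quantize the smoothed activation group.
        q = np.clip(np.round(xg / scale), -qmax - 1, qmax)
        # 3. Tiled matmul, rescale the dequantized result, reduce.
        out += (q @ wg) * scale
    return out
```

Reordering first puts channels of similar magnitude into the same group, so each group-wise scale tracks its channels closely and outliers do not inflate the scale of every channel.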
Open Source Code | No | Project page: https://coco58323.github.io/rrs2024.github.io/. The provided URL is a project page (github.io domain) and not a direct link to a code repository. The paper does not contain an unambiguous statement of code release or a direct repository link.
Open Datasets | Yes | We evaluate the performance of the models on WikiText-2 perplexity and zero-shot Common Sense QA benchmarks. The Common Sense QA benchmarks include ARC-e, ARC-c (Clark et al., 2018), BoolQ (Clark et al., 2019), and OBQA (Mihaylov et al., 2018).
Dataset Splits | Yes | We apply standard GPTQ settings by using 128 samples from WikiText-2 with a sequence length of 2048 as the calibration set.
Hardware Specification | Yes | We evaluate the GEMM kernel fused with Runtime Smooth on NVBench (NVIDIA, 2024) with RTX 4070 Ti, as shown in Figure 6.
Software Dependencies | No | The paper mentions "NVBench (NVIDIA, 2024)" as a benchmarking tool but does not specify any software libraries or frameworks with version numbers used for implementing their methodology.
Experiment Setup | Yes | Activation quantization employs a per-channel symmetric scheme with the round-to-nearest (RTN) strategy. KV cache quantization employs a sub-channel symmetric scheme with group size 128 and the round-to-nearest (RTN) strategy. In most cases, weight quantization employs a per-channel symmetric scheme with the GPTQ (Frantar et al., 2022) strategy, except for the RTN baseline. We apply standard GPTQ settings by using 128 samples from WikiText-2 with a sequence length of 2048 as the calibration set. The group size of the smoothing scale is 128, the same as the block size of the GEMM kernel.
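The symmetric RTN scheme referenced throughout the setup can be sketched as follows. This is a minimal NumPy illustration of round-to-nearest symmetric quantization along one axis (per-channel when `axis` selects the channel dimension); the function names are my own, not from the paper:

```python
import numpy as np

def rtn_symmetric(x, n_bits=4, axis=0):
    """Round-to-nearest symmetric quantization.

    Each slice along `axis` shares one scale, chosen so the slice's
    absolute maximum maps to the top of the signed integer grid.
    """
    qmax = 2 ** (n_bits - 1) - 1  # 7 for INT4
    scale = np.abs(x).max(axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero slices
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    """Map integer codes back to (approximate) real values."""
    return q.astype(np.float32) * scale
```

With symmetric scaling the rounding error is at most half a quantization step (`scale / 2`) per element; sub-channel (group-wise) variants apply the same formula to contiguous groups of 128 channels, which is how the KV cache setting above differs from the per-channel one.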