SuFP: Piecewise Bit Allocation Floating-Point for Robust Neural Network Quantization

Authors: Geonwoo Ko, Sungyeob Yoo, Seri Ham, Seeyeon Kim, Minkyu Kim, Joo-Young Kim

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. Our experimental results demonstrate the robustness and accuracy of SuFP over various neural networks in the vision and natural language processing domains. Remarkably, SuFP shows its superiority in large models such as the large language model Llama 2 and the text-to-image generative model Stable Diffusion v2. We also verify training feasibility on ResNet models and highlight the structural design of SuFP for general applicability. This section evaluates the proposed SuFP. We comprehensively assess SuFP's performance by quantizing both weights and activations across various models, including vision, language, and generative tasks.
Researcher Affiliation: Collaboration. Geonwoo Ko (EMAIL), Korea Advanced Institute of Science and Technology (KAIST); Sungyeob Yoo (EMAIL), KAIST; Seri Ham (EMAIL), KAIST; Seeyeon Kim (EMAIL), KAIST; Minkyu Kim (EMAIL), KRAFTON; Joo-Young Kim (EMAIL), KAIST.
Pseudocode: Yes. Algorithm 1: SuFP Decoding Algorithm; Algorithm 2: scaling bias-optimal quantization flow.
Open Source Code: No. The paper does not provide concrete access to source code. It states, "We implement SuFP using PyTorch with Hugging Face transformer and TorchVision libraries," but does not provide a repository link or state that the paper's implementation is open-source or available in supplementary materials.
Open Datasets: Yes. For computer vision tasks, we benchmark our method on the ResNet-18 and ResNet-50 (He et al., 2016), Vision Transformer (ViT) (Dosovitskiy et al., 2020), and EfficientNet-v2 (Tan & Le, 2021) models with the ImageNet dataset (Deng et al., 2009a). For natural language tasks, we benchmark our method using the BERT-base model (Devlin et al., 2018) on datasets such as MRPC, CoLA (Warstadt et al., 2018), and SQuAD 2.0 (Rajpurkar et al., 2018). For text-to-image generative tasks, we benchmark our approach using Stable Diffusion v2 (Rombach et al., 2021) on the COCO dataset (Lin et al., 2014). For LLMs, we benchmark our method using the Llama 2 model (Touvron et al., 2023) on MMLU. We train image classifiers using ResNet-18 and ResNet-50 (He et al., 2016) backbones on the CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), ImageNet100 (Deng et al., 2009b), and Tiny-ImageNet (Le & Yang, 2015) datasets.
Dataset Splits: No. The paper mentions several datasets (e.g., ImageNet, MRPC, CoLA, SQuAD 2.0, COCO, MMLU, CIFAR-10, CIFAR-100, ImageNet100, Tiny-ImageNet) but does not explicitly provide training/test/validation split percentages, sample counts, or instructions for reproducing the data partitioning used in its experiments. It only mentions training classifiers for 100 epochs and setting margin terms for different datasets in the context of quantization.
Hardware Specification: No. The paper states, "All designs are synthesized using the Synopsys Design Compiler, optimized for the 28nm CMOS technology, and set to operate at 1 GHz clock frequency." This describes the synthesis environment for the hardware design of the SuFP Processing Element (PE), not the specific hardware (e.g., GPU or CPU models) used to run the neural network experiments or perform model training/inference.
Software Dependencies: No. The paper mentions using "PyTorch with Hugging Face transformer and TorchVision libraries" for implementation and "SystemVerilog to implement the SuFP PE and various configurations of FP8" with the "Synopsys Design Compiler" for synthesis. However, no specific version numbers are provided for PyTorch, the Hugging Face transformers library, TorchVision, SystemVerilog, or the Synopsys Design Compiler.
Experiment Setup: Yes. As additional implementation details for the training experiments described in Section 4.4, we trained the classifier for 100 epochs using an SGD optimizer with a learning rate of 0.1, momentum of 0.9, and weight decay of 5e-4. We applied loss scaling (Mellempudi et al., 2019), which allows small gradient values to be represented with a smaller bit-width representation. To dynamically find the scaling bias for tensor quantization, we set the bias as bias = floor(log2(max(abs(x)))) - margin. For weight and activation quantization, we set the margin term to 4 for the CIFAR-10 dataset and 2 for the CIFAR-100 dataset. For gradient quantization, we set the term to 1 for both datasets. We do not quantize the input of the first convolution and final fully connected layer to stabilize the training procedure (Mellempudi et al., 2019). We used a modified ResNet-18 architecture for the CIFAR-10/100 datasets; the first convolution layer is replaced by one with kernel size 3 and stride 1, and the last pooling layer is replaced by the identity function.
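The dynamic scaling-bias rule quoted in the experiment setup can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `scaling_bias` is a hypothetical helper name, `x` is assumed to be a flat list of tensor values, and the margin values follow the settings quoted above (4 for CIFAR-10 weights/activations, 1 for gradients).

```python
import math

def scaling_bias(x, margin):
    # bias = floor(log2(max(abs(x)))) - margin, computed per tensor
    max_abs = max(abs(v) for v in x)
    return math.floor(math.log2(max_abs)) - margin

# Tensor whose largest magnitude is 6.0: floor(log2(6.0)) = 2.
print(scaling_bias([1.5, -6.0, 0.25], margin=4))  # CIFAR-10 weights/activations: 2 - 4 = -2
print(scaling_bias([1.5, -6.0, 0.25], margin=1))  # gradients: 2 - 1 = 1
```

A larger margin shifts the representable range downward, trading headroom for the tensor's maximum value against precision for small values, which is why the paper uses a smaller margin for gradients than for weights and activations.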