Optimizing Large Language Model Training Using FP4 Quantization

Authors: Ruizhe Wang, Yeyun Gong, Xiao Liu, Guoshuai Zhao, Ziyue Yang, Baining Guo, Zheng-Jun Zha, Peng Cheng

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results demonstrate that our FP4 framework achieves accuracy comparable to BF16 and FP8, with minimal degradation, scaling effectively to 13B-parameter LLMs trained on up to 100B tokens. With the emergence of next-generation hardware supporting FP4, our framework sets a foundation for efficient ultra-low precision training.
Researcher Affiliation Collaboration Ruizhe Wang (1,2), Yeyun Gong (3,2), Xiao Liu (3,2), Guoshuai Zhao (3,2), Ziyue Yang (3,2), Baining Guo (3), Zheng-Jun Zha (1), Peng Cheng (3,2). 1: University of Science and Technology of China; 2: Microsoft SIGMA Team; 3: Microsoft Research Asia.
Pseudocode Yes The following code snippet shows the implementation of the quantization kernel: `__global__ void quantize_kernel(const float* x, float* output, int x_size) { int idx = blockIdx.x * blockDim.x + threadIdx.x; if (idx < x_size) { float value = x[idx]; ... }`
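Since the paper's CUDA kernel is only partially shown, the following is a minimal CPU sketch of what an FP4 quantization step could look like. It assumes the E2M1 format (1 sign, 2 exponent, 1 mantissa bit, whose representable magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, 6) with per-tensor absmax scaling and round-to-nearest; the function name and the scaling scheme are illustrative assumptions, not the authors' exact kernel.

```python
# Representable magnitudes of the FP4 E2M1 format.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_fp4(values):
    """Scale values into the FP4 dynamic range, then round each one to
    the nearest representable E2M1 magnitude (sign handled separately).
    Returns (quantized values on the FP4 grid, scale factor)."""
    absmax = max((abs(v) for v in values), default=0.0)
    if absmax == 0.0:
        return [0.0] * len(values), 1.0
    scale = 6.0 / absmax  # map the largest magnitude onto the max FP4 value
    quantized = []
    for v in values:
        sign = -1.0 if v < 0 else 1.0
        m = abs(v) * scale
        q = min(E2M1_GRID, key=lambda g: abs(g - m))  # round to nearest
        quantized.append(sign * q)
    return quantized, scale

q, scale = quantize_fp4([0.1, -0.7, 2.5])
# dividing q by `scale` recovers the dequantized tensor
```

Dequantization is a single multiply by `1/scale`, which is why per-tensor (or per-block) scale factors are kept in higher precision alongside the FP4 payload.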
Open Source Code Yes Our training framework can be found at aka.ms/MS.AMP.
Open Datasets Yes The training is conducted from scratch using the DCLM dataset (Li et al., 2024), a comprehensive dataset well-suited for language model pretraining. ... We evaluate the models on a diverse set of downstream task datasets in a zero-shot manner, including ARC (Clark et al., 2018), BoolQ (Clark et al., 2019), HellaSwag (Zellers et al., 2019), LogiQA (Liu et al., 2021), PiQA (Bisk et al., 2020), SciQ (Welbl et al., 2017), OpenbookQA (ObQA) (Mihaylov et al., 2018), and Lambada (Paperno et al., 2016). ... Table 3 further presents the perplexity (PPL) evaluation results for several downstream datasets including Lambada OpenAI (Lbd.OAI), Lambada standard (Lbd.std) (Paperno et al., 2016), the Pile 10k (Gao et al., 2020), and Wikitext (Merity et al., 2017).
Dataset Splits No The paper mentions using a dataset for training, specifically the DCLM dataset, and evaluates on other datasets in a zero-shot manner. It describes input sequence length and batch size for training, but does not provide specific splits (e.g., percentages or counts) for training, validation, or test sets for the main DCLM dataset or for how the downstream task datasets were used beyond 'zero-shot'.
Hardware Specification Yes Leveraging the FP8 tensor cores of NVIDIA H100 GPUs to emulate FP4 computations, we train LLMs with up to 13B parameters and 100B training tokens, with a minor training loss gap. For zero-shot evaluation on downstream tasks, models trained with FP4 show competitive results against BF16 models. We anticipate better speed performance gains with the availability of next-generation hardware like NVIDIA's B-series GPUs.
Software Dependencies No The paper mentions using a 'custom CUDA kernel' for FP4 quantization and the 'lm-evaluation-harness library' for evaluation, but it does not provide specific version numbers for these or any other software components.
Experiment Setup Yes Hyperparameters remain consistent across precision settings for fair comparison. The learning rate follows a warm-up and cosine decay schedule, with the warm-up phase spanning 5% of total steps and the learning rate gradually decreasing to 10% of its peak over the remaining 90%. The peak learning rate is 3e-4, with a weight decay of 0.1. For the Adam optimizer, we use β1 = 0.9, β2 = 0.95, and ϵ = 1e-8. For special hyperparameters used in FP4 method, we use k = 5 for differentiable gradient estimator and select α = 0.99 as the activation clamp and compensation quantile. Input sequences are fixed at 2048 tokens, and the batch size is 2048, comprising approximately 4M tokens.
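The schedule and batch arithmetic described above can be sketched as follows. This is a minimal illustration, assuming linear warm-up and a standard cosine decay to 10% of the peak; the function name and step convention are assumptions, not the authors' training code.

```python
import math

def learning_rate(step, total_steps, peak=3e-4, warmup_frac=0.05, floor_frac=0.10):
    """Linear warm-up over the first 5% of steps to peak 3e-4, then
    cosine decay down to 10% of the peak over the remaining steps."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps       # linear warm-up
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    floor = peak * floor_frac                          # 10% of peak
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

# Batch arithmetic from the setup: 2048-token sequences x 2048 sequences/batch
tokens_per_batch = 2048 * 2048  # ~4M tokens, as stated
```

Note that the schedule is continuous at the warm-up boundary: the cosine term starts at the peak (progress = 0) and ends at the 3e-5 floor.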