Optimizing Large Language Model Training Using FP4 Quantization
Authors: Ruizhe Wang, Yeyun Gong, Xiao Liu, Guoshuai Zhao, Ziyue Yang, Baining Guo, Zheng-Jun Zha, Peng Cheng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that our FP4 framework achieves accuracy comparable to BF16 and FP8, with minimal degradation, scaling effectively to 13B-parameter LLMs trained on up to 100B tokens. With the emergence of next-generation hardware supporting FP4, our framework sets a foundation for efficient ultra-low precision training. |
| Researcher Affiliation | Collaboration | Ruizhe Wang (1,2), Yeyun Gong (3,2), Xiao Liu (3,2), Guoshuai Zhao (3,2), Ziyue Yang (3,2), Baining Guo (3), Zheng-Jun Zha (1), Peng Cheng (3,2). 1: University of Science and Technology of China; 2: Microsoft SIGMA Team; 3: Microsoft Research Asia. |
| Pseudocode | Yes | The following code paragraph shows the implementation of the quantization kernel: `__global__ void quantize_kernel(const float* x, float* output, int x_size) { int idx = blockIdx.x * blockDim.x + threadIdx.x; if (idx < x_size) { float value = x[idx]; ... }` |
| Open Source Code | Yes | Our training framework can be found at aka.ms/MS.AMP. |
| Open Datasets | Yes | The training is conducted from scratch using the DCLM dataset (Li et al., 2024), a comprehensive dataset well-suited for language model pretraining. ... We evaluate the models on a diverse set of downstream tasks datasets in a zero-shot manner, including ARC (Clark et al., 2018), BoolQ (Clark et al., 2019), HellaSwag (Zellers et al., 2019), LogiQA (Liu et al., 2021), PIQA (Bisk et al., 2020), SciQ (Welbl et al., 2017), OpenbookQA (ObQA) (Mihaylov et al., 2018), and Lambada (Paperno et al., 2016). ... Table 3 further presents the perplexity (PPL) evaluation results for several downstream datasets including Lambada OpenAI (Lbd.OAI), Lambada standard (Lbd.std) (Paperno et al., 2016), the Pile 10k (Gao et al., 2020) and Wikitext (Merity et al., 2017). |
| Dataset Splits | No | The paper mentions using a dataset for training, specifically the DCLM dataset, and evaluates on other datasets in a zero-shot manner. It describes input sequence length and batch size for training, but does not provide specific splits (e.g., percentages or counts) for training, validation, or test sets for the main DCLM dataset or for how the downstream task datasets were used beyond 'zero-shot'. |
| Hardware Specification | Yes | Leveraging the FP8 tensor cores of NVIDIA H100 GPUs to emulate FP4 computations, we train LLMs with up to 13B parameters and 100B training tokens, with a minor training loss gap. For zero-shot evaluation on downstream tasks, models trained with FP4 show competitive results against BF16 models. We anticipate better speed performance gains with the availability of next-generation hardware like NVIDIA's B-series GPUs. |
| Software Dependencies | No | The paper mentions using a 'custom CUDA kernel' for FP4 quantization and the 'lm-evaluation-harness library' for evaluation, but it does not provide specific version numbers for these or any other software components. |
| Experiment Setup | Yes | Hyperparameters remain consistent across precision settings for fair comparison. The learning rate follows a warm-up and cosine decay schedule, with the warm-up phase spanning 5% of total steps and the learning rate gradually decreasing to 10% of its peak over the remaining steps. The peak learning rate is 3e-4, with a weight decay of 0.1. For the Adam optimizer, we use β1 = 0.9, β2 = 0.95, and ϵ = 1e-8. For the special hyperparameters used in the FP4 method, we use k = 5 for the differentiable gradient estimator and select α = 0.99 as the activation clamp and compensation quantile. Input sequences are fixed at 2048 tokens, and the batch size is 2048, comprising approximately 4M tokens. |
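The quantization step the paper's CUDA kernel implements can be simulated off-GPU. The sketch below is a minimal NumPy illustration, assuming the common E2M1 FP4 value grid and combining it with the α = 0.99 quantile clamp mentioned in the setup; the grid choice and the absmax scaling rule are assumptions, not the paper's exact kernel.

```python
import numpy as np

# Representable magnitudes of the E2M1 FP4 format (assumed layout; the
# paper's exact FP4 format may differ).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(x, alpha=0.99):
    """Simulated FP4 quantization: quantile clamp, scale, round-to-nearest."""
    x = np.asarray(x, dtype=np.float64)
    # Clamp outliers at the alpha-quantile of |x| (the setup uses alpha = 0.99
    # as the activation clamp quantile).
    clip = np.quantile(np.abs(x), alpha)
    xc = np.clip(x, -clip, clip)
    # Absmax scaling: map the clamp value onto the largest FP4 magnitude.
    scale = clip / FP4_GRID[-1] if clip > 0 else 1.0
    # Round each scaled magnitude to the nearest representable grid point.
    mags = np.abs(xc) / scale
    idx = np.abs(mags[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(xc) * FP4_GRID[idx] * scale

x = np.random.randn(4096)
xq = quantize_fp4(x)
# Every quantized magnitude lies on the 8-value FP4 grid (up to the scale).
assert np.unique(np.round(np.abs(xq) / (np.quantile(np.abs(np.clip(x, -np.quantile(np.abs(x), 0.99), np.quantile(np.abs(x), 0.99))), 1.0) / 6.0), 6)).size <= 8
```

Note that the backward pass is not shown: the paper pairs this forward quantization with a differentiable gradient estimator (k = 5) rather than a plain straight-through estimator.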