Scaling Laws for Floating–Point Quantization Training

Authors: Xingwu Sun, Shuaipeng Li, Ruobing Xie, Weidong Han, Kan Wu, Zhen Yang, Yixing Li, An Wang, Shuai Li, Jinbao Xue, Yu Cheng, Yangyu Tao, Zhanhui Kang, Cheng-Zhong Xu, Di Wang, Jie Jiang

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this paper, we thoroughly explore the effects of FP quantization targets, exponent bits, mantissa bits, and the calculation granularity of the scaling factor on the FP quantization training performance of LLMs. In addition to an accurate unified scaling law for FP quantization, we also provide valuable suggestions for the community: ... we carefully design a comprehensive set of explorations with experiments across different precision settings (training 366 models), exploring the basic scaling law formulation as well as the potential impact of the quantization targets, exponent and mantissa bits, and block sizes on the loss. Finally, we aggregate these factors to obtain our final scaling law for FP quantized training, with valuable insights to guide LLM training under low precision. Figure 1 illustrates the fitting results of our Capybara scaling law compared with others, demonstrating our advantage in predicting LLM performance under different float quantized training settings.
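The knobs the response enumerates (exponent bits, mantissa bits, and the block-size granularity of the scaling factor) can be made concrete with a minimal NumPy sketch of simulated FP quantization. This is a hypothetical illustration, not the paper's QPyTorch-based setup: the function name, the per-block max-abs scaling, and the round-to-nearest mantissa scheme are all assumptions.

```python
import numpy as np

def fp_quantize(x, exp_bits=4, man_bits=3, block_size=32):
    """Simulate casting x to a low-precision float format with `exp_bits`
    exponent bits and `man_bits` mantissa bits, using one scaling factor
    per block of `block_size` elements (illustrative sketch only)."""
    x = np.asarray(x, dtype=np.float64).ravel()
    pad = (-len(x)) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)

    # Largest finite magnitude of the target format (no subnormals/inf/nan).
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = 2 ** exp_bits - 1 - bias
    fmax = (2 - 2.0 ** (-man_bits)) * 2.0 ** max_exp

    # One scale per block maps the block's max magnitude onto fmax.
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = np.where(amax > 0, amax / fmax, 1.0)
    y = blocks / scale

    # Round to nearest: keep `man_bits` fractional bits of the significand.
    e = np.floor(np.log2(np.maximum(np.abs(y), 2.0 ** (-bias))))
    step = 2.0 ** (e - man_bits)
    q = np.clip(np.round(y / step) * step, -fmax, fmax)

    return (q * scale).reshape(-1)[:len(x)]
```

With a wide format (e.g. E5M10, similar to fp16) the round trip is nearly lossless; shrinking `exp_bits`/`man_bits` or growing `block_size` increases the quantization error, which is the trade-off the paper's scaling law models.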
Researcher Affiliation Collaboration 1Tencent Hunyuan, 2University of Macau, 3The Chinese University of Hong Kong, 4Institute of Science Tokyo. Correspondence to: Shuaipeng Li <EMAIL>, Chengzhong Xu <EMAIL>, Di Wang <EMAIL>.
Pseudocode No The paper describes mathematical formulations and quantization methods in narrative text and equations, but does not present any structured pseudocode blocks or algorithms.
Open Source Code No The paper mentions simulating methods using QPyTorch (Zhang et al., 2019), which is a third-party tool. However, there is no explicit statement or link indicating that the authors' own implementation code for the described methodology is open-source or publicly available.
Open Datasets Yes We trained and evaluated a range of LLaMA (Dubey et al., 2024) architecture models on a subset of the Dolma V1.7 dataset (Soldaini et al., 2024), using the same sampling proportion as for the OLMo 7B-v1.7 model (Groeneveld et al., 2024).
Dataset Splits No The paper states: "Our experiments systematically explored language model pretraining across N ∈ {41, 85, 154, 679} million parameters and D ∈ {10, 20, 50, 100} billion tokens." and "using the same sampling proportion as for the OLMo 7B-v1.7 model (Groeneveld et al., 2024)." While it mentions parameter and token counts, it does not provide specific train/validation/test splits, their percentages, or absolute counts for reproducibility.
Hardware Specification No The paper does not provide specific details about the hardware used for running the experiments. It refers to "modern hardware" in a general context but lacks information on specific GPU models, CPU types, or other hardware specifications.
Software Dependencies No The paper mentions using "QPyTorch (Zhang et al., 2019)" for simulation, and AdamW as an optimizer, but it does not specify version numbers for these software components or any other libraries crucial for replication.
Experiment Setup Yes Detailed hyperparameters and ablation studies are provided in Table 1 and Table 3. Table 1, "Model hyper-parameters for each size," includes specific values for Layers, Hidden Size, FFN Hidden Size, Attention Heads, Attention Head size, Optimizer (AdamW with β1=0.9, β2=0.95, ε=1e-8), Weight Decay (0.1), Clip Grad Norm (1.0), Max LR (3.0e-4), Min LR (0), LR Decay (Cosine Decay Rate 10%), Sequence Length (2048), Batch Size (# Tokens) (2M), and Warmup Steps (500).
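The quoted Table 1 values can be collected into a single configuration with the warmup-plus-cosine learning-rate schedule they imply. The numbers below are taken from the quoted table; the schedule function itself is a sketch of a standard linear-warmup cosine decay, not the authors' implementation, and the interpretation of "Cosine Decay Rate 10%" is uncertain (it is recorded but unused here).

```python
import math

# Hyperparameters as quoted from Table 1 of the paper.
CONFIG = {
    "optimizer": "AdamW", "beta1": 0.9, "beta2": 0.95, "eps": 1e-8,
    "weight_decay": 0.1, "clip_grad_norm": 1.0,
    "max_lr": 3.0e-4, "min_lr": 0.0,
    "cosine_decay_rate": 0.10,  # "Cosine Decay Rate 10%"; interpretation uncertain, unused below
    "sequence_length": 2048, "batch_tokens": 2_000_000, "warmup_steps": 500,
}

def lr_at(step, total_steps, cfg=CONFIG):
    """Linear warmup to max_lr over warmup_steps, then cosine decay to min_lr."""
    if step < cfg["warmup_steps"]:
        return cfg["max_lr"] * step / cfg["warmup_steps"]
    t = (step - cfg["warmup_steps"]) / max(1, total_steps - cfg["warmup_steps"])
    return cfg["min_lr"] + 0.5 * (cfg["max_lr"] - cfg["min_lr"]) * (1 + math.cos(math.pi * t))
```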