Scaling FP8 training to trillion-token LLMs

Authors: Maxim Fishman, Brian Chmiel, Ron Banner, Daniel Soudry

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental — "We train, for the first time, large language models using FP8 precision on datasets up to 2 trillion tokens... we successfully train a 7B parameter model using FP8 precision on 256 Intel Gaudi2 accelerators, achieving on-par results with the BF16 baseline while delivering up to a 34% throughput improvement." Section 6 (Experiments): "We conducted extensive experiments to evaluate the effectiveness of our proposed FP8 training method for Large Language Models (LLMs) across various scales."
Researcher Affiliation: Collaboration — Maxim Fishman, Brian Chmiel, Ron Banner, and Daniel Soudry are affiliated with Intel, Israel, and with the Department of Electrical and Computer Engineering, Technion, Haifa, Israel.
Pseudocode: No — The paper describes the Smooth-SwiGLU modification and the FP8 optimizer methodology in prose and mathematical equations (e.g., Equation 3 and the scaling-factor computation steps in Section 4.4), but includes no explicit pseudocode or algorithm blocks.
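Since the paper gives no pseudocode, the following is a hedged toy sketch of the Smooth-SwiGLU idea as we understand it from the prose: a per-channel smoothing factor `s` scales the linear branch before the FP8 cast and is divided back out afterwards, taming outlier channels while leaving the exact-arithmetic SwiGLU result unchanged. The `fake_quant` rounding quantizer stands in for a real E4M3 cast; the function names and the explicit run-time scaling are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def swish(x):
    # Swish / SiLU activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def fake_quant(x, fmt_max=448.0):
    # Toy stand-in for an FP8 E4M3 cast: rescale so the tensor's amax
    # maps to the format's largest value (448 for E4M3), round, rescale back.
    amax = np.max(np.abs(x)) + 1e-12
    scale = fmt_max / amax
    return np.round(x * scale) / scale

def smooth_swiglu(x, w1, w2, w3, s):
    """Toy Smooth-SwiGLU sketch (illustrative, not the paper's code):
    SwiGLU(x) = (Swish(x @ w1) * (x @ w2)) @ w3, with per-channel
    factors s applied to the linear branch before quantization and
    divided out afterwards, so in exact arithmetic the output is
    identical to plain SwiGLU."""
    gate = swish(x @ w1)
    lin = fake_quant((x @ w2) * s) / s
    return (gate * lin) @ w3
```

In the paper, as we read Section 4, the scaling factors are folded into the adjacent weight matrices rather than applied at run time; the sketch keeps them explicit for clarity.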
Open Source Code: Yes — "A reference implementation is supplied in https://github.com/Anonymous1252022/Megatron-DeepSpeed." The paper's abstract links this anonymous GitHub repository, which contains the code and the details necessary to reproduce the experiments.
Open Datasets: Yes — "We trained the models on the open-source RedPajama dataset (Computer, 2023) for 2 trillion tokens..." Reference: Together Computer. RedPajama: an open dataset for training large language models, 2023. URL https://github.com/togethercomputer/RedPajama-Data.
Dataset Splits: No — The paper states that models were trained on the "open-source RedPajama dataset (Computer, 2023) for 2 trillion tokens," but does not specify how the dataset was split into training, validation, or test sets for the main training run. It reports zero-shot performance on downstream tasks, but those are separate benchmark datasets, not splits of RedPajama itself.
Hardware Specification: Yes — "We successfully train a 7B parameter model using FP8 precision on 256 Intel Gaudi2 accelerators." "All training was conducted on 256 Intel Gaudi2 devices." "The measurements were done on 8 Intel Gaudi2 devices."
Software Dependencies: No — The paper mentions Megatron-DeepSpeed in the context of the code repository, but provides no version numbers for software dependencies such as Python, PyTorch, or other libraries used in the experiments.
Experiment Setup: Yes — "We trained the models on the open-source RedPajama dataset (Computer, 2023) for 2 trillion tokens, maintaining hyperparameters consistent with Touvron et al. (2023)." The FP8 model was trained in the standard format (Micikevicius et al., 2022): a high-precision weight matrix is retained, with quantization to E4M3 for the forward pass and E5M2 for the backward pass using delayed scaling, similar to Nvidia's Transformer Engine. On the FP8 optimizer: "Our investigation revealed that different precision requirements exist for each moment: 1. First Moment: The E4M3 format... 2. Second Moment: The E5M2 format..." Micro batch size: 1 (from Table 3).
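The delayed-scaling recipe quoted above (E4M3 forward, E5M2 backward) can be sketched as follows. This is a toy illustration of the general Transformer-Engine-style scheme under our own assumptions, not the repository's code: the quantization scale for the current step is derived from the amax history of previous steps, and a rounding quantizer stands in for a real FP8 cast.

```python
import numpy as np
from collections import deque

E4M3_MAX = 448.0    # largest finite E4M3 value (forward-pass tensors)
E5M2_MAX = 57344.0  # largest finite E5M2 value (backward-pass gradients)

class DelayedScaler:
    """Toy delayed scaling: the scale for step t comes from the amax
    recorded over recent previous steps, so the cast does not have to
    wait for the current tensor's own statistics."""
    def __init__(self, fmt_max, history_len=16):
        self.fmt_max = fmt_max
        self.amax_history = deque(maxlen=history_len)

    def quantize(self, x):
        # Bootstrap from the current tensor on the very first call only.
        amax = max(self.amax_history) if self.amax_history else np.max(np.abs(x))
        scale = self.fmt_max / (amax + 1e-12)
        # Round and clip to the format's range, then rescale back.
        q = np.clip(np.round(x * scale), -self.fmt_max, self.fmt_max) / scale
        self.amax_history.append(float(np.max(np.abs(x))))  # for future steps
        return q

# One scaler per tensor role, matching the quoted recipe.
fwd_scaler = DelayedScaler(E4M3_MAX)   # forward pass: E4M3
bwd_scaler = DelayedScaler(E5M2_MAX)   # backward pass: E5M2
```

The asymmetric choice of formats mirrors the quoted setup: E4M3 trades range for precision (forward activations and weights), while E5M2 trades precision for range (backward gradients, which have wider dynamic range).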