Scaling FP8 training to trillion-token LLMs
Authors: Maxim Fishman, Brian Chmiel, Ron Banner, Daniel Soudry
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train, for the first time, large language models using FP8 precision on datasets up to 2 trillion tokens... we successfully train a 7B parameter model using FP8 precision on 256 Intel Gaudi2 accelerators, achieving on-par results with the BF16 baseline while delivering up to a 34% throughput improvement. 6 EXPERIMENTS We conducted extensive experiments to evaluate the effectiveness of our proposed FP8 training method for Large Language Models (LLMs) across various scales. |
| Researcher Affiliation | Collaboration | Maxim Fishman, Brian Chmiel, Ron Banner, Daniel Soudry — Intel, Israel; Department of Electrical and Computer Engineering, Technion, Haifa, Israel |
| Pseudocode | No | The paper describes the Smooth-SwiGLU modification and FP8 optimizer methodology in prose and mathematical equations (e.g., Equation 3, scaling factor computation steps in Section 4.4), but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | A reference implementation is supplied in https://github.com/Anonymous1252022/Megatron-DeepSpeed. Reproducibility: The abstract of the paper provides a link to an anonymous GitHub repository (https://github.com/Anonymous1252022/Megatron-DeepSpeed) containing all the code and necessary details for reproducing the experiments. |
| Open Datasets | Yes | We trained the models on the open-source RedPajama dataset (Computer, 2023) for 2 trillion tokens... Together Computer. RedPajama: an open dataset for training large language models, 2023. URL https://github.com/togethercomputer/RedPajama-Data. |
| Dataset Splits | No | The paper states that models were trained on the 'open-source RedPajama dataset (Computer, 2023) for 2 trillion tokens' but does not specify how this dataset was split into training, validation, or test sets for the main training process. It mentions evaluation on 'downstream tasks' for zero-shot performance, but this refers to separate benchmark datasets, not splits of the RedPajama dataset itself. |
| Hardware Specification | Yes | We successfully train a 7B parameter model using FP8 precision on 256 Intel Gaudi2 accelerators. All training was conducted on 256 Intel Gaudi2 devices. The measurements were done on 8 Intel Gaudi2 devices. |
| Software Dependencies | No | The paper mentions 'Megatron-DeepSpeed' in the context of the code repository. However, it does not provide specific version numbers for software dependencies such as Python, PyTorch, or other libraries used in the experiments. |
| Experiment Setup | Yes | We trained the models on the open-source RedPajama dataset (Computer, 2023) for 2 trillion tokens, maintaining hyperparameters consistent with Touvron et al. (2023). The FP8 model was trained using the standard format (Micikevicius et al., 2022), which includes saving a high-precision weight matrix and quantization to E4M3 for the forward phase and E5M2 for the backward phase with delayed scaling, similar to Nvidia's Transformer Engine. Our investigation revealed that different precision requirements exist for each moment: 1. First Moment: The E4M3 format... 2. Second Moment: The E5M2 format... Micro BS 1 (from Table 3). |
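To make the quoted setup concrete, below is a minimal NumPy sketch of FP8 quantize/dequantize with delayed scaling: E4M3 for the forward pass and E5M2 for the backward pass, where the scale for the current step comes from max-abs values recorded in earlier steps. The format maxima follow the OCP FP8 definition (E4M3 max normal 448, E5M2 max normal 57344); `DelayedScaler` and `quantize_fp8` are hypothetical helper names for illustration, not the paper's or Transformer Engine's actual API, and subnormal handling is omitted.

```python
import numpy as np

# Max representable normal values per FP8 format (OCP FP8 spec).
FP8_MAX = {"e4m3": 448.0, "e5m2": 57344.0}


def quantize_fp8(x, fmt, scale):
    """Simulated FP8 round trip: scale, clamp to the format's range,
    round the mantissa to the format's fraction width, then unscale.
    (Illustrative only; subnormals are not modeled.)"""
    mantissa_bits = 3 if fmt == "e4m3" else 2  # E4M3 has 3 fraction bits, E5M2 has 2
    xs = np.clip(x * scale, -FP8_MAX[fmt], FP8_MAX[fmt])
    # Decompose as m * 2**e with |m| in [0.5, 1); representable mantissas
    # are then multiples of 2**-(mantissa_bits + 1).
    m, e = np.frexp(xs)
    m = np.round(m * 2 ** (mantissa_bits + 1)) / 2 ** (mantissa_bits + 1)
    return np.ldexp(m, e) / scale


class DelayedScaler:
    """Delayed scaling: derive this step's scale from an amax history of
    previous steps, rather than from the current tensor (which would
    require an extra pass). Hypothetical helper for illustration."""

    def __init__(self, fmt, history_len=16):
        self.fmt = fmt
        self.history_len = history_len
        self.history = []

    def scale(self):
        amax = max(self.history) if self.history else 1.0
        return FP8_MAX[self.fmt] / amax

    def update(self, x):
        # Record the current tensor's max-abs for use in later steps.
        self.history.append(float(np.abs(x).max()))
        self.history = self.history[-self.history_len:]


# Usage: one scaler per tensor role, matching the quoted setup.
fwd_scaler = DelayedScaler("e4m3")   # activations/weights in the forward pass
bwd_scaler = DelayedScaler("e5m2")   # gradients in the backward pass

grad = np.array([0.3, 1.0, -2.5])
bwd_scaler.update(grad)                                  # amax history for next step
g_q = quantize_fp8(grad, "e5m2", bwd_scaler.scale())     # quantized gradient
```

The wider-range E5M2 format in the backward pass reflects the quoted observation that gradients need more dynamic range than mantissa precision, while E4M3 trades range for precision in the forward pass.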