µnit Scaling: Simple and Scalable FP8 LLM Training
Authors: Saaketh Narayan, Abhay Gupta, Mansheej Paul, Davis Blalock
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our method by training models with parameters ranging from 1B to 13B, performing all hidden linear layer computations in FP8. We achieve quality equal to higher-precision baselines while also training up to 33% faster. |
| Researcher Affiliation | Industry | 1Work done while at Databricks Mosaic Research 2Databricks Mosaic Research, San Francisco, CA. Correspondence to: Saaketh Narayan <EMAIL>, Davis Blalock <EMAIL>. |
| Pseudocode | No | The paper contains mathematical equations, propositions, and descriptions of methods, but no clearly labeled 'Pseudocode' or 'Algorithm' blocks, nor any structured code-like procedures. |
| Open Source Code | No | The paper mentions using "Databricks Mosaic ML LLMFoundry (Mosaic ML, 2022a), Composer (Mosaic ML, 2021), and Streaming (Mosaic ML, 2022b) libraries" which are third-party tools. However, it does not contain an explicit statement by the authors that they are releasing the source code for the µnit Scaling methodology described in this paper, nor does it provide a direct link to a code repository for their specific implementation. |
| Open Datasets | No | The paper mentions training models on "approximately compute-optimal token budgets" and evaluating them using the "Databricks Model Gauntlet" (Dohmann, 2023; Barton, 2024). While the Model Gauntlet is a benchmark, the specific datasets used for the main training of LLMs (e.g., common LLM training datasets like C4 or The Pile) are not named, nor is concrete access information (links, DOIs, specific citations for the datasets themselves) provided for any training dataset used by the authors. |
| Dataset Splits | No | The paper describes training configurations like "10,000 training steps with a global batch size of 64 and sequence length of 1024" and "compute-optimal token budgets". It also mentions evaluating models on specific tasks in the "Databricks Model Gauntlet". However, it does not specify explicit training, validation, and test dataset splits (e.g., percentages, sample counts, or citations to predefined splits) for the data used in their experiments. The evaluation tasks use their own internal splits, but the paper does not specify how the main training data was split for their models. |
| Hardware Specification | Yes | All models were trained on Nvidia H100 GPUs using the Databricks Mosaic ML LLMFoundry (Mosaic ML, 2022a), Composer (Mosaic ML, 2021), and Streaming (Mosaic ML, 2022b) libraries. [...] All models were benchmarked on 64 NVIDIA H100 GPUs, and characteristics such as batch size and distributed training configuration were held constant. |
| Software Dependencies | No | The paper mentions using the "Lion optimizer (Chen et al., 2023)" and the "Databricks Mosaic ML LLMFoundry (Mosaic ML, 2022a), Composer (Mosaic ML, 2021), and Streaming (Mosaic ML, 2022b) libraries", as well as the "cublasLtMatmul() operation (NVIDIA Corporation, 2024)" and a "Triton (Tillet et al., 2019) kernel". However, it does not provide specific version numbers for any of these software components. |
| Experiment Setup | Yes | We train 1B, 3B, 7B, and 13B parameter LLMs on approximately compute-optimal token budgets (∼20x token-to-parameter ratio) using SP and µS, and in both BF16 and FP8, resulting in 4 individual models for each model size. The training configurations are detailed in Table 4. [...] All models use multi-headed attention (Vaswani et al., 2017) and were trained for 10,000 training steps with a global batch size of 64 and sequence length of 1024 (i.e., 655M total tokens). [...] we used the Lion optimizer (Chen et al., 2023) with fully decoupled weight decay and a cosine learning rate schedule decaying to 10% of the maximum learning rate. [...] Table 4. Large model training configurations. Model training configurations for 1B, 3B, 7B, and 13B models. Only µS models use the residual coefficient τ, which is dictated by model depth using results in Appendix A.3. (Includes columns for Model Params, Tokens, TPR, Steps, Batch Sz., Seq. Len., Width, Depth, # Heads, τ) |
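The experiment setup above specifies a cosine learning rate schedule "decaying to 10% of the maximum learning rate". As a minimal sketch of what that schedule looks like, assuming the standard cosine-annealing form (the function name and signature here are illustrative, not from the paper):

```python
import math

def cosine_lr(step: int, max_steps: int, max_lr: float, final_frac: float = 0.1) -> float:
    """Cosine learning-rate schedule decaying from max_lr to final_frac * max_lr.

    Hypothetical helper illustrating the schedule described in the paper:
    starts at max_lr, follows a half-cosine, and bottoms out at 10% of max_lr.
    """
    min_lr = final_frac * max_lr
    progress = min(step / max_steps, 1.0)  # fraction of training completed
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# At step 0 the rate equals max_lr; at max_steps it has decayed to 10% of max_lr.
```

For the paper's large-model runs this would be evaluated over the 10,000 training steps listed in Table 4.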