µnit Scaling: Simple and Scalable FP8 LLM Training
Authors: Saaketh Narayan, Abhay Gupta, Mansheej Paul, Davis Blalock
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our method by training models with parameters ranging from 1B to 13B, performing all hidden linear layer computations in FP8. We achieve quality equal to higher-precision baselines while also training up to 33% faster. |
| Researcher Affiliation | Industry | 1Work done while at Databricks Mosaic Research 2Databricks Mosaic Research, San Francisco, CA. Correspondence to: Saaketh Narayan <EMAIL>, Davis Blalock <EMAIL>. |
| Pseudocode | No | The paper contains mathematical equations, propositions, and descriptions of methods, but no clearly labeled 'Pseudocode' or 'Algorithm' blocks, nor any structured code-like procedures. |
| Open Source Code | No | The paper mentions using "Databricks Mosaic ML LLMFoundry (Mosaic ML, 2022a), Composer (Mosaic ML, 2021), and Streaming (Mosaic ML, 2022b) libraries" which are third-party tools. However, it does not contain an explicit statement by the authors that they are releasing the source code for the µnit Scaling methodology described in this paper, nor does it provide a direct link to a code repository for their specific implementation. |
| Open Datasets | No | The paper mentions training models on "approximately compute-optimal token budgets" and evaluating them using the "Databricks Model Gauntlet" (Dohmann, 2023; Barton, 2024). While the Model Gauntlet is a benchmark, the specific datasets used for the main training of LLMs (e.g., common LLM training datasets like C4 or The Pile) are not named, nor is concrete access information (links, DOIs, specific citations for the datasets themselves) provided for any training dataset used by the authors. |
| Dataset Splits | No | The paper describes training configurations like "10,000 training steps with a global batch size of 64 and sequence length of 1024" and "compute-optimal token budgets". It also mentions evaluating models on specific tasks in the "Databricks Model Gauntlet". However, it does not specify explicit training, validation, and test dataset splits (e.g., percentages, sample counts, or citations to predefined splits) for the data used in their experiments. The evaluation tasks use their own internal splits, but the paper does not specify how the main training data was split for their models. |
| Hardware Specification | Yes | All models were trained on Nvidia H100 GPUs using the Databricks Mosaic ML LLMFoundry (Mosaic ML, 2022a), Composer (Mosaic ML, 2021), and Streaming (Mosaic ML, 2022b) libraries. [...] All models were benchmarked on 64 NVIDIA H100 GPUs, and characteristics such as batch size and distributed training configuration were held constant. |
| Software Dependencies | No | The paper mentions using the "Lion optimizer (Chen et al., 2023)" and the "Databricks Mosaic ML LLMFoundry (Mosaic ML, 2022a), Composer (Mosaic ML, 2021), and Streaming (Mosaic ML, 2022b) libraries", as well as the "cublasLtMatmul() operation (NVIDIA Corporation, 2024)" and a "Triton (Tillet et al., 2019) kernel". However, it does not provide specific version numbers for any of these software components. |
| Experiment Setup | Yes | We train 1B, 3B, 7B, and 13B parameter LLMs on approximately compute-optimal token budgets (∼20x token-to-parameter ratio) using SP and µS, and in both BF16 and FP8, resulting in 4 individual models for each model size. The training configurations are detailed in Table 4. [...] All models use multi-headed attention (Vaswani et al., 2017) and were trained for 10,000 training steps with a global batch size of 64 and sequence length of 1024 (i.e., 655M total tokens). [...] we used the Lion optimizer (Chen et al., 2023) with fully decoupled weight decay and a cosine learning rate schedule decaying to 10% of the maximum learning rate. [...] Table 4. Large model training configurations. Model training configurations for 1B, 3B, 7B, and 13B models. Only µS models use the residual coefficient τ, which is dictated by model depth using results in Appendix A.3. (Includes columns for Model Params, Tokens, TPR, Steps, Batch Sz., Seq. Len., Width, Depth, # Heads, τ) |
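The experiment setup above specifies a cosine learning rate schedule "decaying to 10% of the maximum learning rate". As a minimal sketch of what that schedule looks like, assuming the standard cosine-annealing form (the function name and signature here are illustrative, not from the paper):

```python
import math

def cosine_lr(step: int, max_steps: int, max_lr: float, final_frac: float = 0.1) -> float:
    """Cosine learning-rate schedule decaying from max_lr to final_frac * max_lr.

    Hypothetical helper illustrating the schedule described in the paper:
    starts at max_lr, follows a half-cosine, and bottoms out at 10% of max_lr.
    """
    min_lr = final_frac * max_lr
    progress = min(step / max_steps, 1.0)  # fraction of training completed
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# At step 0 the rate equals max_lr; at max_steps it has decayed to 10% of max_lr.
```

For the paper's large-model runs this would be evaluated over the 10,000 training steps listed in Table 4.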