Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
u-$\mu$P: The Unit-Scaled Maximal Update Parametrization
Authors: Charles Blake, Constantin Eichenberg, Josef Dean, Lukas Balles, Luke Prince, Björn Deiseroth, Andres Felipe Cruz-Salinas, Carlo Luschi, Samuel Weinbach, Douglas Orr
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 EXPERIMENTS Our experiments use the Llama architecture (Touvron et al., 2023a) trained on WikiText-103 (Merity et al., 2017) (except large-scale runs in Section 4.4). ... Figure 1: (a) Two different HP sweeping processes used for µP and u-µP proxy models. ... (b) Using the best proxy HPs from (a), we train many models at different widths and LRs. ... (c) We re-train with a simple un-scaled .to(float8) cast on matmul inputs. ... 4.4 FP8 TRAINING In this section we justify the simple mixed-precision scheme described in Section 3.2 and demonstrate that it can be used to train u-µP models out-of-the-box. Proof-of-concept. Figure 5 shows the RMS of all linear layer inputs for a moderately sized transformer. ... Larger scale. Next we consider a more realistic training scenario. Using the same architecture, and following the steps set out in our u-µP user-guide (Appendix D), we train our target models on 300B tokens of the SlimPajama dataset (Shen et al., 2023) (see Appendix A.8 for training details). ... All FP8 runs converge and show no significant loss degradation. In comparison to SP, the u-µP models have a qualitatively different training curve with a higher loss for most of training that catches up in latter stages, hinting at a fundamentally different optimization trajectory. In terms of downstream performance, both of the u-µP 7B models are competitive with SP. In particular, the scores of the FP8 model are mostly on par with the BF16 models (see Table 4). |
| Researcher Affiliation | Industry | Charlie Blake Graphcore Constantin Eichenberg Aleph Alpha Josef Dean Graphcore Lukas Balles Aleph Alpha Luke Y. Prince Graphcore Björn Deiseroth Aleph Alpha Andres Felipe Cruz-Salinas Cohere Carlo Luschi Graphcore Samuel Weinbach Aleph Alpha Douglas Orr Graphcore Correspondence to: EMAIL, EMAIL. |
| Pseudocode | Yes | Algorithm 1 Transfer Error. Require: a fixed HP with candidate values F = {f₁, …, fₙ}; a transfer HP with candidate values T = {t₁, …, tₘ}; a function L : F × T → ℝ giving the final validation loss for a pair of HPs (all other HPs fixed at default values). err ← 0; (f∗, t∗) ← argmin(L); for f in F do: if f ≠ f∗ then t̂ ← argminₜ L(f, t); err += L(f, t∗) − L(f, t̂) end if; end for; return err/(n − 1) |
| Open Source Code | No | The codebase used for Tensor Programs V, allowing us to compare µP and u-µP in the same setting. ... The proposed implementation in the mup library (Microsoft, 2024) reflects this, requiring an extra base model to be created and the original model to be re-initialized. ... Each model was trained on several Nvidia A100 (80GB) or H100 GPUs, with all FP8 experiments conducted on the H100 chips utilizing their native FP8 support. For the FP8 operations we use PyTorch's torch._scaled_mm function as a backbone. ... It should be noted that as of PyTorch version 2.3, torch._scaled_mm always computes amax as well as the matrix multiplication. |
| Open Datasets | Yes | Our experiments use the Llama architecture (Touvron et al., 2023a) trained on WikiText-103 (Merity et al., 2017) (except large-scale runs in Section 4.4). ... SlimPajama dataset (Shen et al., 2023) (see Appendix A.8 for training details). |
| Dataset Splits | No | Dataset: WikiText-103 (Merity et al., 2017); Sequence length: 256; Vocab size: 32000; Training set tokens: 138M; Architecture: Llama (Touvron et al., 2023a) ... Dataset: SlimPajama (Shen et al., 2023); Sequence length: 4096; Vocab size: 65536; Training set tokens: 600B |
| Hardware Specification | Yes | Each model was trained on several Nvidia A100 (80GB) or H100 GPUs, with all FP8 experiments conducted on the H100 chips utilizing their native FP8 support. For the FP8 operations we use PyTorch's torch._scaled_mm function as a backbone. ... Figure 22 demonstrates hardware utilization for FP8, FP16, and FP32 matrix multiplications on a single NVIDIA H100 PCIe card. |
| Software Dependencies | Yes | For the FP8 operations we use PyTorch's torch._scaled_mm function as a backbone. ... It should be noted that as of PyTorch version 2.3, torch._scaled_mm always computes amax as well as the matrix multiplication. ... Optimizer AdamW, (β₁, β₂, ϵ) = (0.9, 0.999, 10⁻⁸); Weight decay 2⁻¹³, independent (Loshchilov & Hutter, 2019) |
| Experiment Setup | Yes | Where not specified otherwise, the default settings used in our experiments are given in Table 5. These also represent the settings of our proxy model. Table 5: Default hyperparameters and training settings. Dataset: WikiText-103 (Merity et al., 2017); Sequence length: 256; Vocab size: 32000; Training set tokens: 138M; Architecture: Llama (Touvron et al., 2023a) (Transformer, PreNorm, RMSNorm, SwiGLU, RoPE, untied embeddings), non-trainable RMSNorm parameters; Width: 256 (scaled up to 4096); Depth: 4; Number of heads: 4 (scaled up to 64); Head dimension: 64; Total parameters: 19.5M (scaled up to 1.07B); Batch size: 64; Training steps: 8192 (0.97 epochs); LR schedule: cosine to 10%, 2000 steps warm-up; Optimizer: AdamW, (β₁, β₂, ϵ) = (0.9, 0.999, 10⁻⁸); Weight decay: 2⁻¹³, independent (Loshchilov & Hutter, 2019); Dropout: 0.0; µP HP search range: η ∈ [2⁻¹⁰, 2⁻⁶], η̂emb ∈ [2⁰, 2⁸], σinit, αemb, αattn, αoutput ∈ [2⁻², 2²]; u-µP HP search range: η ∈ [2⁻¹, 2³], αattn ∈ [2⁻², 2²], αresidual, αresidual-attn-ratio, αffn-act, αoutput ∈ [2⁻³, 2³]; µP HP defaults: σinit = αemb = αattn = αoutput = η̂emb = 1; u-µP HP defaults: αresidual = αresidual-attn-ratio = αffn-act = αoutput = αattn = 1 |
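The transfer-error pseudocode quoted in the Pseudocode row above can be sketched in plain Python. This is an illustrative reconstruction, not the authors' code: the loss grid `L` and the helper name `transfer_error` are hypothetical stand-ins, and the loss table is a toy example rather than real sweep data.

```python
def transfer_error(F, T, L):
    """Average regret from transferring the jointly optimal transfer-HP
    value t* (found at the best fixed-HP value f*) to the other
    fixed-HP values in F.

    F, T: candidate values for the fixed and transfer HP.
    L:    dict mapping (f, t) -> final validation loss.
    """
    # (f*, t*): the jointly optimal pair over the whole grid.
    f_star, t_star = min(L, key=L.get)
    err = 0.0
    for f in F:
        if f != f_star:
            # Best transfer-HP value when the fixed HP is held at f.
            t_hat = min(T, key=lambda t: L[(f, t)])
            # Regret of reusing the transferred t* instead of t_hat at f.
            err += L[(f, t_star)] - L[(f, t_hat)]
    return err / (len(F) - 1)


# Toy 2x2 loss grid: the optimum is at (f=0, t=0), but at f=1 the
# best transfer HP is t=1, so transferring t*=0 costs 1.5 - 1.2.
losses = {
    (0, 0): 1.0, (0, 1): 2.0,
    (1, 0): 1.5, (1, 1): 1.2,
}
print(transfer_error([0, 1], [0, 1], losses))  # ~0.3
```

A low transfer error means the transfer HP tuned on the proxy setting remains near-optimal as the fixed HP (e.g. width) changes, which is the property µ-transfer is meant to provide.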
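The Research Type row quotes the paper's key FP8 claim: because u-µP keeps linear-layer inputs at roughly unit RMS, a plain unscaled `.to(float8)` cast on matmul inputs stays in range. A minimal sketch of that range argument in plain Python, assuming the OCP E4M3 FP8 format (max finite value 448); the helper names are hypothetical:

```python
import math

def rms(xs):
    """Root-mean-square of a list of floats."""
    return math.sqrt(sum(x * x for x in xs) / len(xs))

def safe_for_plain_fp8_cast(xs, max_abs=448.0):
    """Crude in-range check: E4M3 FP8 tops out at 448, so a tensor
    whose values sit near unit RMS can be cast directly, with no
    per-tensor scale factor (and hence no amax computation needed)."""
    return max(abs(x) for x in xs) <= max_abs

# A unit-RMS tensor is comfortably representable under a plain cast;
# a tensor with large activations would overflow and need rescaling.
unit = [1.0, -0.5, 1.2, -0.9]
print(rms(unit), safe_for_plain_fp8_cast(unit))
print(safe_for_plain_fp8_cast([1000.0, 2.0]))
```

This is only the range half of the story; the quoted Figure 5 is the empirical half, showing that u-µP actually keeps those RMS values near one throughout training.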