MERIT: Maximum-normalized Element-wise Ratio for Language Model Large-batch Training

Authors: Yang Luo, Zangwei Zheng, Ziheng Qin, Zirui Zhu, Yong Liu, Yang You

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments of large-batch training across various sizes of GPT-2 models demonstrate the superior performance of MERIT. Notably, during the training of GPT-2 Medium, MERIT enables a 6k batch size without any performance degradation compared to the standard batch size (480) with 48B training tokens.
Researcher Affiliation | Academia | School of Computing, National University of Singapore, Singapore. Correspondence to: Yang You <EMAIL>.
Pseudocode | Yes | Algorithm 1 summarizes our proposed MERIT optimizer. The design of MERIT comes from two parts: maximum-normalized trust ratio and element-wise refinement.
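The paper's Algorithm 1 defines the actual MERIT update; as a rough illustration only, a "maximum-normalized trust ratio" could mean a LAMB-style layer-wise trust ratio computed with the max-norm (L-infinity) instead of the L2 norm. The function name and the exact normalization below are assumptions for illustration, not the paper's algorithm.

```python
def max_normalized_trust_ratio(weight, update, eps=1e-8):
    """Illustrative sketch (assumption, not the paper's Algorithm 1):
    a LAMB-style trust ratio where both the parameter block and its
    update are measured with the max-norm rather than the L2 norm."""
    w_norm = max(abs(w) for w in weight)   # max-norm of the weights
    u_norm = max(abs(u) for u in update)   # max-norm of the update
    return w_norm / (u_norm + eps)


# Halving the update's magnitude doubles the trust ratio, so the
# effective step size for the layer stays scale-aware.
r = max_normalized_trust_ratio([1.0, -2.0, 0.5], [0.5, -0.25, 0.1])
```

In a full optimizer this ratio would rescale the per-layer update before it is applied; the element-wise refinement mentioned in the quote would further adjust individual coordinates, but its details are left to Algorithm 1 in the paper.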
Open Source Code | Yes | Code is available at https://github.com/NUS-HPC-AI-Lab/MERIT/.
Open Datasets | Yes | Language modeling. We conducted large-batch training experiments on OpenWebText (Gokaslan & Cohen, 2019), training autoregressive models from scratch using settings derived from the Chinchilla scaling law (Hoffmann et al., 2022). ... We conducted additional experiments using C-Optim 1 to validate the performance of the proposed MERIT optimizer in the large-batch training of Llama models (Dubey et al., 2024). ... we trained models on 2.6B tokens from the C4 dataset (Raffel et al., 2023).
Dataset Splits | Yes | For data organization, we adopt the train-validation split provided by nanoGPT. The training dataset comprises 9 billion tokens, while the validation set contains 4.4 million tokens.
Hardware Specification | Yes | All models are trained on H100 GPUs. The 125M and 355M parameter models are trained on systems equipped with two H100 GPUs, whereas the 770M parameter models require machines with eight H100 GPUs.
Software Dependencies | No | We implement the algorithms in PyTorch (Paszke et al., 2019) and train all the models in bfloat16. All models are trained on H100 GPUs.
Experiment Setup | Yes | Following the Chinchilla scaling law, we use batch size 1K for GPT-2 small with 2B training tokens, 4K for GPT-2 medium with 8B tokens, and 8K for GPT-2 large with 16B tokens for the large-batch training setting. Our learning rate (LR) follows a cosine schedule, with the final LR set to 0.1 of the peak LR. We maintain a constant LR warm-up ratio of 0.02 and apply standard gradient clipping (norm) with a threshold of 1.0. In the case of Sophia-G, we select 240 examples from each minibatch to compute the diagonal Gauss-Newton and update the diagonal Hessian every 10 steps. ... For all models, all learning rates are tuned with grid search. The weight decay is set to 0.1 for all optimizers for a fair comparison. We follow Liu et al. (2024) for the settings of β values: For AdamW: β1 = 0.9 and β2 = 0.95. For Lion: β1 = 0.95 and β2 = 0.98. For Sophia-G: β1 = 0.92 and β2 = 0.99.
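The quoted setup pins down the LR schedule numerically: a 0.02 warm-up ratio, then cosine decay from the peak LR to 0.1 of the peak. A minimal sketch of such a schedule is below; the linear warm-up from zero and the function name are assumptions, since the quote specifies only the warm-up ratio, not its shape.

```python
import math

def lr_at_step(step, total_steps, peak_lr,
               warmup_ratio=0.02, final_lr_ratio=0.1):
    """Cosine LR schedule matching the quoted setup: warm up for the
    first 2% of steps (linear ramp assumed), then decay along a cosine
    from peak_lr down to 0.1 * peak_lr."""
    warmup_steps = int(total_steps * warmup_ratio)
    final_lr = peak_lr * final_lr_ratio
    if step < warmup_steps:
        # Linear warm-up from 0 to peak_lr (shape is an assumption).
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr to final_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))
```

Gradient clipping (norm threshold 1.0) and weight decay 0.1 would be applied separately in the training step, e.g. via `torch.nn.utils.clip_grad_norm_` in a PyTorch loop.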