MERIT: Maximum-normalized Element-wise Ratio for Language Model Large-batch Training

Authors: Yang Luo, Zangwei Zheng, Ziheng Qin, Zirui Zhu, Yong Liu, Yang You

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments of large-batch training across various sizes of GPT-2 models demonstrate the superior performance of MERIT. Notably, during the training of GPT-2 Medium, MERIT enables a 6k batch size without any performance degradation compared to the standard batch size (480) with 48B training tokens.
Researcher Affiliation | Academia | School of Computing, National University of Singapore, Singapore. Correspondence to: Yang You <EMAIL>.
Pseudocode | Yes | Algorithm 1 summarizes our proposed MERIT optimizer. The design of MERIT comes from two parts: maximum-normalized trust ratio and element-wise refinement.
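The paper's Algorithm 1 defines the actual MERIT update; as a rough illustration only, a "maximum-normalized trust ratio" could mean a LAMB-style layer-wise trust ratio computed with the max-norm (L-infinity) instead of the L2 norm. The function name and the exact normalization below are assumptions for illustration, not the paper's algorithm.

```python
def max_normalized_trust_ratio(weight, update, eps=1e-8):
    """Illustrative sketch (assumption, not the paper's Algorithm 1):
    a LAMB-style trust ratio where both the parameter block and its
    update are measured with the max-norm rather than the L2 norm."""
    w_norm = max(abs(w) for w in weight)   # max-norm of the weights
    u_norm = max(abs(u) for u in update)   # max-norm of the update
    return w_norm / (u_norm + eps)


# Halving the update's magnitude doubles the trust ratio, so the
# effective step size for the layer stays scale-aware.
r = max_normalized_trust_ratio([1.0, -2.0, 0.5], [0.5, -0.25, 0.1])
```

In a full optimizer this ratio would rescale the per-layer update before it is applied; the element-wise refinement mentioned in the quote would further adjust individual coordinates, but its details are left to Algorithm 1 in the paper.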
Open Source Code | Yes | Code is available at https://github.com/NUS-HPC-AI-Lab/MERIT/.
Open Datasets | Yes | Language modeling. We conducted large-batch training experiments on OpenWebText (Gokaslan & Cohen, 2019), training autoregressive models from scratch using settings derived from the Chinchilla scaling law (Hoffmann et al., 2022). ... We conducted additional experiments using C-Optim 1 to validate the performance of the proposed MERIT optimizer in the large-batch training of Llama models (Dubey et al., 2024). ... we trained models on 2.6B tokens from the C4 dataset (Raffel et al., 2023).
Dataset Splits | Yes | For data organization, we adopt the train-validation split provided by nanoGPT. The training dataset comprises 9 billion tokens, while the validation set contains 4.4 million tokens.
Hardware Specification | Yes | All models are trained on H100 GPUs. The 125M and 355M parameter models are trained on systems equipped with two H100 GPUs, whereas the 770M parameter models require machines with eight H100 GPUs.
Software Dependencies | No | We implement the algorithms in PyTorch (Paszke et al., 2019) and train all the models in bfloat16. All models are trained on H100 GPUs.
Experiment Setup | Yes | Following the Chinchilla scaling law, we use batch size 1K for GPT-2 small with 2B training tokens, 4K for GPT-2 medium with 8B tokens, and 8K for GPT-2 large with 16B tokens for the large-batch training setting. Our learning rate (LR) follows a cosine schedule, with the final LR set to 0.1 of the peak LR. We maintain a constant LR warm-up ratio of 0.02 and apply standard gradient clipping (norm) with a threshold of 1.0. In the case of Sophia-G, we select 240 examples from each minibatch to compute the diagonal Gauss-Newton and update the diagonal Hessian every 10 steps. ... For all models, all learning rates are tuned with grid search. The weight decay is set to 0.1 for all optimizers for a fair comparison. We follow Liu et al. (2024) for the settings of β values: For AdamW: β1 = 0.9 and β2 = 0.95. For Lion: β1 = 0.95 and β2 = 0.98. For Sophia-G: β1 = 0.92 and β2 = 0.99.
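The quoted setup pins down the LR schedule numerically: a 0.02 warm-up ratio, then cosine decay from the peak LR to 0.1 of the peak. A minimal sketch of such a schedule is below; the linear warm-up from zero and the function name are assumptions, since the quote specifies only the warm-up ratio, not its shape.

```python
import math

def lr_at_step(step, total_steps, peak_lr,
               warmup_ratio=0.02, final_lr_ratio=0.1):
    """Cosine LR schedule matching the quoted setup: warm up for the
    first 2% of steps (linear ramp assumed), then decay along a cosine
    from peak_lr down to 0.1 * peak_lr."""
    warmup_steps = int(total_steps * warmup_ratio)
    final_lr = peak_lr * final_lr_ratio
    if step < warmup_steps:
        # Linear warm-up from 0 to peak_lr (shape is an assumption).
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr to final_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))
```

Gradient clipping (norm threshold 1.0) and weight decay 0.1 would be applied separately in the training step, e.g. via `torch.nn.utils.clip_grad_norm_` in a PyTorch loop.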