Taming Transformer Without Using Learning Rate Warmup
Authors: Xianbiao Qi, Yelin He, Jiaquan Ye, Chun-Guang Li, Bojia Zi, Xili Dai, Qin Zou, Rong Xiao
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments using ViT, Swin Transformer and GPT, showing that our optimization strategy can effectively and stably train these Transformers without using learning rate warmup. The quantitative results are shown in Table 1. Our baseline model is the corresponding Transformer using a learning rate warmup; whereas baseline models without using learning rate warmup will crash. AdamW2 demonstrates a very competitive performance compared to the baseline method. These experimental results verify that our understanding of the training dynamics of the Transformer is rational. We also conduct an ablation study of the choice of τ in GPT and Swin-Transformer. The results are shown in Figure 6. |
| Researcher Affiliation | Collaboration | Xianbiao Qi1, Yelin He1, Jiaquan Ye1, Chun-Guang Li2, Bojia Zi3, Xili Dai4, Qin Zou5, Rong Xiao1 — 1Intellifusion Inc. 2BUPT 3CUHK 4HKUST (GZ) 5WHU |
| Pseudocode | Yes | Algorithm 1 AdamW2: Taming Transformer via Weyl Inequality without learning rate warmup. |
| Open Source Code | No | For the ViT implementation, we use Timm (Wightman, 2019); the timm library provides rich model architectures of many pre-trained image models in PyTorch. For the GPT implementation, we use nanoGPT, which uses LayerNorm without a bias term, and thus only watches 13 terms, rather than the 15 terms in ViT. |
| Open Datasets | Yes | Our experiments include image classification on ImageNet (Deng et al., 2009) and large language modeling on the OpenWebText (Gokaslan & Cohen) dataset. |
| Dataset Splits | Yes | Our experiments include image classification on ImageNet (Deng et al., 2009) and large language modeling on the OpenWebText (Gokaslan & Cohen) dataset. We list some training configurations in Appendix N. |
| Hardware Specification | Yes | To reduce the training time, we limited our training to 100K steps instead of the full 600K steps. The comparison results are presented in Figure 13. We can see from Figure 13 that nanoGPT-large achieves stable training without warmup and obtains a similar validation loss to its counterpart, GPT2-large. This further verifies our understanding of the model crash of the Transformer. ...requiring two weeks to train 600K steps on 16 A800 GPUs. |
| Software Dependencies | No | For the ViT implementation, we use Timm (Wightman, 2019); the timm library provides rich model architectures of many pre-trained image models in PyTorch. For the GPT implementation, we use nanoGPT... |
| Experiment Setup | Yes | We list the training configurations of ViT, GPT, Swin-Transformer and Flatten-Swin in Table 2. For ViT, GPT, Swin-Transformer and Flatten-Swin, we do not use learning rate warmup. For GPT, we follow the experimental configurations of nanoGPT (Karpathy, 2022); all parameters are the same as GPT2 (Radford et al., 2019). For ViT, we use Timm (Wightman, 2019). For Swin-Transformer, we use the original code provided by Liu et al. (2021). For Flatten-Swin, we use the original code provided by Han et al. (2023). Table 2: Training configurations for ViT, GPT and Swin-Transformer. |
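The pseudocode row refers to Algorithm 1 ("AdamW2: Taming Transformer via Weyl Inequality without learning rate warmup"), whose full definition is in the paper. As a rough illustration of the underlying idea only, Weyl's inequality gives σ_max(W + ΔW) ≤ σ_max(W) + σ_max(ΔW), so bounding each update's spectral norm relative to the weight's bounds the growth of the weight's spectral norm without a learning-rate warmup. The sketch below is a hedged, minimal interpretation of that principle, not the paper's Algorithm 1: the function names, the scaling rule, and the default `tau` are all illustrative assumptions (the paper ablates τ in Figure 6).

```python
import numpy as np

def spectral_norm(m: np.ndarray) -> float:
    # Largest singular value, computed via SVD.
    return float(np.linalg.svd(m, compute_uv=False)[0])

def weyl_capped_step(weight: np.ndarray, update: np.ndarray,
                     tau: float = 0.3) -> np.ndarray:
    """Apply `update` to `weight`, rescaling it so that
    sigma_max(update) <= tau * sigma_max(weight).

    By Weyl's inequality, sigma_max(W + dW) <= sigma_max(W) + sigma_max(dW),
    so this cap guarantees the post-step spectral norm grows by at most a
    factor of (1 + tau) per step. Illustrative sketch only; not the paper's
    Algorithm 1.
    """
    s_w = spectral_norm(weight)
    s_u = spectral_norm(update)
    if s_u > tau * s_w and s_u > 0.0:
        update = update * (tau * s_w / s_u)  # shrink, never amplify
    return weight + update
```

In an optimizer loop, `update` would be the AdamW-style step (−lr · m̂/√v̂ plus weight decay) for each matrix-shaped parameter; the cap replaces the warmup schedule's role of keeping early, noisy updates from blowing up the weights' spectra.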