Taming Transformer Without Using Learning Rate Warmup
Authors: Xianbiao Qi, Yelin He, Jiaquan Ye, Chun-Guang Li, Bojia Zi, Xili Dai, Qin Zou, Rong Xiao
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments using ViT, Swin Transformer and GPT, showing that our optimization strategy can effectively and stably train these Transformers without using learning rate warmup. The quantitative results are shown in Table 1. Our baseline model is the corresponding Transformer using a learning rate warmup; whereas baseline models without using learning rate warmup will crash. AdamW2 demonstrates a very competitive performance compared to the baseline method. These experimental results verify that our understanding of the training dynamics of the Transformer is rational. We also conduct an ablation study of the choice of τ in GPT and Swin-Transformer. The results are shown in Figure 6. |
| Researcher Affiliation | Collaboration | Xianbiao Qi1, Yelin He1, Jiaquan Ye1, Chun-Guang Li2, Bojia Zi3, Xili Dai4, Qin Zou5, Rong Xiao1 — 1Intellifusion Inc. 2BUPT 3CUHK 4HKUST (GZ) 5WHU |
| Pseudocode | Yes | Algorithm 1 AdamW2: Taming Transformer via Weyl Inequality without learning rate warmup. |
| Open Source Code | No | For the ViT implementation, we use Timm (Wightman, 2019); the timm library provides rich model architectures of many pre-trained image models in PyTorch. For the GPT implementation, we use nanoGPT, which uses LayerNorm without a bias term, and thus only watches 13 terms, rather than the 15 terms in ViT. |
| Open Datasets | Yes | Our experiments include image classification on ImageNet (Deng et al., 2009) and large language modeling on the OpenWebText (Gokaslan & Cohen) dataset. |
| Dataset Splits | Yes | Our experiments include image classification on ImageNet (Deng et al., 2009) and large language modeling on the OpenWebText (Gokaslan & Cohen) dataset. We list some training configurations in Appendix N. |
| Hardware Specification | Yes | To reduce the training time, we limited our training to 100K steps instead of the full 600K steps. The comparison results are presented in Figure 13. We can see from Figure 13 that nanoGPT-large achieves stable training without warmup and obtains a similar validation loss to its counterpart, GPT2-large. This further verifies our understanding of the model crash of the Transformer. ...requiring two weeks to train 600K steps on 16 A800 GPUs. |
| Software Dependencies | No | For the ViT implementation, we use Timm (Wightman, 2019); the timm library provides rich model architectures of many pre-trained image models in PyTorch. For the GPT implementation, we use nanoGPT... |
| Experiment Setup | Yes | We list the training configurations of ViT, GPT, Swin-Transformer and Flatten-Swin in Table 2. For ViT, GPT, Swin-Transformer and Flatten-Swin, we do not use learning rate warmup. For GPT, we follow the experimental configurations of nanoGPT (Karpathy, 2022); all parameters are the same as GPT2 (Radford et al., 2019). For ViT, we use Timm (Wightman, 2019). For Swin-Transformer, we use the original code provided by Liu et al. (2021). For Flatten-Swin, we use the original code provided by Han et al. (2023). Table 2: Training configurations for ViT, GPT and Swin-Transformer. |
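The pseudocode row refers to Algorithm 1 ("AdamW2: Taming Transformer via Weyl Inequality without learning rate warmup"), whose full definition is in the paper. As a rough illustration of the underlying idea only, Weyl's inequality gives σ_max(W + ΔW) ≤ σ_max(W) + σ_max(ΔW), so bounding each update's spectral norm relative to the weight's bounds the growth of the weight's spectral norm without a learning-rate warmup. The sketch below is a hedged, minimal interpretation of that principle, not the paper's Algorithm 1: the function names, the scaling rule, and the default `tau` are all illustrative assumptions (the paper ablates τ in Figure 6).

```python
import numpy as np

def spectral_norm(m: np.ndarray) -> float:
    # Largest singular value, computed via SVD.
    return float(np.linalg.svd(m, compute_uv=False)[0])

def weyl_capped_step(weight: np.ndarray, update: np.ndarray,
                     tau: float = 0.3) -> np.ndarray:
    """Apply `update` to `weight`, rescaling it so that
    sigma_max(update) <= tau * sigma_max(weight).

    By Weyl's inequality, sigma_max(W + dW) <= sigma_max(W) + sigma_max(dW),
    so this cap guarantees the post-step spectral norm grows by at most a
    factor of (1 + tau) per step. Illustrative sketch only; not the paper's
    Algorithm 1.
    """
    s_w = spectral_norm(weight)
    s_u = spectral_norm(update)
    if s_u > tau * s_w and s_u > 0.0:
        update = update * (tau * s_w / s_u)  # shrink, never amplify
    return weight + update
```

In an optimizer loop, `update` would be the AdamW-style step (−lr · m̂/√v̂ plus weight decay) for each matrix-shaped parameter; the cap replaces the warmup schedule's role of keeping early, noisy updates from blowing up the weights' spectra.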