Peri-LN: Revisiting Normalization Layer in the Transformer Architecture
Authors: Jeonghoon Kim, Byeongchan Lee, Cheonbok Park, Yeontaek Oh, Beomjun Kim, Taehwan Yoo, Seongjin Shin, Dongyoon Han, Jinwoo Shin, Kang Min Yoo
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate our theoretical insight, we conduct extensive experiments on Transformers up to 3.2B parameters, showing that Peri-LN consistently achieves more balanced variance growth, steadier gradient flow, and convergence stability. |
| Researcher Affiliation | Collaboration | NAVER Cloud; Korea Advanced Institute of Science and Technology (KAIST); NAVER AI Lab. |
| Pseudocode | No | The paper contains mathematical formulas and propositions but no explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions using third-party tools like Megatron-LM and Hugging Face Transformers library for experiments, but does not include any explicit statement or link for the source code of the methodology described in this paper. |
| Open Datasets | Yes | We use the DCLM-baseline dataset (Li et al., 2024a), along with the cl100k_base version of the TikToken tokenizer. For the evaluation loss, we used 10K random samples from the C4 dataset (Raffel et al., 2020). LIMA dataset (Ouyang et al., 2022; Zhou et al., 2023). Wikitext dataset (Merity et al., 2016). |
| Dataset Splits | No | The paper mentions training on '30 billion tokens' and using '10K random samples from the C4 dataset' for evaluation loss, but it does not specify explicit train/test/validation splits (percentages or counts) for the datasets used to reproduce the experiments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used to run the experiments. |
| Software Dependencies | No | The paper mentions using Megatron-LM, the TikToken tokenizer, the Language Model Evaluation Harness, and the Hugging Face Transformers library, but it does not specify version numbers for these software dependencies. |
| Experiment Setup | Yes | Excluding the embedding parameters, the model sizes are set to 400M, 1.5B, and 3.2B, respectively. Each model is trained on 30 billion tokens. ... We perform an exploration of the learning rates, ranging from 1×10⁻⁴ to 5×10⁻³... The sequence length is set to 8192, and the weight decay coefficient is fixed at 0.033. ... For the normalization layer, we primarily employ RMSNorm. Further details are in Appendix D. [Appendix D includes]: Global Batch Size: 256; Weight Decay: 0.033; Iterations: 14400; Optimizer: Adam; LR Schedule: Cosine; Warmup: 10%; Weight Initialization: 0.02; Max Position Embeddings: 8192; Position Embedding Type: RoPE; Untie-embeddings-and-output-weights: True. Model configurations (n_layers, n_heads, d_model, d_head) for each size. |
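Since the Pseudocode row notes that the paper provides no explicit algorithm block, here is a minimal hedged sketch of the Peri-LN sublayer pattern in plain Python: normalization is applied at the periphery of each module, i.e., to the module's input (as in Pre-LN) and to its output before the residual addition. The `rms_norm` helper, the toy stand-in module, and all names are illustrative assumptions, not the authors' code.

```python
import math

def rms_norm(x, eps=1e-6):
    # RMSNorm without a learnable scale: x / sqrt(mean(x^2) + eps).
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def peri_ln_block(x, module):
    # Peri-LN sublayer (sketch): normalize the input to the module AND
    # normalize the module's output before adding the residual stream.
    normed_in = rms_norm(x)
    out = module(normed_in)
    normed_out = rms_norm(out)
    return [a + b for a, b in zip(x, normed_out)]

# Toy "module" (elementwise scaling) standing in for attention/MLP.
x_in = [1.0, -2.0, 3.0, 0.5]
y = peri_ln_block(x_in, lambda v: [2.0 * u for u in v])
```

Because the output is re-normalized before the residual add, each sublayer contributes a unit-RMS update to the hidden state, which is one way to read the paper's claim of "more balanced variance growth" across depth.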
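The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch; the dict below is a hedged reconstruction with illustrative field names, not the paper's actual code or Megatron-LM arguments.

```python
# Hedged reconstruction of the reported training setup; field names
# are illustrative assumptions, values are as quoted from the paper.
train_config = {
    "model_sizes": ["400M", "1.5B", "3.2B"],  # excluding embedding params
    "training_tokens": 30_000_000_000,
    "global_batch_size": 256,
    "sequence_length": 8192,
    "iterations": 14400,
    "optimizer": "Adam",
    "lr_schedule": "cosine",
    "lr_sweep_range": (1e-4, 5e-3),   # learning-rate exploration
    "warmup_fraction": 0.10,
    "weight_decay": 0.033,
    "weight_init_std": 0.02,
    "position_embedding": "RoPE",
    "max_position_embeddings": 8192,
    "untie_embeddings_and_output_weights": True,
    "norm_layer": "RMSNorm",
}

# Consistency check: 30B tokens / (256 seqs x 8192 tokens per seq)
# is roughly 14.3K steps, in line with the reported 14400 iterations.
tokens_per_iter = train_config["global_batch_size"] * train_config["sequence_length"]
```

The quoted numbers are mutually consistent: tokens per iteration times iterations lands within about 1% of the stated 30-billion-token budget.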