Peri-LN: Revisiting Normalization Layer in the Transformer Architecture

Authors: Jeonghoon Kim, Byeongchan Lee, Cheonbok Park, Yeontaek Oh, Beomjun Kim, Taehwan Yoo, Seongjin Shin, Dongyoon Han, Jinwoo Shin, Kang Min Yoo

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To validate our theoretical insight, we conduct extensive experiments on Transformers up to 3.2B parameters, showing that Peri-LN consistently achieves more balanced variance growth, steadier gradient flow, and convergence stability."
Researcher Affiliation | Collaboration | 1) NAVER Cloud; 2) Korea Advanced Institute of Science and Technology (KAIST); 3) NAVER AI Lab.
Pseudocode | No | The paper contains mathematical formulas and propositions but no explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions using third-party tools such as Megatron-LM and the Hugging Face Transformers library for experiments, but it does not include any explicit statement or link releasing the source code for the methodology described in the paper.
Open Datasets | Yes | "We use the DCLM-baseline dataset (Li et al., 2024a), along with the cl100k_base version of the tiktoken tokenizer. For the evaluation loss, we used 10K random samples from the C4 dataset (Raffel et al., 2020)." The paper also uses the LIMA dataset (Ouyang et al., 2022; Zhou et al., 2023) and the WikiText dataset (Merity et al., 2016).
Dataset Splits | No | The paper mentions training on "30 billion tokens" and using "10K random samples from the C4 dataset" for evaluation loss, but it does not specify explicit train/test/validation splits (percentages or counts) for the datasets used to reproduce the experiments.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used to run the experiments.
Software Dependencies | No | The paper mentions Megatron-LM, the tiktoken tokenizer, the Language Model Evaluation Harness, and the Hugging Face Transformers library, but it does not specify version numbers for these software dependencies.
Experiment Setup | Yes | "Excluding the embedding parameters, the model sizes are set to 400M, 1.5B, and 3.2B parameters, respectively. Each model is trained on 30 billion tokens. ... We perform an exploration of the learning rates, ranging from 1×10⁻⁴ to 5×10⁻³. ... The sequence length is set to 8192, and the weight decay coefficient is fixed at 0.033. ... For the normalization layer, we primarily employ RMSNorm. Further details are in Appendix D." Appendix D adds: Global Batch Size: 256; Weight Decay: 0.033; Iterations: 14,400; Optimizer: Adam; LR Schedule: Cosine; Warmup: 10%; Weight Initialization: 0.02; Max Position Embeddings: 8192; Position Embedding Type: RoPE; Untie-embeddings-and-output-weights: True; and model configurations (n_layers, n_heads, d_model, d_head) for each size.
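The hyperparameters quoted above can be collected into a single configuration sketch. The snippet below is purely illustrative: the dataclass and function names are assumptions for this review, not the authors' code, and the cosine-with-warmup schedule is a standard reading of "LR Schedule: Cosine, Warmup: 10%" rather than a confirmed implementation detail.

```python
import math
from dataclasses import dataclass

# Illustrative reconstruction of the reported training setup.
# Field names are assumptions, not taken from the paper's code.
@dataclass
class PeriLNTrainConfig:
    global_batch_size: int = 256
    weight_decay: float = 0.033
    iterations: int = 14400
    optimizer: str = "adam"
    warmup_frac: float = 0.10        # 10% linear warmup
    init_std: float = 0.02           # weight initialization
    seq_len: int = 8192              # max position embeddings
    position_embedding: str = "rope"
    norm_layer: str = "rmsnorm"
    untie_embeddings: bool = True
    peak_lr: float = 1e-4            # paper explores 1e-4 to 5e-3


def lr_at_step(cfg: PeriLNTrainConfig, step: int, min_lr: float = 0.0) -> float:
    """Cosine decay with linear warmup over the first warmup_frac of training."""
    warmup_steps = int(cfg.warmup_frac * cfg.iterations)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return cfg.peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, cfg.iterations - warmup_steps)
    return min_lr + 0.5 * (cfg.peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With the defaults above, warmup covers the first 1,440 of 14,400 iterations; the learning rate peaks at the end of warmup and decays to `min_lr` by the final step.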