Peri-LN: Revisiting Normalization Layer in the Transformer Architecture

Authors: Jeonghoon Kim, Byeongchan Lee, Cheonbok Park, Yeontaek Oh, Beomjun Kim, Taehwan Yoo, Seongjin Shin, Dongyoon Han, Jinwoo Shin, Kang Min Yoo

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To validate our theoretical insight, we conduct extensive experiments on Transformers up to 3.2B parameters, showing that Peri-LN consistently achieves more balanced variance growth, steadier gradient flow, and convergence stability."
Researcher Affiliation | Collaboration | 1) NAVER Cloud; 2) Korea Advanced Institute of Science and Technology (KAIST); 3) NAVER AI Lab.
Pseudocode | No | The paper contains mathematical formulas and propositions but no explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions using third-party tools such as Megatron-LM and the Hugging Face Transformers library for experiments, but it does not include any explicit statement or link releasing the source code for the methodology described in the paper.
Open Datasets | Yes | "We use the DCLM-baseline dataset (Li et al., 2024a), along with the cl100k_base version of the tiktoken tokenizer. For the evaluation loss, we used 10K random samples from the C4 dataset (Raffel et al., 2020)." The paper also uses the LIMA dataset (Ouyang et al., 2022; Zhou et al., 2023) and the WikiText dataset (Merity et al., 2016).
Dataset Splits | No | The paper mentions training on "30 billion tokens" and using "10K random samples from the C4 dataset" for evaluation loss, but it does not specify explicit train/test/validation splits (percentages or counts) for the datasets used to reproduce the experiments.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used to run the experiments.
Software Dependencies | No | The paper mentions Megatron-LM, the tiktoken tokenizer, the Language Model Evaluation Harness, and the Hugging Face Transformers library, but it does not specify version numbers for these software dependencies.
Experiment Setup | Yes | "Excluding the embedding parameters, the model sizes are set to 400M, 1.5B, and 3.2B parameters, respectively. Each model is trained on 30 billion tokens. ... We perform an exploration of the learning rates, ranging from 1×10⁻⁴ to 5×10⁻³. ... The sequence length is set to 8192, and the weight decay coefficient is fixed at 0.033. ... For the normalization layer, we primarily employ RMSNorm. Further details are in Appendix D." Appendix D adds: Global Batch Size: 256; Weight Decay: 0.033; Iterations: 14,400; Optimizer: Adam; LR Schedule: Cosine; Warmup: 10%; Weight Initialization: 0.02; Max Position Embeddings: 8192; Position Embedding Type: RoPE; Untie-embeddings-and-output-weights: True; and model configurations (n_layers, n_heads, d_model, d_head) for each size.
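The hyperparameters quoted above can be collected into a single configuration sketch. The snippet below is purely illustrative: the dataclass and function names are assumptions for this review, not the authors' code, and the cosine-with-warmup schedule is a standard reading of "LR Schedule: Cosine, Warmup: 10%" rather than a confirmed implementation detail.

```python
import math
from dataclasses import dataclass

# Illustrative reconstruction of the reported training setup.
# Field names are assumptions, not taken from the paper's code.
@dataclass
class PeriLNTrainConfig:
    global_batch_size: int = 256
    weight_decay: float = 0.033
    iterations: int = 14400
    optimizer: str = "adam"
    warmup_frac: float = 0.10        # 10% linear warmup
    init_std: float = 0.02           # weight initialization
    seq_len: int = 8192              # max position embeddings
    position_embedding: str = "rope"
    norm_layer: str = "rmsnorm"
    untie_embeddings: bool = True
    peak_lr: float = 1e-4            # paper explores 1e-4 to 5e-3


def lr_at_step(cfg: PeriLNTrainConfig, step: int, min_lr: float = 0.0) -> float:
    """Cosine decay with linear warmup over the first warmup_frac of training."""
    warmup_steps = int(cfg.warmup_frac * cfg.iterations)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return cfg.peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, cfg.iterations - warmup_steps)
    return min_lr + 0.5 * (cfg.peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With the defaults above, warmup covers the first 1,440 of 14,400 iterations; the learning rate peaks at the end of warmup and decays to `min_lr` by the final step.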