Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN
Authors: Pengxiang Li, Lu Yin, Shiwei Liu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments with various model sizes from 70M to 7B demonstrate that Mix-LN consistently outperforms both Pre-LN and Post-LN, promoting more balanced, healthier gradient norms throughout the network, and enhancing the overall quality of LLM pre-training. |
| Researcher Affiliation | Academia | Pengxiang Li (Dalian University of Technology), Lu Yin (University of Surrey), Shiwei Liu (University of Oxford) |
| Pseudocode | No | The paper describes methods and equations, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/pixeli99/MixLN. |
| Open Datasets | Yes | Specifically, we train LLaMA-130M models on the C4 dataset with either Pre-LN or Post-LN... Specifically, for LLaMA2-7B, we choose the commonly used MMLU (Hendrycks et al., 2020) as the evaluation task; for BERT-large, we opt for SQuAD v1.1 (Rajpurkar, 2016) as the evaluation task. Given the limited capacity of our in-house trained LLMs, we choose ARC-e (Clark et al., 2018) after supervised fine-tuning... on Commonsense170K (Hu et al., 2023)... on the ultrafeedback dataset... train the updated model for 120 epochs on ImageNet-1K following Liu et al. (2022a). |
| Dataset Splits | No | The paper mentions using well-known datasets like C4, MMLU, SQuAD v1.1, ARC-e, Commonsense170K, ultrafeedback, and ImageNet-1K. While these datasets typically have standard splits, the paper does not explicitly provide the specific percentages, sample counts, or citations to predefined train/validation/test splits used for all experiments. |
| Hardware Specification | No | The paper mentions that 'The training of LLMs is extraordinarily resource-intensive, often requiring thousands of GPUs or TPUs' in a general context, but does not specify the particular hardware (e.g., specific GPU models, CPU models, or memory details) used for running the experiments described in this paper. |
| Software Dependencies | No | The paper mentions using Adam optimizer, RMSNorm, and SwiGLU activations, but it does not specify the version numbers for any software libraries, programming languages, or other dependencies used for implementation. |
| Experiment Setup | Yes | Table 7 shows the main hyperparameters of the LLaMA models across model sizes. We use a max sequence length of 256 for all models, with a batch size of 512, and a total of 131K tokens per batch. Learning rate warmup is applied to the first 10% of the training steps. We train models using Adam with a cosine annealing learning rate schedule, decaying to 10% of the initial learning rate. We use a learning rate of 1e-3 for models with 250M parameters and below, and a learning rate of 5e-4 for the 1B parameter model. |
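For context on the method under review: the paper's title describes combining Pre-LN and Post-LN within one network, with Post-LN in the earlier layers and Pre-LN in the deeper layers. The sketch below illustrates that layer-wise split; the function name `mixln_placement` and the split fraction `alpha` are hypothetical illustration choices, not names taken from the paper's released code.

```python
# Minimal sketch of the Mix-LN idea from the paper title: earlier layers
# use Post-LN (normalization after the residual addition), deeper layers
# use Pre-LN (normalization before the sub-layer). The split fraction
# `alpha` is an assumed illustrative parameter.

def mixln_placement(num_layers: int, alpha: float = 0.25) -> list:
    """Return the normalization placement ('post-ln' or 'pre-ln')
    for each transformer layer index."""
    cutoff = int(alpha * num_layers)  # first `cutoff` layers get Post-LN
    return ["post-ln" if i < cutoff else "pre-ln" for i in range(num_layers)]

# Example: with 12 layers and alpha = 0.25, the first 3 layers
# use Post-LN and the remaining 9 use Pre-LN.
print(mixln_placement(12))
```

In an actual transformer implementation, this placement list would decide, per layer, whether `LayerNorm` is applied to the sub-layer input (Pre-LN) or to the residual sum (Post-LN).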