Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN

Authors: Pengxiang Li, Lu Yin, Shiwei Liu

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Extensive experiments with various model sizes from 70M to 7B demonstrate that Mix-LN consistently outperforms both Pre-LN and Post-LN, promoting more balanced, healthier gradient norms throughout the network, and enhancing the overall quality of LLM pre-training." |
| Researcher Affiliation | Academia | Pengxiang Li, Dalian University of Technology; Lu Yin, University of Surrey; Shiwei Liu, University of Oxford |
| Pseudocode | No | The paper describes methods and equations, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | "Our code is available at https://github.com/pixeli99/MixLN." |
| Open Datasets | Yes | "Specifically, we train LLaMA-130M models on the C4 dataset with either Pre-LN or Post-LN... Specifically, for LLaMA2-7B, we choose the commonly used MMLU (Hendrycks et al., 2020) as the evaluation task; for BERT-large, we opt for SQuAD v1.1 (Rajpurkar, 2016) as the evaluation task. Given the limited capacity of our in-house trained LLMs, we choose ARC-e (Clark et al., 2018) after supervised fine-tuning... on Commonsense170K (Hu et al., 2023)... on the ultrafeedback dataset... train the updated model for 120 epochs on ImageNet-1K following Liu et al. (2022a)." |
| Dataset Splits | No | The paper uses well-known datasets (C4, MMLU, SQuAD v1.1, ARC-e, Commonsense170K, ultrafeedback, ImageNet-1K). While these datasets typically have standard splits, the paper does not explicitly state the percentages, sample counts, or citations to predefined train/validation/test splits used for all experiments. |
| Hardware Specification | No | The paper notes that "the training of LLMs is extraordinarily resource-intensive, often requiring thousands of GPUs or TPUs" in a general context, but does not specify the particular hardware (e.g., GPU models, CPU models, or memory details) used for the experiments described in the paper. |
| Software Dependencies | No | The paper mentions using the Adam optimizer, RMSNorm, and SwiGLU activations, but does not specify version numbers for any software libraries, programming languages, or other dependencies used for implementation. |
| Experiment Setup | Yes | "Table 7 shows most of the hyperparameters of LLaMA models across model sizes. We use a max sequence length of 256 for all models, with a batch size of 512, and a total of 131K tokens per batch. Learning rate warmup is applied to the first 10% of the training steps. We train models using Adam with cosine annealing for the learning rate schedule, decaying to 10% of the initial learning rate. We use a learning rate of 1e-3 for models with 250M parameters and below, and a learning rate of 5e-4 for the 1B parameter model." |
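The core idea the report summarizes is a layer-wise combination of the two normalization placements: Post-LN in the earlier transformer layers and Pre-LN in the deeper ones. A minimal PyTorch sketch of that assignment is given below; the sublayer is simplified to a single linear map, and the class names, `build_mix_ln` helper, and the 25% Post-LN fraction are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class Block(nn.Module):
    """Toy residual block that applies LayerNorm either after the residual
    sum (Post-LN) or before the sublayer (Pre-LN)."""

    def __init__(self, dim: int, post_ln: bool):
        super().__init__()
        self.post_ln = post_ln
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Linear(dim, dim)  # stand-in for attention/FFN sublayers

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.post_ln:
            # Post-LN: normalize the output of the residual branch
            return self.norm(x + self.ff(x))
        # Pre-LN: normalize the input to the sublayer, keep residual clean
        return x + self.ff(self.norm(x))


def build_mix_ln(depth: int, dim: int, alpha: float = 0.25) -> nn.Sequential:
    """Mix-LN-style stack: the first `alpha` fraction of layers use Post-LN,
    the remaining deeper layers use Pre-LN (`alpha` chosen for illustration)."""
    n_post = int(alpha * depth)
    return nn.Sequential(*[Block(dim, post_ln=(i < n_post)) for i in range(depth)])


model = build_mix_ln(depth=8, dim=16, alpha=0.25)  # layers 0-1 Post-LN, 2-7 Pre-LN
```

The intended effect, per the report's summary, is that the Pre-LN layers keep gradients flowing to deep layers while the early Post-LN layers avoid the under-trained deep layers Pre-LN tends to produce.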
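The schedule quoted under Experiment Setup (warmup over the first 10% of steps, then cosine annealing down to 10% of the initial learning rate) can be sketched as a small helper. The function name is hypothetical, and the linear warmup shape is an assumption — the report states only the warmup fraction.

```python
import math


def lr_at_step(step: int, total_steps: int, base_lr: float = 1e-3,
               warmup_frac: float = 0.10, min_frac: float = 0.10) -> float:
    """Linear warmup for the first `warmup_frac` of steps, then cosine
    annealing from `base_lr` down to `min_frac * base_lr`."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        # ramp from base_lr / warmup_steps up to base_lr
        return base_lr * (step + 1) / warmup_steps
    # cosine decay over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    min_lr = min_frac * base_lr
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

With `base_lr=1e-3` (the reported setting for models of 250M parameters and below), the rate peaks at 1e-3 when warmup ends and decays to 1e-4 by the final step.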