Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN
Authors: Pengxiang Li, Lu Yin, Shiwei Liu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments with various model sizes from 70M to 7B demonstrate that Mix-LN consistently outperforms both Pre-LN and Post-LN, promoting more balanced, healthier gradient norms throughout the network, and enhancing the overall quality of LLM pre-training. |
| Researcher Affiliation | Academia | Pengxiang Li (Dalian University of Technology), Lu Yin (University of Surrey), Shiwei Liu (University of Oxford) |
| Pseudocode | No | The paper describes methods and equations, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/pixeli99/MixLN. |
| Open Datasets | Yes | Specifically, we train LLaMA-130M models on the C4 dataset with either Pre-LN or Post-LN... Specifically, for LLaMA2-7B, we choose the commonly used MMLU (Hendrycks et al., 2020) as the evaluation task; for BERT-large, we opt for SQuAD v1.1 (Rajpurkar, 2016) as the evaluation task. Given the limited capacity of our in-house trained LLMs, we choose ARC-e (Clark et al., 2018) after supervised fine-tuning... on Commonsense170K (Hu et al., 2023)... on the ultrafeedback dataset... train the updated model for 120 epochs on ImageNet-1K following Liu et al. (2022a). |
| Dataset Splits | No | The paper mentions using well-known datasets like C4, MMLU, SQuAD v1.1, ARC-e, Commonsense170K, ultrafeedback, and ImageNet-1K. While these datasets typically have standard splits, the paper does not explicitly provide the specific percentages, sample counts, or citations to predefined train/validation/test splits used for all experiments. |
| Hardware Specification | No | The paper mentions that 'The training of LLMs is extraordinarily resource-intensive, often requiring thousands of GPUs or TPUs' in a general context, but does not specify the particular hardware (e.g., specific GPU models, CPU models, or memory details) used for running the experiments described in this paper. |
| Software Dependencies | No | The paper mentions using Adam optimizer, RMSNorm, and SwiGLU activations, but it does not specify the version numbers for any software libraries, programming languages, or other dependencies used for implementation. |
| Experiment Setup | Yes | Table 7 shows the main hyperparameters of the LLaMA models across model sizes. We use a max sequence length of 256 for all models, with a batch size of 512, and a total of 131K tokens per batch. Learning rate warmup is applied to the first 10% of the training steps. We train models using Adam with a cosine annealing learning rate schedule, decaying to 10% of the initial learning rate. We use a learning rate of 1e-3 for models with 250M parameters and below, and a learning rate of 5e-4 for the 1B parameter model. |
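For context on the method under review: the paper's title describes combining Pre-LN and Post-LN within one network, with Post-LN in the earlier layers and Pre-LN in the deeper layers. The sketch below illustrates that layer-wise split; the function name `mixln_placement` and the split fraction `alpha` are hypothetical illustration choices, not names taken from the paper's released code.

```python
# Minimal sketch of the Mix-LN idea from the paper title: earlier layers
# use Post-LN (normalization after the residual addition), deeper layers
# use Pre-LN (normalization before the sub-layer). The split fraction
# `alpha` is an assumed illustrative parameter.

def mixln_placement(num_layers: int, alpha: float = 0.25) -> list:
    """Return the normalization placement ('post-ln' or 'pre-ln')
    for each transformer layer index."""
    cutoff = int(alpha * num_layers)  # first `cutoff` layers get Post-LN
    return ["post-ln" if i < cutoff else "pre-ln" for i in range(num_layers)]

# Example: with 12 layers and alpha = 0.25, the first 3 layers
# use Post-LN and the remaining 9 use Pre-LN.
print(mixln_placement(12))
```

In an actual transformer implementation, this placement list would decide, per layer, whether `LayerNorm` is applied to the sub-layer input (Pre-LN) or to the residual sum (Post-LN).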