Hyper-Connections

Authors: Defa Zhu, Hongzhi Huang, Zihao Huang, Yutao Zeng, Yunyao Mao, Banggu Wu, Qiyang Min, Xun Zhou

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments focusing on the pre-training of large language models, including dense and sparse models, where hyper-connections show significant performance improvements over residual connections. Additional experiments on vision tasks demonstrate similar improvements. Using Pre-Norm as a baseline, we demonstrate the significant benefits of hyper-connections on 1B and 7B dense models as well as 7B MoE models, as detailed in Section 4. The benefits are particularly prominent for OLMoE (Muennighoff et al., 2024), as presented in Fig. 1. The model using DHC converges 1.8 times faster and improves by 6 points on ARC-Challenge compared to the baseline trained with 500B tokens. We conduct ablation studies on 1B models and assess the effectiveness of our method at the 7B model scale: we train a model with DHCs at an expansion rate of 4, denoted OLMo-7B-DHC×4. According to Table 5, OLMo-7B-DHC×4 significantly outperforms the baseline OLMo-7B model on all average metrics. We also evaluate hyper-connections on the Mixture-of-Experts (MoE) model. The full results are shown in Fig. 9, which illustrates that hyper-connections outperform residual connections on almost all metrics; on many metrics, our method requires only half the training tokens to match the baseline. Fig. 1 and Table 6 highlight some of the results, such as a reduction in training loss of approximately 0.027, a reduction in loss on the C4-en validation set of 0.028, an improvement of 6 points on ARC-Challenge, and an improvement of 1.2 points on MMLU Var.
Researcher Affiliation | Industry | Defa Zhu, Hongzhi Huang, Zihao Huang, Yutao Zeng, Yunyao Mao, Banggu Wu, Qiyang Min, Xun Zhou, Seed-Foundation-Model Team, ByteDance. This research was conducted at ByteDance Inc.
Pseudocode | Yes | The PyTorch implementations for both the static and dynamic variants of hyper-connections are detailed in Algorithms 2 and 3. Algorithm 1: Network with Hyper-Connections. Algorithm 2: Pseudocode of hyper-connections in a PyTorch-like style. Algorithm 3: Pseudocode of a transformer with hyper-connections in a PyTorch-like style.
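The paper's algorithms are given as PyTorch-style pseudocode. As a rough illustration of the mechanism they describe, here is a minimal NumPy sketch of one layer with static hyper-connections. The function name, the (n, d) stream layout, and the split of the width-connection matrix into one layer-input column and n residual-mixing columns are assumptions for illustration, not the paper's exact interface.

```python
import numpy as np

def static_hyper_connection(H, alpha, beta, layer_fn):
    """One sublayer with static hyper-connections (sketch, not the
    paper's code).

    H        : (n, d) matrix of n parallel hidden streams,
               n = expansion rate.
    alpha    : (n, n+1) width-connection weights; column 0 forms the
               layer input, columns 1..n remix the residual streams.
    beta     : (n,) depth-connection weights that distribute the layer
               output back over the streams.
    layer_fn : the transformer sublayer (attention or FFN), d -> d.
    """
    mix = alpha.T @ H            # (n+1, d): mixed streams
    layer_in = mix[0]            # weighted sum of streams -> layer input
    residual = mix[1:]           # (n, d): remixed residual streams
    out = layer_fn(layer_in)     # (d,): sublayer output
    # each stream k receives beta[k] * output on top of its residual
    return residual + np.outer(beta, out)
```

With expansion rate n = 1, alpha = [[1, 1]] and beta = [1], this collapses to the ordinary residual connection h + layer(h), which is consistent with the paper's framing of residual connections as a special case.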
Open Source Code | No | The PyTorch implementations for both the static and dynamic variants of hyper-connections are detailed in Algorithms 2 and 3, but there is no explicit statement of code release or a link to a repository for the methodology described in the paper.
Open Datasets | Yes | For dense models, we use dolma-v1.5-sample (Soldaini et al., 2024) as our training dataset. For MoE models, we train the OLMoE-1B-7B model, both with and without hyper-connections, on the OLMOE-MIX dataset. We use the ILSVRC-2012 ImageNet dataset (Deng et al., 2009) with 1k classes and 1.3M images (hereafter ImageNet) for image generation and classification.
Dataset Splits | Yes | We employ the experimental setup outlined by OLMo (Groeneveld et al., 2024) for dense models and by OLMoE (Muennighoff et al., 2024) for MoE models. In accordance with the OLMo methodology, we report the average perplexities (PPL) and losses on both the V2 and V3 validation sets, along with average metrics for zero-shot evaluation on downstream benchmarks (refer to Table 13). Table 13 lists the V2 and V3 validation sets by specific names, indicating predefined or standard splits from the OLMo methodology.
Hardware Specification | No | The actual memory footprint is empirically measured on 8 GPUs, as shown in Table 9. No specific GPU/CPU models or types are mentioned.
Software Dependencies | No | The PyTorch implementations for both the static and dynamic variants of hyper-connections are detailed in Algorithms 2 and 3. Table 12 (training hyperparameters for ViT) specifies the AdamW optimizer (β1 = 0.9, β2 = 0.999, ϵ = 1e-8) and bf16 precision. These name software components (PyTorch, AdamW, bf16) but do not provide specific version numbers for them.
Experiment Setup | Yes | Experiment settings: we employ the experimental setup outlined by OLMo (Groeneveld et al., 2024) for dense models and by OLMoE (Muennighoff et al., 2024) for MoE models. All experiments are trained on 500B tokens. At initialization, we scale the std of the weights of the output modules at all layers, including the second linear layer of the feedforward network and the output projector of the attention module, by a factor of n, where n represents the expansion rate. Table 12, training hyperparameters for ViT: Learning Rate 0.003; Batch Size 4096; Scheduler Cosine Annealing with Linear Warmup (10k steps); Data Augmentation Mixup (α = 0.2); Epochs 300; Optimizer AdamW (β1 = 0.9, β2 = 0.999, ϵ = 1e-8); Gradient Clipping 1.0; Weight Decay 0.3; Dropout 0.1; Precision bf16.
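The only initialization change described above is scaling the std of the output-module weights by the expansion rate n. A minimal sketch of that rule, assuming a plain Gaussian init; the helper name and signature are hypothetical, not from the paper:

```python
import numpy as np

def init_output_weight(fan_in, fan_out, base_std, expansion_rate, seed=0):
    """Hypothetical helper illustrating the described init rule: the std
    of output-module weights (second FFN linear, attention output
    projector) is the base std scaled by the expansion rate n."""
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, base_std * expansion_rate,
                      size=(fan_out, fan_in))
```

For example, with a base std of 0.02 and an expansion rate of 4 (as in OLMo-7B-DHC×4), the output projections would be drawn with std 0.08 while all other weights keep their usual initialization.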