Differential Transformer
Authors: Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on language modeling show that DIFF Transformer outperforms Transformer in various settings of scaling up model size and training tokens. We scale up DIFF Transformer in terms of parameter count, training tokens, and context length. The scaling curves indicate that DIFF Transformer requires only about 65% of the model size or training tokens needed by Transformer to achieve comparable language modeling performance. Moreover, DIFF Transformer outperforms Transformer in various downstream tasks. The long-sequence evaluation also shows that DIFF Transformer is highly effective in utilizing the increasing context. In addition, the experimental results demonstrate that DIFF Transformer has intriguing advantages for large language models. For example, the proposed method substantially outperforms Transformer in key information retrieval, hallucination mitigation, in-context learning, and mathematical reasoning. DIFF Transformer also reduces outliers in model activations, which provides new opportunities for quantization. The findings establish DIFF Transformer as an effective and distinctive foundation architecture for large language models. |
| Researcher Affiliation | Collaboration | BNRist, Tsinghua University, Beijing, China; Microsoft Research |
| Pseudocode | Yes | Figure 2: Multi-head differential attention. Each head takes the difference between two softmax attention maps to cancel out attention noise. λ is a learnable scalar initialized to λ_init. GroupNorm applies normalization to each head independently. A fixed multiplier (1 - λ_init) is used after GroupNorm, which aligns the gradient flow with Transformer. The code implementation is available at https://aka.ms/Diff-Transformer. def DiffAttn(X, W_q, W_k, W_v, λ): Q1, Q2 = split(X @ W_q); K1, K2 = split(X @ W_k); V = X @ W_v # Qi, Ki: [b, n, d]; V: [b, n, 2d]; s = 1 / sqrt(d); A1 = Q1 @ K1.transpose(-1, -2) * s; A2 = Q2 @ K2.transpose(-1, -2) * s; return (softmax(A1) - λ * softmax(A2)) @ V. def MultiHead(X, W_q, W_k, W_v, W_o, λ): O = GroupNorm([DiffAttn(X, W_qi, W_ki, W_vi, λ) for i in range(h)]); O = O * (1 - λ_init); return Concat(O) @ W_o |
| Open Source Code | Yes | The code implementation is available at https://aka.ms/Diff-Transformer. |
| Open Datasets | Yes | We evaluate Differential Transformer for large language models from the following perspectives. First, we compare the proposed architecture with Transformers in various downstream tasks (Section 3.1) and study the properties of scaling up model size and training tokens (Section 3.2). Second, we conduct a length extension to 64K and evaluate the long-sequence modeling capability (Section 3.3). Third, we present the results of key information retrieval, contextual hallucination evaluation, and in-context learning (Sections 3.4–3.6). Fourth, we show that Differential Transformer can reduce outliers in the model activations compared to Transformer (Section 3.7). Fifth, we conduct extensive ablation studies for various design choices (Section 3.8). We train 3B-size DIFF Transformer language models on 1T tokens and compare with previous well-trained Transformer-based models (Geng & Liu, 2023; Tow, 2023; Tow et al., 2023) in various downstream tasks. As described in Appendix B, we follow the same setting to train a 3B-size Transformer language model on 350B tokens. The Needle-In-A-Haystack (Kamradt, 2023) test is widely used to evaluate the ability to extract critical information embedded in a large context. We follow the multi-needle evaluation protocol of LWM (Liu et al., 2024a) and Gemini 1.5 (Reid et al., 2024). We evaluate the 3B-size language models that support 64K input length (Section 3.3). We follow the evaluation protocol of Bertsch et al. (2024) and use constrained decoding (Ratner et al., 2023). Specifically, the TREC (Hovy et al., 2001) dataset has 6 classes, TREC-fine (Hovy et al., 2001) has 50 classes, Banking-77 (Casanueva et al., 2020) has 77 classes, and Clinic-150 (Larson et al., 2019) has 150 classes. We evaluate contextual hallucination of the 3B-size language models (described in Appendix B) on text summarization and question answering. We follow the evaluation protocol of Chuang et al. (2024). Summarization: Table 4a presents hallucination evaluation on summarization datasets XSum (Narayan et al., 2018), CNN/DM (See et al., 2017), and MultiNews (Fabbri et al., 2019). The Qasper (Dasigi et al., 2021) dataset is single-document question answering. In contrast, HotpotQA (Yang et al., 2018) and 2WikiMultihopQA (Ho et al., 2020) are multi-document question answering. All evaluation examples are from LongBench (Bai et al., 2023). We evaluate the models across 8 math benchmarks: GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), SVAMP (Patel et al., 2021), ASDiv (Miao et al., 2020), MAWPS (Koncel-Kedziorski et al., 2016), CARP (Zhang et al., 2023), TabMWP (Lu et al., 2023), and College Math (Tang et al., 2024). In the first stage, we train both DIFF Transformer and Transformer for an additional 20B tokens on synthetic math data (Li et al., 2024). In the second stage, we further distill the models from OpenThoughts-114K-Math (Open-R1, 2025), which is filtered from the OpenThoughts-114K (OpenThoughts, 2025) dataset and consists of 89K math samples with an average length of 6K tokens. We apply supervised fine-tuning on the dataset to equip the models with o1-style reasoning capability. |
| Dataset Splits | No | The paper describes training durations and number of tokens used for training (e.g., "We train the models with 1T tokens", "evaluate the 3B language models...every 40B tokens"), and specific evaluation protocols (e.g., "evaluated using 50 samples" for Needle-In-A-Haystack). However, it does not explicitly provide the specific percentages or sample counts for training/validation/test splits for the datasets used in its experiments, nor does it cite standard splits for all datasets. For example, for LM Eval Harness, it mentions zero-shot results, which is an evaluation protocol, but not the training splits. |
| Hardware Specification | Yes | The experiments are conducted with Nvidia H100-80GB GPU cards. |
| Software Dependencies | Yes | We employ the tiktoken-cl100k_base tokenizer. We use the AdamW (Loshchilov & Hutter, 2019) optimizer with β = (0.9, 0.95). We can directly reuse FlashAttention (Dao et al., 2022) as described in Appendix A, which significantly improves model efficiency. The customized-flash-attention implementation is built on FlashAttention-2 (Dao, 2023). With the recent release of FlashAttention-3 (Shah et al., 2024), the throughput gap can be further reduced. |
| Experiment Setup | Yes | Setup: We follow a similar recipe as StableLM-3B-4E1T (Tow et al., 2023). We set the hidden size to 3072. The number of layers is 28. The head dimension d is 128. The number of heads is 24 for Transformer and 12 for DIFF Transformer, to align computation FLOPs and model size. The total parameter count is about 2.8B. The training sequence length is 4096. The batch size is 4M tokens. We train the models with 1T tokens. We use the AdamW (Loshchilov & Hutter, 2019) optimizer with β = (0.9, 0.95). The maximal learning rate is 3.2e-4 with 1000 warmup steps and linearly decays to 1.28e-5. The training corpus also follows StableLM-3B-4E1T (Tow et al., 2023). We employ the tiktoken-cl100k_base tokenizer. Detailed hyperparameters are provided in Appendix D. Appendix D: Table 9 presents the detailed hyperparameters for the DIFF Transformer-3B models in Section 3.1. For Transformer-3B, the only difference is that there are 24 heads. Notice that both Transformer-3B and DIFF Transformer-3B have similar FLOPs. Hyperparameters: Layers 28; Hidden size 3072; FFN size 8192; Vocab size 100,288; Heads 12; Adam β (0.9, 0.95); LR 3.2e-4; Batch size 4M; Warmup steps 1000; Weight decay 0.1; Dropout 0.0 |
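The differential-attention pseudocode quoted in the table can be sketched as plain NumPy. This is a minimal, illustrative sketch, not the authors' implementation: it assumes unbatched input, no causal mask, and a simple per-head zero-mean/unit-variance normalization standing in for GroupNorm; all names and shapes are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def diff_attn(X, W_q, W_k, W_v, lam, d):
    # One differential-attention head.
    # X: [n, m]; W_q, W_k, W_v: [m, 2d]. Queries/keys are split in two.
    Q1, Q2 = np.split(X @ W_q, 2, axis=-1)
    K1, K2 = np.split(X @ W_k, 2, axis=-1)
    V = X @ W_v
    s = 1.0 / np.sqrt(d)
    A1 = softmax(Q1 @ K1.T * s)
    A2 = softmax(Q2 @ K2.T * s)
    # The difference of the two maps cancels common-mode attention noise.
    return (A1 - lam * A2) @ V

def group_norm(O, eps=1e-5):
    # Stand-in for per-head GroupNorm: normalize over the feature axis.
    mu = O.mean(axis=-1, keepdims=True)
    var = O.var(axis=-1, keepdims=True)
    return (O - mu) / np.sqrt(var + eps)

def multi_head_diff_attn(X, Ws_q, Ws_k, Ws_v, W_o, lam, lam_init, d):
    # Normalize each head, rescale by (1 - lam_init), concatenate, project.
    heads = [group_norm(diff_attn(X, Wq, Wk, Wv, lam, d)) * (1.0 - lam_init)
             for Wq, Wk, Wv in zip(Ws_q, Ws_k, Ws_v)]
    return np.concatenate(heads, axis=-1) @ W_o
```

Note the shape bookkeeping: each head consumes two d-dimensional query/key projections and produces a 2d-dimensional output, which is why DIFF Transformer halves the head count relative to a standard Transformer of the same width.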
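The head-count choice in the setup row (24 heads for Transformer vs. 12 for DIFF Transformer, head dimension d = 128) can be sanity-checked arithmetically: each differential head uses two d-dimensional query/key projections and a 2d-dimensional value, so halving the head count keeps the total projection width, and hence attention FLOPs, matched. A quick check under those stated numbers:

```python
hidden = 3072      # hidden size from the setup row
d = 128            # head dimension
heads_tf = 24      # standard Transformer heads
heads_diff = 12    # DIFF Transformer heads

# Standard attention: one d-dim query/key/value per head.
width_tf = heads_tf * d
# Differential attention: two d-dim queries/keys (2d total) per head.
width_diff = heads_diff * 2 * d

assert width_tf == width_diff == hidden  # projection widths match
```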