Gated Delta Networks: Improving Mamba2 with Delta Rule
Authors: Songlin Yang, Jan Kautz, Ali Hatamizadeh
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our proposed architecture, Gated DeltaNet, consistently outperforms existing models such as Mamba2 and DeltaNet across a comprehensive suite of benchmarks, including language modeling, commonsense reasoning, in-context retrieval, length extrapolation, and long-context understanding. Building on these results, we also develop hybrid architectures that strategically combine Gated DeltaNet layers with sliding window attention or Mamba2 layers, further enhancing both training efficiency and model performance. In Table 3, we present the language modeling perplexity and zero-shot accuracy on commonsense reasoning benchmarks for models with 400M and 1.3B parameters. Gated DeltaNet consistently outperforms other linear models, including RetNet, HGRN2, Mamba, Mamba2, and DeltaNet, at both scales. As expected, the hybrid variant further enhances performance. |
| Researcher Affiliation | Collaboration | Songlin Yang MIT CSAIL EMAIL Jan Kautz NVIDIA EMAIL Ali Hatamizadeh NVIDIA EMAIL |
| Pseudocode | No | The paper describes the algorithms and formulations using mathematical equations and descriptive text, such as in Section 3.3 "ALGORITHM: HARDWARE-EFFICIENT CHUNKWISE TRAINING", but it does not contain explicit pseudocode or algorithm blocks with structured steps labeled as "Algorithm" or "Pseudocode". |
| Open Source Code | Yes | Code: https://github.com/NVlabs/GatedDeltaNet |
| Open Datasets | Yes | All models are trained under identical conditions with 1.3B parameters on 100B tokens sampled from the FineWeb-Edu dataset (Penedo et al., 2024). For synthetic tasks, we utilize the Needle-In-A-Haystack Single (NIAH-S) benchmark suite from RULER (Hsieh et al., 2024), which includes three increasingly complex tasks: S-NIAH-1 (passkey retrieval), S-NIAH-2 (numerical needle in haystack), and S-NIAH-3 (word-based needle in haystack). For real-world tasks, following Arora et al. (2024b), we evaluate on diverse datasets: SWDE (Lockard et al., 2019) for structured HTML relation extraction, FDA (Arora et al., 2023b) for PDF key-value retrieval, and several question-answering datasets including SQuAD (Rajpurkar et al., 2018), TriviaQA (Joshi et al., 2017a), DROP (Dua et al., 2019), and NQ (Kwiatkowski et al., 2019). We evaluate our model on multiple commonsense reasoning benchmarks: PIQA (Bisk et al., 2020), HellaSwag (Hella.; Zellers et al., 2019), WinoGrande (Wino.; Sakaguchi et al., 2020), ARC-easy (ARC-e) and ARC-challenge (ARC-c) (Clark et al., 2018), SIQA (Sap et al., 2019), BoolQ (Clark et al., 2019), WikiText (Wiki.; Merity et al., 2017), and LAMBADA (LMB.; Paperno et al., 2016). We evaluate on 14 tasks from LongBench (Bai et al., 2023), encompassing: narrative comprehension (NarrativeQA (Kočiský et al., 2018)), scientific understanding (Qasper (Dasigi et al., 2021)), multi-hop reasoning (MultiFieldQA, HotpotQA (Yang et al., 2018), 2WikiMultiQA (Ho et al., 2020), Musique (Trivedi et al., 2022)), document summarization (GovReport (Huang et al., 2021), QMSum (Zhong et al., 2021), MultiNews (Fabbri et al., 2019)), and various specialized tasks (TREC (Li & Roth, 2002), TriviaQA (Joshi et al., 2017b), SamSum (Gliwa et al., 2019), LCC (Guo et al., 2023), and RepoBench-P (Liu et al., 2023)). |
| Dataset Splits | No | The paper states that models are "trained under identical conditions with 1.3B parameters on 100B tokens sampled from the FineWeb-Edu dataset (Penedo et al., 2024)" and that "For sequence modeling, we set the training length to 4K tokens", and it refers to various external benchmarks for evaluation. However, it does not specify how the 100B FineWeb-Edu tokens were split into training, validation, or test sets for primary model development, nor does it detail the splits for the evaluation benchmarks, instead deferring to the standard evaluation practices of `lm-evaluation-harness`. |
| Hardware Specification | Yes | Figure 3: Training throughput comparison of 1.3B models on a single H100 GPU. |
| Software Dependencies | No | The paper mentions using the Llama2 tokenizer, the AdamW optimizer, and lm-evaluation-harness (Gao et al., 2021). However, it does not provide version numbers for these software components or for any other libraries used. |
| Experiment Setup | Yes | For fair comparison, all models are trained under identical conditions with 1.3B parameters on 100B tokens sampled from the FineWeb-Edu dataset (Penedo et al., 2024). We use the AdamW optimizer with a peak learning rate of 4e-4, weight decay of 0.1, and gradient clipping of 1.0. The learning rate follows a cosine annealing schedule with a 1B-token warm-up period and a batch size of 0.5M tokens. All models employ the Llama2 tokenizer with a vocabulary size of 32,000. For sequence modeling, we set the training length to 4K tokens, with Samba and our hybrid models using a sliding window size of 2K. |
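The gated delta rule at the heart of the paper combines Mamba2-style gating with DeltaNet's delta-rule update: S_t = S_{t-1}(α_t(I − β_t k_t k_tᵀ)) + β_t v_t k_tᵀ, with output o_t = S_t q_t. A minimal NumPy sketch of this recurrence in its naive token-by-token form is given below; the paper's actual implementation is the hardware-efficient chunkwise algorithm of Section 3.3, and all names and shapes here are illustrative:

```python
import numpy as np

def gated_delta_recurrence(q, k, v, alpha, beta):
    """Naive per-token gated delta rule.

    q, k: (T, d_k); v: (T, d_v); alpha, beta: (T,) gates in [0, 1].
    Returns per-token outputs of shape (T, d_v).
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))          # key-value memory state
    out = np.zeros((T, d_v))
    I = np.eye(d_k)
    for t in range(T):
        kt = k[t]
        # gated erase: decay the state by alpha_t and remove the
        # component along k_t, scaled by the writing strength beta_t
        S = S @ (alpha[t] * (I - beta[t] * np.outer(kt, kt)))
        # write: add the new key-value association
        S = S + beta[t] * np.outer(v[t], kt)
        out[t] = S @ q[t]             # read with the query
    return out
```

Setting α_t = 1 everywhere recovers DeltaNet's update, while β_t = 0 reduces the step to pure gated decay, which is the interpolation between Mamba2 and DeltaNet that the paper exploits.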
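The optimizer settings in the Experiment Setup row imply a concrete schedule: 100B tokens at 0.5M tokens per batch gives 200,000 steps, and the 1B-token warm-up gives 2,000 steps. The following sketch reproduces that learning-rate curve, assuming linear warm-up and annealing to zero (the paper does not state a minimum learning rate):

```python
import math

# Values quoted from the paper; step counts are derived from them:
# 100B tokens / 0.5M tokens per batch = 200,000 steps,
# 1B-token warm-up / 0.5M tokens per batch = 2,000 steps.
PEAK_LR = 4e-4
TOTAL_STEPS = 200_000
WARMUP_STEPS = 2_000

def cosine_lr(step):
    """Linear warm-up to PEAK_LR, then cosine annealing to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))
```

In a training loop this value would be assigned to the AdamW optimizer each step, with gradients clipped to the stated norm of 1.0 before each update.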