Hymba: A Hybrid-head Architecture for Small Language Models
Authors: Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Celine Lin, Jan Kautz, Pavlo Molchanov
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive evaluations and ablation studies demonstrate that Hymba not only establishes new state-of-the-art (SOTA) benchmark performance across a wide range of tasks but also achieves greater efficiency compared to transformers and previous hybrid models. We provide the benchmark with other representative small LMs in Fig. 1, with more comprehensive benchmarks in Fig. 6. For instance, in commonsense reasoning tasks, Hymba-1.5B can outperform Llama-3.2-3B with 1.32% higher average accuracy, while requiring an 11.67× smaller cache size and being 3.49× faster. |
| Researcher Affiliation | Collaboration | Xin Dong1 , Yonggan Fu1,2 , Shizhe Diao1, Wonmin Byeon1, Zijia Chen1, Ameya Sunil Mahabaleshwarkar1, Shih-Yang Liu1,3, Matthijs Van Keirsbilck1, Min-Hung Chen1, Yoshi Suhara1, Yingyan (Celine) Lin1,2, Jan Kautz1, Pavlo Molchanov1 1NVIDIA 2Georgia Institute of Technology 3Hong Kong University of Science and Technology |
| Pseudocode | Yes | Algorithm 1: Forward Process of Hymba-1.5B. Input: X = [x_1, ..., x_n] ∈ ℝ^{n×d}, the text input tokens. Model configurations: number of blocks: 32; block indices with global attention: [1, 16, 32]; KV-reuse groups: [2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13], [14, 15], [17, 18, 19], [20, 21], [22, 23], [24, 25], [26, 27], [28, 29], [30, 31] (KV is shared within each group). Model forward: X_0 = [R, X] = [r_1, ..., r_m, x_1, ..., x_n], i.e., prepend m meta tokens R ∈ ℝ^{m×d}. For block i in [1, ..., 32]: if i ∈ [1, 16, 32], then X_i = HymbaBlock-GA(X_{i-1}) (global attention); else if block i is the first block in its KV-reuse group, then X_i, KV_i = HymbaBlock-SWA(X_{i-1}) (sliding-window attention); else retrieve the KV cache KV_{i-1} from the previous layer and compute X_i = HymbaBlock-SWA(X_{i-1}, KV_{i-1}) (reuse KV). |
| Open Source Code | Yes | Models on Hugging Face: Hymba-1.5B-Base | Hymba-1.5B-Instruct |
| Open Datasets | Yes | We train Hymba-125M/350M/1.5B models using a mix of DCLM-Baseline-1.0 (Li et al., 2024), SmolLM-Corpus (Ben Allal et al., 2024), and a proprietary high-quality dataset, with 1T, 250B, and 50B tokens, respectively. |
| Dataset Splits | Yes | The training set of RoleBench is used for training, and the model is evaluated on two sub-tasks: instruction generalization (Inst. Gene.) and role generalization (Role. Gene.). |
| Hardware Specification | Yes | The throughput is measured on an NVIDIA A100 with a sequence length of 8k and a batch size of 128 using PyTorch. For models encountering out-of-memory (OOM) issues during throughput measurement, we halve the batch size until the OOM is resolved. This approach is used to measure the maximal achievable throughput without OOM. |
| Software Dependencies | No | The throughput is measured on an NVIDIA A100 with a sequence length of 8k and a batch size of 128 using PyTorch. ... We implement the finetuning and DPO training with the LMFlow toolkit (Diao et al., 2024). |
| Experiment Setup | Yes | We adopt the WSD learning rate scheduler (Hu et al., 2024) with three phases: (1) warmup steps set to 1% of the total steps, (2) a stable phase maintaining the peak learning rate of 3e-3, and (3) a decay phase reducing the learning rate to 1e-5 over 20% of the total steps, while gradually annealing to smaller, higher-quality datasets like SmolLM-Corpus and the internal dataset. We use a sequence length of 2K and a batch size of 2M tokens throughout the training process, which is conducted on 128 NVIDIA A100 GPUs. |
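The block layout in Algorithm 1 can be sketched in Python. This is a minimal illustration of the control flow only: the block indices and KV-reuse groups are taken verbatim from the paper's Algorithm 1, while `hymba_block_ga`/`hymba_block_swa` are hypothetical stand-ins for the real Hymba blocks.

```python
# Sketch of Algorithm 1's control flow: 32 blocks, three with global
# attention; the remaining sliding-window-attention blocks share one KV
# cache per group, produced by the first block of the group.

GLOBAL_ATTN_BLOCKS = {1, 16, 32}
KV_REUSE_GROUPS = [
    [2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13], [14, 15],
    [17, 18, 19], [20, 21], [22, 23], [24, 25], [26, 27],
    [28, 29], [30, 31],
]

def kv_group_leader(block_idx):
    """Return the first block of the KV-reuse group containing block_idx,
    or None if the block uses global attention."""
    for group in KV_REUSE_GROUPS:
        if block_idx in group:
            return group[0]
    return None

def forward(x, hymba_block_ga, hymba_block_swa):
    """Run the 32-block forward pass, reusing KV caches within each group.

    hymba_block_ga(x) -> x' and hymba_block_swa(x, kv) -> (x', kv') are
    hypothetical block callables, not the paper's implementation.
    """
    kv_caches = {}  # group leader index -> shared KV cache
    for i in range(1, 33):
        if i in GLOBAL_ATTN_BLOCKS:
            x = hymba_block_ga(x)                  # global attention
        else:
            leader = kv_group_leader(i)
            if i == leader:
                x, kv = hymba_block_swa(x, None)   # first in group: fresh KV
                kv_caches[leader] = kv
            else:
                x, _ = hymba_block_swa(x, kv_caches[leader])  # reuse KV
    return x
```

With this grouping, only the three global-attention blocks and the fourteen group leaders ever materialize a fresh KV cache; the other fifteen blocks read a cache produced earlier, which is the source of the reported cache-size savings.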
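The throughput protocol in the Hardware Specification row (start at batch size 128, halve on OOM until a run fits) can be sketched as follows. `run_benchmark` is a hypothetical callable that raises `MemoryError` on OOM and otherwise returns throughput; the paper's actual measurement uses PyTorch on an A100 at 8k sequence length.

```python
# Hedged sketch of the batch-halving throughput measurement: retry with
# half the batch size on OOM until the benchmark succeeds, so the reported
# number is the maximal achievable throughput without OOM.

def measure_max_throughput(run_benchmark, batch_size=128):
    """Return (largest fitting batch size, its measured throughput)."""
    while batch_size >= 1:
        try:
            return batch_size, run_benchmark(batch_size)
        except MemoryError:
            batch_size //= 2  # OOM: halve and retry
    raise RuntimeError("OOM even at batch size 1")
```

A real PyTorch harness would catch `torch.cuda.OutOfMemoryError` and clear the CUDA cache between attempts; plain `MemoryError` keeps the sketch dependency-free.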
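The WSD schedule described in the Experiment Setup row can be written as a small function: 1% of steps of warmup to the 3e-3 peak, a stable phase at the peak, then a decay to 1e-5 over the final 20% of steps. The linear shapes of the warmup and decay are assumptions for illustration; the excerpt only specifies the phase boundaries and endpoint learning rates.

```python
# Illustrative WSD (warmup-stable-decay) learning-rate schedule, assuming
# linear warmup and linear decay between the endpoints given in the paper.

def wsd_lr(step, total_steps, peak_lr=3e-3, final_lr=1e-5):
    warmup_steps = max(1, int(0.01 * total_steps))   # phase 1: 1% of steps
    decay_steps = int(0.20 * total_steps)            # phase 3: last 20%
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps   # linear warmup (assumed)
    if step < decay_start:
        return peak_lr                               # phase 2: hold the peak
    frac = (step - decay_start + 1) / decay_steps
    return peak_lr + frac * (final_lr - peak_lr)     # linear decay (assumed)
```

Such a function plugs directly into, e.g., `torch.optim.lr_scheduler.LambdaLR` as a multiplicative factor once normalized by the peak rate.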