Hymba: A Hybrid-head Architecture for Small Language Models
Authors: Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Celine Lin, Jan Kautz, Pavlo Molchanov
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive evaluations and ablation studies demonstrate that Hymba not only establishes new state-of-the-art (SOTA) benchmark performance across a wide range of tasks but also achieves greater efficiency compared to transformers and previous hybrid models. We provide the benchmark with other representative small LMs in Fig. 1, with more comprehensive benchmarks in Fig. 6. For instance, in commonsense reasoning tasks, Hymba-1.5B can outperform Llama-3.2-3B with 1.32% higher average accuracy, while requiring an 11.67× smaller cache size and being 3.49× faster. |
| Researcher Affiliation | Collaboration | Xin Dong1 , Yonggan Fu1,2 , Shizhe Diao1, Wonmin Byeon1, Zijia Chen1, Ameya Sunil Mahabaleshwarkar1, Shih-Yang Liu1,3, Matthijs Van Keirsbilck1, Min-Hung Chen1, Yoshi Suhara1, Yingyan (Celine) Lin1,2, Jan Kautz1, Pavlo Molchanov1 1NVIDIA 2Georgia Institute of Technology 3Hong Kong University of Science and Technology |
| Pseudocode | Yes | Algorithm 1: Forward Process of Hymba-1.5B. Input: X = [x_1, ..., x_n] ∈ ℝ^{n×d}, the text input tokens. Model configurations: number of blocks: 32; block indices with global attention: [1, 16, 32]; KV-reuse groups: [2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13], [14, 15], [17, 18, 19], [20, 21], [22, 23], [24, 25], [26, 27], [28, 29], [30, 31] (KV is shared within each group). Model forward: X_0 = [R, X] = [r_1, ..., r_m, x_1, ..., x_n], i.e., prepend m meta tokens R ∈ ℝ^{m×d}. For block i in [1, ..., 32]: if i ∈ [1, 16, 32], then X_i = HymbaBlock-GA(X_{i-1}) (global attention); else if block i is the first block in its KV-reuse group, then X_i, KV_i = HymbaBlock-SWA(X_{i-1}) (sliding-window attention); else retrieve the KV cache KV_{i-1} from the previous layer and compute X_i = HymbaBlock-SWA(X_{i-1}, KV_{i-1}) (reuse KV). |
| Open Source Code | Yes | Models on Hugging Face: Hymba-1.5B-Base | Hymba-1.5B-Instruct |
| Open Datasets | Yes | We train Hymba-125M/350M/1.5B models using a mix of DCLM-Baseline-1.0 (Li et al., 2024), SmolLM-Corpus (Ben Allal et al., 2024), and a proprietary high-quality dataset, with 1T, 250B, and 50B tokens, respectively. |
| Dataset Splits | Yes | The training set of RoleBench is used for training, and the model is evaluated on two sub-tasks: instruction generalization (Inst. Gene.) and role generalization (Role. Gene.). |
| Hardware Specification | Yes | The throughput is measured on an NVIDIA A100 with a sequence length of 8k and a batch size of 128 using PyTorch. For models encountering out-of-memory (OOM) issues during throughput measurement, we halve the batch size until the OOM is resolved. This approach is used to measure the maximal achievable throughput without OOM. |
| Software Dependencies | No | The throughput is measured on an NVIDIA A100 with a sequence length of 8k and a batch size of 128 using PyTorch. ... We implement the finetuning and DPO training with the LMFlow toolkit (Diao et al., 2024). |
| Experiment Setup | Yes | We adopt the WSD learning rate scheduler (Hu et al., 2024) with three phases: (1) warmup steps set to 1% of the total steps, (2) a stable phase maintaining the peak learning rate of 3e-3, and (3) a decay phase reducing the learning rate to 1e-5 over 20% of the total steps, while gradually annealing to smaller, higher-quality datasets like SmolLM-Corpus and the internal dataset. We use a sequence length of 2K and a batch size of 2M tokens throughout the training process, which is conducted on 128 NVIDIA A100 GPUs. |
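The block layout in Algorithm 1 can be sketched in Python. This is a minimal illustration of the control flow only: the block indices and KV-reuse groups are taken verbatim from the paper's Algorithm 1, while `hymba_block_ga`/`hymba_block_swa` are hypothetical stand-ins for the real Hymba blocks.

```python
# Sketch of Algorithm 1's control flow: 32 blocks, three with global
# attention; the remaining sliding-window-attention blocks share one KV
# cache per group, produced by the first block of the group.

GLOBAL_ATTN_BLOCKS = {1, 16, 32}
KV_REUSE_GROUPS = [
    [2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13], [14, 15],
    [17, 18, 19], [20, 21], [22, 23], [24, 25], [26, 27],
    [28, 29], [30, 31],
]

def kv_group_leader(block_idx):
    """Return the first block of the KV-reuse group containing block_idx,
    or None if the block uses global attention."""
    for group in KV_REUSE_GROUPS:
        if block_idx in group:
            return group[0]
    return None

def forward(x, hymba_block_ga, hymba_block_swa):
    """Run the 32-block forward pass, reusing KV caches within each group.

    hymba_block_ga(x) -> x' and hymba_block_swa(x, kv) -> (x', kv') are
    hypothetical block callables, not the paper's implementation.
    """
    kv_caches = {}  # group leader index -> shared KV cache
    for i in range(1, 33):
        if i in GLOBAL_ATTN_BLOCKS:
            x = hymba_block_ga(x)                  # global attention
        else:
            leader = kv_group_leader(i)
            if i == leader:
                x, kv = hymba_block_swa(x, None)   # first in group: fresh KV
                kv_caches[leader] = kv
            else:
                x, _ = hymba_block_swa(x, kv_caches[leader])  # reuse KV
    return x
```

With this grouping, only the three global-attention blocks and the fourteen group leaders ever materialize a fresh KV cache; the other fifteen blocks read a cache produced earlier, which is the source of the reported cache-size savings.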
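The throughput protocol in the Hardware Specification row (start at batch size 128, halve on OOM until a run fits) can be sketched as follows. `run_benchmark` is a hypothetical callable that raises `MemoryError` on OOM and otherwise returns throughput; the paper's actual measurement uses PyTorch on an A100 at 8k sequence length.

```python
# Hedged sketch of the batch-halving throughput measurement: retry with
# half the batch size on OOM until the benchmark succeeds, so the reported
# number is the maximal achievable throughput without OOM.

def measure_max_throughput(run_benchmark, batch_size=128):
    """Return (largest fitting batch size, its measured throughput)."""
    while batch_size >= 1:
        try:
            return batch_size, run_benchmark(batch_size)
        except MemoryError:
            batch_size //= 2  # OOM: halve and retry
    raise RuntimeError("OOM even at batch size 1")
```

A real PyTorch harness would catch `torch.cuda.OutOfMemoryError` and clear the CUDA cache between attempts; plain `MemoryError` keeps the sketch dependency-free.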
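The WSD schedule described in the Experiment Setup row can be written as a small function: 1% of steps of warmup to the 3e-3 peak, a stable phase at the peak, then a decay to 1e-5 over the final 20% of steps. The linear shapes of the warmup and decay are assumptions for illustration; the excerpt only specifies the phase boundaries and endpoint learning rates.

```python
# Illustrative WSD (warmup-stable-decay) learning-rate schedule, assuming
# linear warmup and linear decay between the endpoints given in the paper.

def wsd_lr(step, total_steps, peak_lr=3e-3, final_lr=1e-5):
    warmup_steps = max(1, int(0.01 * total_steps))   # phase 1: 1% of steps
    decay_steps = int(0.20 * total_steps)            # phase 3: last 20%
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps   # linear warmup (assumed)
    if step < decay_start:
        return peak_lr                               # phase 2: hold the peak
    frac = (step - decay_start + 1) / decay_steps
    return peak_lr + frac * (final_lr - peak_lr)     # linear decay (assumed)
```

Such a function plugs directly into, e.g., `torch.optim.lr_scheduler.LambdaLR` as a multiplicative factor once normalized by the peak rate.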