Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions
Authors: Zhihao He, Hang Yu, Zi Gong, Shizhan Liu, Jianguo Li, Weiyao Lin
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that Rodimus+-1.6B, trained on 1 trillion tokens, achieves superior downstream performance against models trained on more tokens, including Qwen2-1.5B and RWKV6-1.6B, underscoring its potential to redefine the accuracy-efficiency balance in LLMs. We validate the effectiveness of Rodimus* (encompassing both Rodimus and Rodimus+) through thorough experimentation. In this section, we compare Rodimus* with other SOTA methods across various benchmarks, including language modeling and recall benchmarks. ... We conclude this section with a series of ablation studies. |
| Researcher Affiliation | Collaboration | Zhihao He (1,2), Hang Yu (2), Zi Gong (2), Shizhan Liu (2), Jianguo Li (2), Weiyao Lin (1); 1: Shanghai Jiao Tong University, 2: Ant Group |
| Pseudocode | No | No specific pseudocode or algorithm blocks are explicitly labeled in the paper. The paper focuses on mathematical formulations and experimental results rather than presenting a structured algorithmic description. |
| Open Source Code | Yes | Model code and pre-trained checkpoints are open-sourced at https://github.com/codefuse-ai/rodimus. |
| Open Datasets | Yes | The WikiText-103 language modeling dataset (Merity et al., 2022) consists of over 100 million tokens... All models are trained on subsets of Pile (Gao et al., 2020)... We also utilize curated datasets, including FineWeb (Penedo et al., 2024), Pile (Gao et al., 2020), etc., to train our Rodimus* on 150B tokens. These tasks include content analysis (LAMBADA (Paperno et al., 2016)), commonsense reasoning (PiQA (Bisk et al., 2020) and HellaSwag (Zellers et al., 2019)), coreference resolution (WinoGrande (Sakaguchi et al., 2019)), reading comprehension (OpenBookQA (Mihaylov et al., 2018)), and professional examinations (ARC-Easy and ARC-Challenge (Clark et al., 2018)). The MQAR task (Arora et al., 2024a)... The NeedleBench task (Li et al., 2024)... |
| Dataset Splits | No | The paper refers to using standard datasets such as WikiText-103 and Pile, and evaluates using 'Test PPL' and 'Valid PPL', implying the use of standard training, validation, and test splits. For example, Section D.1 states, 'For the models trained on WikiText-103, ... All models are trained from scratch on the training dataset...'. However, it does not explicitly provide specific percentages, sample counts, or a detailed methodology for these splits within the main text. |
| Hardware Specification | No | The paper does not explicitly provide specific hardware details (like GPU/CPU models, processor types, or memory amounts) used for running its experiments. While it mentions 'GPU memory usage' in relation to batch size adjustment, it lacks specific hardware models or configurations. |
| Software Dependencies | No | The paper mentions several software components, including 'FSDP (Zhao et al., 2023)', 'Hugging Face's byte-level BPE algorithm', 'lm-evaluation-harness (Gao et al., 2023)', and 'Flash Linear Attention' (with a GitHub link provided in a footnote). However, it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | The training settings are listed in Table 5. For the models trained on WikiText-103, the number of layers and dimensions are set to 6 and 512, respectively. ... Other experimental configurations are also the same across all models, including batch size, learning rate, and training iterations (see Appendix D.1). Table 5 provides: Tokenizer method BPE, Vocab size 50265, Sequence length 512, Total batch size 256, Training steps 25000, Warmup steps 2000, Peak learning rate 5e-4, Min learning rate 1e-5, Optimizer AdamW, Adam β1 0.9, Adam β2 0.95, Weight decay 0.1, Gradient clipping 1.0. Additionally, Table 8 details configurations for scaling experiments, including parameters like n_layer, d_model, n_heads/d_head, Steps, Learning Rate, Batch Size, and Tokens. Table 9 specifies extra training settings for downstream tasks, such as Training context, Batch size, Max learning rate, and Min learning rate. |
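The Table 5 settings quoted above (2000 warmup steps, 25000 total steps, peak learning rate 5e-4, minimum 1e-5) can be sketched as a learning-rate schedule. Note the decay shape is an assumption for illustration: the quoted table lists only the peak and minimum rates, not the decay curve, so a cosine decay is used here as a common default.

```python
import math

# Hyperparameters quoted from Table 5 of the paper.
WARMUP_STEPS = 2000
TOTAL_STEPS = 25000
PEAK_LR = 5e-4   # peak learning rate
MIN_LR = 1e-5    # minimum learning rate

def lr_at(step: int) -> float:
    """Learning rate at a given training step: linear warmup to the
    peak, then decay to the minimum. The cosine decay shape is an
    assumption; Table 5 specifies only the peak and minimum rates."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to PEAK_LR over the first 2000 steps.
        return PEAK_LR * step / WARMUP_STEPS
    # Cosine decay from PEAK_LR down to MIN_LR over the remaining steps.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))
```

Pairing this schedule with AdamW (β1 = 0.9, β2 = 0.95, weight decay 0.1) and gradient clipping at 1.0 would match the remaining optimizer settings quoted from Table 5.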