Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions

Authors: Zhihao He, Hang Yu, Zi Gong, Shizhan Liu, Jianguo Li, Weiyao Lin

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that Rodimus+-1.6B, trained on 1 trillion tokens, achieves superior downstream performance against models trained on more tokens, including Qwen2-1.5B and RWKV6-1.6B, underscoring its potential to redefine the accuracy-efficiency balance in LLMs. We validate the effectiveness of Rodimus* (encompassing both Rodimus and Rodimus+) through thorough experimentation. In this section, we compare Rodimus* with other SOTA methods across various benchmarks, including language modeling and recall benchmarks. ... We conclude this section with a series of ablation studies.
Researcher Affiliation | Collaboration | Zhihao He (1,2), Hang Yu (2), Zi Gong (2), Shizhan Liu (2), Jianguo Li (2), Weiyao Lin (1); 1: Shanghai Jiao Tong University, 2: Ant Group
Pseudocode | No | No pseudocode or algorithm blocks are explicitly labeled in the paper. The paper focuses on mathematical formulations and experimental results rather than presenting a structured algorithmic description.
Open Source Code | Yes | Model code and pre-trained checkpoints are open-sourced at https://github.com/codefuse-ai/rodimus.
Open Datasets | Yes | The WikiText-103 language modeling dataset (Merity et al., 2022) consists of over 100 million tokens... All models are trained on subsets of Pile (Gao et al., 2020)... We also utilize curated datasets, including FineWeb (Penedo et al., 2024), Pile (Gao et al., 2020), etc., to train our Rodimus* on 150B tokens. These tasks include content analysis (LAMBADA (Paperno et al., 2016)), commonsense reasoning (PiQA (Bisk et al., 2020) and HellaSwag (Zellers et al., 2019)), coreference resolution (WinoGrande (Sakaguchi et al., 2019)), reading comprehension (OpenBookQA (Mihaylov et al., 2018)), and professional examinations (ARC-Easy and ARC-Challenge (Clark et al., 2018)). The MQAR task (Arora et al., 2024a)... The NeedleBench (Li et al., 2024)...
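The downstream benchmarks named above correspond to standard task identifiers in lm-evaluation-harness, which the paper reports using. As a sketch, the mapping below uses commonly seen harness task names; the exact identifiers are an assumption, not taken from the paper.

```python
# Hypothetical mapping of the paper's downstream benchmark categories to
# lm-evaluation-harness task identifiers (names assumed, not from the paper).
BENCHMARK_TASKS = {
    "content analysis": ["lambada_openai"],
    "commonsense reasoning": ["piqa", "hellaswag"],
    "coreference resolution": ["winogrande"],
    "reading comprehension": ["openbookqa"],
    "professional examinations": ["arc_easy", "arc_challenge"],
}

# Flatten to the comma-separated string a typical `lm_eval --tasks` flag expects.
task_arg = ",".join(t for tasks in BENCHMARK_TASKS.values() for t in tasks)
print(task_arg)
```

Dict insertion order is preserved (Python 3.7+), so the flattened string follows the category order above.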
Dataset Splits | No | The paper refers to standard datasets such as WikiText-103 and Pile and evaluates with 'Test PPL' and 'Valid PPL', implying standard training, validation, and test splits. For example, Section D.1 states, 'For the models trained on WikiText-103, ... All models are trained from scratch on the training dataset...'. However, the main text does not explicitly provide split percentages, sample counts, or a detailed splitting methodology.
Hardware Specification | No | The paper does not explicitly state the hardware used for its experiments (GPU/CPU models, processor types, or memory amounts). While it mentions 'GPU memory usage' in relation to batch size adjustment, it gives no specific hardware models or configurations.
Software Dependencies | No | The paper mentions several software components, including 'FSDP (Zhao et al., 2023)', 'Hugging Face's byte-level BPE algorithm', 'lm-evaluation-harness (Gao et al., 2023)', and 'Flash Linear Attention' (with a GitHub link provided in a footnote). However, it does not provide specific version numbers for any of these software dependencies.
Experiment Setup | Yes | The training settings are listed in Table 5. For the models trained on WikiText-103, the number of layers and dimensions are set to 6 and 512, respectively. ... Other experimental configurations are also the same across all models, including batch size, learning rate, and training iterations (see Appendix D.1). Table 5 provides: Tokenizer method BPE, Vocab size 50265, Sequence length 512, Total batch size 256, Training steps 25000, Warmup steps 2000, Peak learning rate 5e-4, Min learning rate 1e-5, Optimizer AdamW, Adam β1 0.9, Adam β2 0.95, Weight decay 0.1, Gradient clipping 1.0. Additionally, Table 8 details configurations for scaling experiments, including parameters like n_layer, d_model, n_heads/d_head, Steps, Learning Rate, Batch Size, and Tokens. Table 9 specifies extra training settings for downstream tasks, such as Training context, Batch size, Max learning rate, and Min learning rate.
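The Table 5 settings quoted above can be collected into a single configuration sketch. The values are transcribed from the paper; the key names are illustrative, not the authors' actual config schema, and the token count assumes the batch size counts sequences.

```python
# Sketch of the WikiText-103 training configuration reported in Table 5
# (key names are hypothetical; values are as quoted from the paper).
WIKITEXT103_CONFIG = {
    "n_layer": 6,
    "d_model": 512,
    "tokenizer": "BPE",
    "vocab_size": 50265,
    "sequence_length": 512,
    "total_batch_size": 256,
    "training_steps": 25000,
    "warmup_steps": 2000,
    "peak_learning_rate": 5e-4,
    "min_learning_rate": 1e-5,
    "optimizer": "AdamW",
    "adam_beta1": 0.9,
    "adam_beta2": 0.95,
    "weight_decay": 0.1,
    "gradient_clipping": 1.0,
}

# Tokens seen in training = steps * batch size * sequence length,
# assuming the batch size is measured in sequences.
tokens_seen = (WIKITEXT103_CONFIG["training_steps"]
               * WIKITEXT103_CONFIG["total_batch_size"]
               * WIKITEXT103_CONFIG["sequence_length"])
print(tokens_seen)  # 3276800000, i.e. ~3.3B tokens
```

Under that assumption, the WikiText-103 runs see roughly 3.3B tokens, consistent with training from scratch on a ~100M-token corpus for many epochs.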