Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions
Authors: Zhihao He, Hang Yu, Zi Gong, Shizhan Liu, Jianguo Li, Weiyao Lin
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that Rodimus+-1.6B, trained on 1 trillion tokens, achieves superior downstream performance against models trained on more tokens, including Qwen2-1.5B and RWKV6-1.6B, underscoring its potential to redefine the accuracy-efficiency balance in LLMs. We validate the effectiveness of Rodimus* (encompassing both Rodimus and Rodimus+) through thorough experimentation. In this section, we compare Rodimus* with other SOTA methods across various benchmarks, including language modeling and recall benchmarks. ... We conclude this section with a series of ablation studies. |
| Researcher Affiliation | Collaboration | Zhihao He (1,2), Hang Yu (2), Zi Gong (2), Shizhan Liu (2), Jianguo Li (2), Weiyao Lin (1); 1: Shanghai Jiao Tong University, 2: Ant Group |
| Pseudocode | No | No specific pseudocode or algorithm blocks are explicitly labeled in the paper. The paper focuses on mathematical formulations and experimental results rather than presenting a structured algorithmic description. |
| Open Source Code | Yes | Model code and pre-trained checkpoints are open-sourced at https://github.com/codefuse-ai/rodimus. |
| Open Datasets | Yes | The WikiText-103 language modeling dataset (Merity et al., 2022) consists of over 100 million tokens... All models are trained on subsets of Pile (Gao et al., 2020)... We also utilize curated datasets, including FineWeb (Penedo et al., 2024), Pile (Gao et al., 2020), etc., to train our Rodimus* on 150B tokens. These tasks include content analysis (LAMBADA (Paperno et al., 2016)), commonsense reasoning (PiQA (Bisk et al., 2020) and HellaSwag (Zellers et al., 2019)), coreference resolution (WinoGrande (Sakaguchi et al., 2019)), reading comprehension (OpenBookQA (Mihaylov et al., 2018)), and professional examinations (ARC-Easy and ARC-Challenge (Clark et al., 2018)). The MQAR task (Arora et al., 2024a)... The NeedleBench task (Li et al., 2024)... |
| Dataset Splits | No | The paper refers to using standard datasets such as WikiText-103 and Pile, and evaluates using 'Test PPL' and 'Valid PPL', implying the use of standard training, validation, and test splits. For example, Section D.1 states, 'For the models trained on WikiText-103, ... All models are trained from scratch on the training dataset...'. However, it does not explicitly provide specific percentages, sample counts, or a detailed methodology for these splits within the main text. |
| Hardware Specification | No | The paper does not explicitly provide specific hardware details (like GPU/CPU models, processor types, or memory amounts) used for running its experiments. While it mentions 'GPU memory usage' in relation to batch size adjustment, it lacks specific hardware models or configurations. |
| Software Dependencies | No | The paper mentions several software components, including 'FSDP (Zhao et al., 2023)', 'Hugging Face's byte-level BPE algorithm', 'lm-evaluation-harness (Gao et al., 2023)', and 'Flash Linear Attention' (with a GitHub link provided in a footnote). However, it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | The training settings are listed in Table 5. For the models trained on WikiText-103, the number of layers and dimensions are set to 6 and 512, respectively. ... Other experimental configurations are also the same across all models, including batch size, learning rate, and training iterations (see Appendix D.1). Table 5 provides: Tokenizer method BPE, Vocab size 50265, Sequence length 512, Total batch size 256, Training steps 25000, Warmup steps 2000, Peak learning rate 5e-4, Min learning rate 1e-5, Optimizer AdamW, Adam β1 0.9, Adam β2 0.95, Weight decay 0.1, Gradient clipping 1.0. Additionally, Table 8 details configurations for scaling experiments, including parameters like n_layer, d_model, n_heads/d_head, Steps, Learning Rate, Batch Size, and Tokens. Table 9 specifies extra training settings for downstream tasks, such as Training context, Batch size, Max learning rate, and Min learning rate. |
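The Table 5 settings quoted above (2000 warmup steps, 25000 total steps, peak learning rate 5e-4, minimum 1e-5) can be sketched as a learning-rate schedule. Note the decay shape is an assumption for illustration: the quoted table lists only the peak and minimum rates, not the decay curve, so a cosine decay is used here as a common default.

```python
import math

# Hyperparameters quoted from Table 5 of the paper.
WARMUP_STEPS = 2000
TOTAL_STEPS = 25000
PEAK_LR = 5e-4   # peak learning rate
MIN_LR = 1e-5    # minimum learning rate

def lr_at(step: int) -> float:
    """Learning rate at a given training step: linear warmup to the
    peak, then decay to the minimum. The cosine decay shape is an
    assumption; Table 5 specifies only the peak and minimum rates."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to PEAK_LR over the first 2000 steps.
        return PEAK_LR * step / WARMUP_STEPS
    # Cosine decay from PEAK_LR down to MIN_LR over the remaining steps.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))
```

Pairing this schedule with AdamW (β1 = 0.9, β2 = 0.95, weight decay 0.1) and gradient clipping at 1.0 would match the remaining optimizer settings quoted from Table 5.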