Mixture of Hidden-Dimensions: Not All Hidden-States’ Dimensions are Needed in Transformer
Authors: Yilong Chen, Junyuan Shang, Zhenyu Zhang, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To demonstrate the effectiveness of MOHD, we pretrain Vanilla Transformers with 355M, 495M, and 1.13B parameters based on the LLaMA architecture (Touvron et al., 2023b), and their MOHD versions in both hidden dimension compression and expansion settings with scaling factors of 50%, 75%, 2×, 3×, and 4×. We evaluated these models on 10 NLP tasks, showing the advantages of the MOHD architecture. Results indicate that MOHD consistently outperforms Vanilla and Mixture of Experts Transformers with the same activated parameters across all model sizes. In the compression setting, MOHD reduces activated parameters by 50% while retaining 99% of original performance. In the expansion setting, MOHD keeps activated parameters constant while expanding hidden dimensions 4×, achieving up to an 8.37% relative performance improvement. |
| Researcher Affiliation | Collaboration | ¹Institute of Information Engineering, Chinese Academy of Sciences; ²School of Cyber Security, University of Chinese Academy of Sciences; ³Baidu Inc. Correspondence to: Tingwen Liu <EMAIL>. Project lead: Junyuan Shang <EMAIL>. |
| Pseudocode | No | The paper provides detailed mathematical formulations and equations (Equations 1 through 18) for the proposed MOHD architecture and its components (sparsified FFN and Attention), as well as theoretical proofs. However, it does not include a distinct section or block explicitly labeled "Pseudocode" or "Algorithm" with structured, step-by-step instructions in a code-like format. |
| Open Source Code | No | The paper states: "Our experimental framework utilizes the Sheared-LLaMA codebase (Xia et al., 2023) implemented on the Composer package (Team, 2021)". This refers to external tools and frameworks used by the authors, not a direct release of their own source code for the MOHD methodology. There is no explicit statement from the authors about making their code publicly available, nor a link to a repository for the work described in this paper. |
| Open Datasets | Yes | Data. To pretrain MOHD models and baseline models, we employ RedPajama (Together AI, 2023), which parallels the LLaMA training data across seven domains: CommonCrawl, C4, GitHub, Wikipedia, Books, ArXiv, and StackExchange. This dataset comprises a validation set with 2 million tokens and a training set containing 50 billion tokens. Evaluation. We employed the lm-evaluation-harness (Gao et al., 2021) to evaluate our models. For common sense and reading comprehension tasks, we report 0-shot accuracy results for SciQ (Welbl et al., 2017), PIQA (Bisk et al., 2020), WinoGrande (WG) (Sakaguchi et al., 2020), ARC Easy (ARC-E) (Clark et al., 2018b), and 10-shot HellaSwag (Hella.) (Zellers et al., 2019), alongside 25-shot accuracy for ARC Challenge (ARC-C) (Clark et al., 2018a). In the assessments of continued QA and text understanding, we report 0-shot accuracy for LogiQA (Liu et al., 2020), 32-shot BoolQ (Clark et al., 2019), and 0-shot LAMBADA (Lam.) (Paperno et al., 2016). |
| Dataset Splits | Yes | Data. To pretrain MOHD models and baseline models, we employ RedPajama (Together AI, 2023), which parallels the LLaMA training data across seven domains: CommonCrawl, C4, GitHub, Wikipedia, Books, ArXiv, and StackExchange. This dataset comprises a validation set with 2 million tokens and a training set containing 50 billion tokens. Evaluation. We employed the lm-evaluation-harness (Gao et al., 2021) to evaluate our models. For common sense and reading comprehension tasks, we report 0-shot accuracy results for SciQ (Welbl et al., 2017), PIQA (Bisk et al., 2020), WinoGrande (WG) (Sakaguchi et al., 2020), ARC Easy (ARC-E) (Clark et al., 2018b), and 10-shot HellaSwag (Hella.) (Zellers et al., 2019), alongside 25-shot accuracy for ARC Challenge (ARC-C) (Clark et al., 2018a). In the assessments of continued QA and text understanding, we report 0-shot accuracy for LogiQA (Liu et al., 2020), 32-shot BoolQ (Clark et al., 2019), and 0-shot LAMBADA (Lam.) (Paperno et al., 2016). |
| Hardware Specification | Yes | Our experimental framework utilizes the Sheared-LLaMA codebase (Xia et al., 2023) implemented on the Composer package (Team, 2021), and is executed on 8 NVIDIA A100 GPUs (80GB). |
| Software Dependencies | No | Our experimental framework utilizes the Sheared-LLaMA codebase (Xia et al., 2023) implemented on the Composer package (Team, 2021). While the Composer package is mentioned, no specific version number is provided. Other common software dependencies such as Python, PyTorch, or CUDA versions are not specified. |
| Experiment Setup | Yes | Training. Our experimental framework utilizes the Sheared-LLaMA codebase (Xia et al., 2023) implemented on the Composer package (Team, 2021), and is executed on 8 NVIDIA A100 GPUs (80GB). The models are trained with a sequence length of 4096, employing a global batch size of 256. MOHD models are trained for 50,000 steps (50B token budget). The learning rate was set to 3e-4 for all parameters. The baselines and all MOHD models follow the same training setup, starting from random initialization and training on the same amount of data. |
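As a quick sanity check on the quoted training setup, the reported sequence length, global batch size, and step count can be multiplied out to verify the stated token budget. This is a minimal sketch using only the values quoted above:

```python
# Values quoted from the paper's training setup.
seq_len = 4096            # sequence length
global_batch_size = 256   # sequences per optimizer step
steps = 50_000            # training steps

tokens_per_step = seq_len * global_batch_size  # 1,048,576 tokens per step
total_tokens = tokens_per_step * steps         # total tokens seen in training

print(f"{total_tokens / 1e9:.1f}B tokens")     # prints "52.4B tokens"
```

The exact product (~52.4B tokens) slightly exceeds the stated 50B budget, which is consistent with the paper quoting a rounded figure.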