Mixture of Hidden-Dimensions: Not All Hidden-States’ Dimensions are Needed in Transformer
Authors: Yilong Chen, Junyuan Shang, Zhenyu Zhang, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To demonstrate the effectiveness of MOHD, we pretrain Vanilla Transformers with 355M, 495M, and 1.13B parameters based on the LLaMA architecture (Touvron et al., 2023b), and their MOHD versions in both hidden dimension compression and expansion settings with scaling factors of 50%, 75%, 2×, 3×, and 4×. We evaluated these models on 10 NLP tasks, showing the advantages of the MOHD architecture. Results indicate that MOHD consistently outperforms Vanilla and Mixture of Experts Transformers with the same activated parameters across all model sizes. In the compression setting, MOHD reduces activated parameters by 50% while retaining 99% of original performance. In the expansion setting, MOHD keeps activated parameters constant while expanding hidden dimensions 4×, achieving up to an 8.37% relative performance improvement. |
| Researcher Affiliation | Collaboration | ¹Institute of Information Engineering, Chinese Academy of Sciences; ²School of Cyber Security, University of Chinese Academy of Sciences; ³Baidu Inc. Correspondence to: Tingwen Liu <EMAIL>. Project lead: Junyuan Shang <EMAIL>. |
| Pseudocode | No | The paper provides detailed mathematical formulations and equations (Equations 1 through 18) for the proposed MOHD architecture and its components (sparsified FFN and Attention), as well as theoretical proofs. However, it does not include a distinct section or block explicitly labeled "Pseudocode" or "Algorithm" with structured, step-by-step instructions in a code-like format. |
| Open Source Code | No | The paper states: "Our experimental framework utilizes the Sheared-LLaMA codebase (Xia et al., 2023) implemented on the Composer package (Team, 2021)". This refers to external tools and frameworks used by the authors, not a direct release of their own source code for the MOHD methodology. There is no explicit statement from the authors about making their code publicly available, nor a link to a repository for the work described in this paper. |
| Open Datasets | Yes | Data. To pretrain MOHD models and baseline models, we employ RedPajama (Together AI, 2023), which parallels the LLaMA training data across seven domains: CommonCrawl, C4, GitHub, Wikipedia, Books, ArXiv, and StackExchange. This dataset comprises a validation set with 2 million tokens and a training set containing 50 billion tokens. Evaluation. We employed the lm-evaluation-harness (Gao et al., 2021) to evaluate our models. For common sense and reading comprehension tasks, we report 0-shot accuracy results for SciQ (Welbl et al., 2017), PIQA (Bisk et al., 2020), WinoGrande (WG) (Sakaguchi et al., 2020), ARC Easy (ARC-E) (Clark et al., 2018b), and 10-shot HellaSwag (Hella.) (Zellers et al., 2019), alongside 25-shot accuracy for ARC Challenge (ARC-C) (Clark et al., 2018a). In the assessments of continued QA and text understanding, we report 0-shot accuracy for LogiQA (Liu et al., 2020), 32-shot BoolQ (Clark et al., 2019), and 0-shot LAMBADA (Lam.) (Paperno et al., 2016). |
| Dataset Splits | Yes | Data. To pretrain MOHD models and baseline models, we employ RedPajama (Together AI, 2023), which parallels the LLaMA training data across seven domains: CommonCrawl, C4, GitHub, Wikipedia, Books, ArXiv, and StackExchange. This dataset comprises a validation set with 2 million tokens and a training set containing 50 billion tokens. Evaluation. We employed the lm-evaluation-harness (Gao et al., 2021) to evaluate our models. For common sense and reading comprehension tasks, we report 0-shot accuracy results for SciQ (Welbl et al., 2017), PIQA (Bisk et al., 2020), WinoGrande (WG) (Sakaguchi et al., 2020), ARC Easy (ARC-E) (Clark et al., 2018b), and 10-shot HellaSwag (Hella.) (Zellers et al., 2019), alongside 25-shot accuracy for ARC Challenge (ARC-C) (Clark et al., 2018a). In the assessments of continued QA and text understanding, we report 0-shot accuracy for LogiQA (Liu et al., 2020), 32-shot BoolQ (Clark et al., 2019), and 0-shot LAMBADA (Lam.) (Paperno et al., 2016). |
| Hardware Specification | Yes | Our experimental framework utilizes the Sheared-LLaMA codebase (Xia et al., 2023) implemented on the Composer package (Team, 2021), and is executed on 8 NVIDIA A100 GPUs (80GB). |
| Software Dependencies | No | Our experimental framework utilizes the Sheared-LLaMA codebase (Xia et al., 2023) implemented on the Composer package (Team, 2021). While the Composer package is mentioned, no specific version number is provided. Other common software dependencies such as Python, PyTorch, or CUDA versions are not specified. |
| Experiment Setup | Yes | Training. Our experimental framework utilizes the Sheared-LLaMA codebase (Xia et al., 2023) implemented on the Composer package (Team, 2021), and is executed on 8 NVIDIA A100 GPUs (80GB). The models are trained with a sequence length of 4096, employing a global batch size of 256. MOHD models are trained for 50,000 steps (50B token budget). The learning rate was set to 3e-4 for all parameters. The baselines and all MOHD models follow the same training setup, starting from random initialization and training on the same amount of data. |
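As a quick sanity check on the quoted training setup, the reported sequence length, global batch size, and step count can be multiplied out to verify the stated token budget. This is a minimal sketch using only the values quoted above:

```python
# Values quoted from the paper's training setup.
seq_len = 4096            # sequence length
global_batch_size = 256   # sequences per optimizer step
steps = 50_000            # training steps

tokens_per_step = seq_len * global_batch_size  # 1,048,576 tokens per step
total_tokens = tokens_per_step * steps         # total tokens seen in training

print(f"{total_tokens / 1e9:.1f}B tokens")     # prints "52.4B tokens"
```

The exact product (~52.4B tokens) slightly exceeds the stated 50B budget, which is consistent with the paper quoting a rounded figure.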