A Stronger Mixture of Low-Rank Experts for Fine-Tuning Foundation Models
Authors: Mengyang Sun, Yihao Wang, Tao Feng, Dan Zhang, Yifan Zhu, Jie Tang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a series of comparative experiments to evaluate the performances of MoE-LoRA across various downstream tasks (Zhang et al., 2024b; Luo et al., 2025b) including Question Answering, the GLUE Benchmark, and the Vision-Language task. ... We implement and examine our rescaling approach for MoE-LoRA under a series of foundation models, illustrating our effectiveness across various tasks. ... Finally, to lend support to our theoretical foundation, we conduct an ablation study by assessing our forwarding revisions only under a classic optimizer without Riemannian preconditioners support. |
| Researcher Affiliation | Academia | 1Department of Computer Science and Technology, Tsinghua University, Beijing, China; 2Computer School, Beijing Information Science and Technology University, Beijing, China; 3School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China. |
| Pseudocode | Yes | Algorithm 1: Engineering Alternative Solution of Gate-based Rescaling Method. `def forward(self, x, ...): ... # compute gate values` `gvs = ...` `# execute each activated expert` `for exp_id in activated_experts:` `A = self.As[exp_id]` `B = self.Bs[exp_id]` `gv = gvs[:, :, exp_id]` `exp_out = B(A(x))` `sqrt_gv = (gv**0.5).detach()  # update 1` `w_exp_out = sqrt_gv*exp_out + (gv - sqrt_gv)*exp_out.detach()  # update 2` `result = result + w_exp_out` ... |
| Open Source Code | Yes | Source code is available at https://github.com/THUDM/MoELoRA_Riemannian. |
| Open Datasets | Yes | We evaluate our proposed method on several question-answering benchmarks, including ScienceQA (Lu et al., 2022), CommonsenseQA (Talmor et al., 2019), OpenBookQA (Mihaylov et al., 2018) and SIQA (Sap et al., 2019). ... GLUE (Wang et al., 2019) ... For evaluation, Visual7W (Zhu et al., 2016) and VMCBench (Zhang et al., 2025b) datasets are employed. |
| Dataset Splits | Yes | For VMCBench, we only use their dev set since their test set is not labeled. We take 900 of all the 1,000 labeled samples as training samples, while the remaining 100 are held out for evaluation. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | For most experiments, unless otherwise specified, we construct a mixture of LoRA modules with a total of 20 experts, a rank of 4 for each expert, and a selection of top-10 experts activated each time. ... During training, we follow a linear decay learning-rate scheduler. We assign a relatively smaller learning rate to the gate module compared to other trainable components, to achieve a stable training behavior. ... Table 12. Default experimental details implemented throughout this paper. All experiments follow this configuration unless they specify their particular settings... |
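The gate-based rescaling in the quoted Algorithm 1 can be seen more clearly outside the table. A minimal NumPy sketch is given below; it is not the authors' implementation (which is PyTorch, per the repository above), and the dimensions, variable names, and `moe_lora_forward` helper are hypothetical. The key identity it illustrates: the two terms `sqrt(gv)*out + (gv - sqrt(gv))*out.detach()` sum to `gv*out` in the forward pass, so the rescaling changes only the gradient path, not the output value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions for illustration only (the paper's defaults are
# 20 experts of rank 4 with top-10 routing; we shrink all of that here):
d, r, E = 8, 4, 3                  # model dim, LoRA rank, number of experts
As = [rng.standard_normal((r, d)) * 0.1 for _ in range(E)]
Bs = [rng.standard_normal((d, r)) * 0.1 for _ in range(E)]
x = rng.standard_normal((5, d))    # a batch of 5 token embeddings

def moe_lora_forward(x, gvs, activated):
    """Sum of gate-weighted LoRA expert outputs: result += gv * B(A(x)).

    Algorithm 1 instead emits sqrt(gv)*out + (gv - sqrt(gv))*out.detach().
    Those two branches add back up to gv*out, so this function reproduces
    the forward value; the detach trick matters only for backpropagation,
    where it makes the expert's gradient scale with sqrt(gv) instead of gv.
    """
    result = np.zeros_like(x)
    for e in activated:
        exp_out = x @ As[e].T @ Bs[e].T   # B(A(x)) for expert e
        gv = gvs[:, e:e + 1]              # per-token gate value for expert e
        result += gv * exp_out
    return result
```

As a sanity check, for any gate value `gv` and expert output `out`, `np.sqrt(gv)*out + (gv - np.sqrt(gv))*out` equals `gv*out` exactly, confirming that the rescaled forward pass is numerically unchanged.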