R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts

Authors: Zhongyang Li, Ziyue Li, Tianyi Zhou

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental R2-T2 consistently and significantly improves state-of-the-art LMMs' performance on challenging multimodal benchmarks of diverse tasks, without training any parameters in the base model. Our code can be accessed here. ... Mixture-of-Experts (MoE) has achieved remarkable success in scaling up the size and capacity of large language and multimodal models (LLMs and LMMs) (Shazeer et al., 2017) without (significantly) increasing the inference cost. Specifically, it allows us to increase the total number of experts ...
Researcher Affiliation Academia 1Department of Computer Science, Johns Hopkins University, Baltimore, USA 2Department of Computer Science, University of Maryland, College Park, USA. Correspondence to: Tianyi Zhou <EMAIL>.
Pseudocode No The paper describes methods using mathematical equations and prose in Section 3 and its subsections (Gradient Descent, Kernel Regression, Mode Finding) and illustrates them with Figure 3, but does not present any structured pseudocode or algorithm blocks.
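Since the paper gives no pseudocode, a minimal sketch of one of the named strategies, kernel regression, may help readers; this is an illustrative reconstruction, not the authors' implementation. The function name, the squared-Euclidean distance, and the renormalization step are assumptions; the paper's own formulation (Section 3) should be consulted for the exact update.

```python
import numpy as np

def kernel_regression_rerouting(x_emb, ref_embs, ref_routing, k=5, bandwidth=1.0):
    """Hypothetical sketch of kernel-regression re-routing: replace a test
    sample's routing weights with a Gaussian-kernel-weighted average of the
    routing weights of its k nearest reference samples."""
    # squared Euclidean distances from the test embedding to every reference
    d2 = np.sum((ref_embs - x_emb) ** 2, axis=1)
    nn = np.argsort(d2)[:k]                      # indices of the k nearest neighbors
    w = np.exp(-d2[nn] / (2 * bandwidth ** 2))   # Gaussian kernel weights
    w /= w.sum()                                 # normalize kernel weights
    r = w @ ref_routing[nn]                      # weighted average of neighbor routings
    return r / r.sum()                           # renormalize to a distribution
```

With k = 5 and a Gaussian kernel this matches the hyperparameters quoted below, but the bandwidth value is a placeholder.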
Open Source Code Yes Our code can be accessed here.
Open Datasets Yes Table 1 summarizes the reference datasets and evaluation benchmarks, including their dataset sizes. See Appendix B for details. Appendix B Evaluation Benchmarks and Reference Datasets: We conduct evaluations using a diverse set of reference datasets and task-specific benchmarks (Liang et al., 2025). For general visual understanding, we use four reference datasets: VQA-V2 (Goyal et al., 2017), Visual7W (Zhu et al., 2016), CLEVR (Johnson et al., 2017), and COCO-QA (Lu et al., 2016).
Dataset Splits Yes To ensure a balanced evaluation, we randomly sample 5,000 instances from datasets exceeding this size. TQA (Kembhavi et al., 2017): ... The dataset is split into training, validation, and test sets, with no content overlap, ensuring robust evaluation of models' ability to integrate and reason over multimodal information.
Hardware Specification Yes We measure inference latency on RTX A6000 to assess the computational overhead of R2-T2.
Software Dependencies No The text describes methods, models, and hyperparameters (e.g., 'cosine annealing schedule', 'Gaussian kernel', 'NV-Embed-V2 embedding model') but does not specify any programming languages, libraries, or frameworks with version numbers (e.g., 'Python 3.8', 'PyTorch 1.9').
Experiment Setup Yes The selected hyperparameters are as follows: cosine annealing schedule with a learning rate ranging from 1×10⁻² to 1×10⁻⁵, neighborhood selection is performed using kNN with k = 5, the number of NGD steps is fixed at 10, the Gaussian kernel is used for kernel-based methods, and NV-Embed-V2 is adopted as the embedding model. These values are applied uniformly across all evaluated tasks.
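The quoted schedule (cosine annealing from 1×10⁻² to 1×10⁻⁵ over 10 NGD steps) can be written out concretely; the function name and the linear step-to-progress mapping are assumptions for illustration, not taken from the paper.

```python
import math

def cosine_annealed_lr(step, total_steps=10, lr_max=1e-2, lr_min=1e-5):
    """Cosine-annealed learning rate: lr_max at step 0, decaying along a
    half cosine to lr_min at the final step (total_steps - 1)."""
    t = step / (total_steps - 1)  # progress in [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```

This mirrors the standard schedule (e.g. PyTorch's CosineAnnealingLR) restricted to the 10-step budget reported above.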