R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts

Authors: Zhongyang Li, Ziyue Li, Tianyi Zhou

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental R2-T2 consistently and significantly improves state-of-the-art LMMs' performance on challenging multimodal benchmarks of diverse tasks, without training any parameters in the base model. Our code can be accessed here. ... Mixture-of-Experts (MoE) has achieved remarkable success in scaling up the size and capacity of large language and multimodal models (LLMs and LMMs) (Shazeer et al., 2017) without (significantly) increasing the inference cost. Specifically, it allows us to increase the total number of experts ...
Researcher Affiliation Academia 1Department of Computer Science, Johns Hopkins University, Baltimore, USA 2Department of Computer Science, University of Maryland, College Park, USA. Correspondence to: Tianyi Zhou <EMAIL>.
Pseudocode No The paper describes methods using mathematical equations and prose in Section 3 and its subsections (Gradient Descent, Kernel Regression, Mode Finding) and illustrates them with Figure 3, but does not present any structured pseudocode or algorithm blocks.
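Since the paper gives no pseudocode, a minimal sketch of one of the named strategies, kernel regression, may help readers; this is an illustrative reconstruction, not the authors' implementation. The function name, the squared-Euclidean distance, and the renormalization step are assumptions; the paper's own formulation (Section 3) should be consulted for the exact update.

```python
import numpy as np

def kernel_regression_rerouting(x_emb, ref_embs, ref_routing, k=5, bandwidth=1.0):
    """Hypothetical sketch of kernel-regression re-routing: replace a test
    sample's routing weights with a Gaussian-kernel-weighted average of the
    routing weights of its k nearest reference samples."""
    # squared Euclidean distances from the test embedding to every reference
    d2 = np.sum((ref_embs - x_emb) ** 2, axis=1)
    nn = np.argsort(d2)[:k]                      # indices of the k nearest neighbors
    w = np.exp(-d2[nn] / (2 * bandwidth ** 2))   # Gaussian kernel weights
    w /= w.sum()                                 # normalize kernel weights
    r = w @ ref_routing[nn]                      # weighted average of neighbor routings
    return r / r.sum()                           # renormalize to a distribution
```

With k = 5 and a Gaussian kernel this matches the hyperparameters quoted below, but the bandwidth value is a placeholder.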
Open Source Code Yes Our code can be accessed here.
Open Datasets Yes Table 1 summarizes the reference datasets and evaluation benchmarks, including their dataset sizes. See Appendix B for details. Appendix B Evaluation Benchmarks and Reference Datasets: We conduct evaluations using a diverse set of reference datasets and task-specific benchmarks (Liang et al., 2025). For general visual understanding, we use four reference datasets: VQA-V2 (Goyal et al., 2017), Visual7W (Zhu et al., 2016), CLEVR (Johnson et al., 2017), and COCO-QA (Lu et al., 2016).
Dataset Splits Yes To ensure a balanced evaluation, we randomly sample 5,000 instances from datasets exceeding this size. TQA (Kembhavi et al., 2017): ... The dataset is split into training, validation, and test sets, with no content overlap, ensuring robust evaluation of models' ability to integrate and reason over multimodal information.
Hardware Specification Yes We measure inference latency on RTX A6000 to assess the computational overhead of R2-T2.
Software Dependencies No The text describes methods, models, and hyperparameters (e.g., 'cosine annealing schedule', 'Gaussian kernel', 'NV-Embed-V2 embedding model') but does not specify any programming languages, libraries, or frameworks with version numbers (e.g., 'Python 3.8', 'PyTorch 1.9').
Experiment Setup Yes The selected hyperparameters are as follows: cosine annealing schedule with a learning rate ranging from 1×10⁻² to 1×10⁻⁵, neighborhood selection is performed using kNN with k = 5, the number of NGD steps is fixed at 10, the Gaussian kernel is used for kernel-based methods, and NV-Embed-V2 is adopted as the embedding model. These values are applied uniformly across all evaluated tasks.
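The quoted schedule (cosine annealing from 1×10⁻² to 1×10⁻⁵ over 10 NGD steps) can be written out concretely; the function name and the linear step-to-progress mapping are assumptions for illustration, not taken from the paper.

```python
import math

def cosine_annealed_lr(step, total_steps=10, lr_max=1e-2, lr_min=1e-5):
    """Cosine-annealed learning rate: lr_max at step 0, decaying along a
    half cosine to lr_min at the final step (total_steps - 1)."""
    t = step / (total_steps - 1)  # progress in [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```

This mirrors the standard schedule (e.g. PyTorch's CosineAnnealingLR) restricted to the 10-step budget reported above.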