Re-Imagining Multimodal Instruction Tuning: A Representation View
Authors: Yiyang Liu, James Liang, Ruixiang Tang, Yugyung Lee, Majid Rabbani, Sohail Dianat, Raghuveer Rao, Lifu Huang, Dongfang Liu, Qifan Wang, Cheng Han
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results show that our method surpasses current state-of-the-art baselines with significant performance gains (e.g., 1580.40 MME score) while requiring substantially fewer tunable parameters (e.g., 0.03% parameters). Additionally, we conduct experiments on editing instrumental tokens within multimodal representations, demonstrating that direct manipulation of these representations enables simple yet effective control over network behavior. |
| Researcher Affiliation | Collaboration | 1 University of Missouri-Kansas City; 2 Rochester Institute of Technology; 3 U.S. Naval Research Laboratory; 4 Rutgers University; 5 U.S. DEVCOM Army Research Laboratory; 6 University of California, Davis; 7 Meta AI |
| Pseudocode | No | The paper describes methods using mathematical equations (e.g., equation 1, 2, 3, 4) but does not include any clearly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | Our full implementation is available at https://github.com/comeandcode/MRT. |
| Open Datasets | Yes | All the datasets included in our study are publicly available (i.e., Vision-Flan, MME, Text-VQA, Visual Spatial Reasoning (VSR), CIFAR-10/100, MNIST, SNLI-VE, POPE), and all the models are publicly available (see Appendix S7 for Asset License and Consent). |
| Dataset Splits | Yes | We conduct multimodal instruction tuning on Vision-Flan (Xu et al., 2024), a human-annotated multimodal instruction tuning dataset with 191 diverse tasks. Following common practice (Shen et al., 2024), we employ the scaled-down version containing up to 1,000 instances per task, resulting in a total of 191,105 instances. ... For the controllability experiment (Section 4.3), we trained our two sets of representation editors ψ1 and ψ2 on the CIFAR-10 (Krizhevsky et al., 2009) training dataset for 1 epoch and evaluated the control performance on the testing dataset. ... Specifically, we select 8,017 instances as the training set and 1,189 instances as the validation set on textual tokens beginning with "what is the n", where n represents an image attribute (e.g., name, color, brand). |
| Hardware Specification | Yes | Experiments are conducted on NVIDIA A100-40GB GPUs. |
| Software Dependencies | No | MRT is implemented in PyTorch (Paszke et al., 2019). The paper mentions PyTorch and its authors but does not provide a specific version number for the software. |
| Experiment Setup | Yes | Table S2: Hyperparameters and Configurations. Learning Rate: 6e-4; Batch Size: 128; Epochs: 3; LR Scheduler: linear; Warmup Ratio: 0.03; Activation Type: bfloat16; Optimizer: Adam |