Re-Imagining Multimodal Instruction Tuning: A Representation View
Authors: Yiyang Liu, James Liang, Ruixiang Tang, Yugyung Lee, Majid Rabbani, Sohail Dianat, Raghuveer Rao, Lifu Huang, Dongfang Liu, Qifan Wang, Cheng Han
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results show that our method surpasses current state-of-the-art baselines with significant performance gains (e.g., 1580.40 MME score) while requiring substantially fewer tunable parameters (e.g., 0.03% parameters). Additionally, we conduct experiments on editing instrumental tokens within multimodal representations, demonstrating that direct manipulation of these representations enables simple yet effective control over network behavior. |
| Researcher Affiliation | Collaboration | 1 University of Missouri-Kansas City; 2 Rochester Institute of Technology; 3 U.S. Naval Research Laboratory; 4 Rutgers University; 5 U.S. DEVCOM Army Research Laboratory; 6 University of California, Davis; 7 Meta AI |
| Pseudocode | No | The paper describes methods using mathematical equations (e.g., equation 1, 2, 3, 4) but does not include any clearly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | Our full implementation is available at https://github.com/comeandcode/MRT. |
| Open Datasets | Yes | All the datasets included in our study are publicly available (i.e., Vision-Flan, MME, Text-VQA, Visual Spatial Reasoning (VSR), CIFAR-10/100, MNIST, SNLI-VE, POPE), and all the models are publicly available (see Appendix S7 for Asset License and Consent). |
| Dataset Splits | Yes | We conduct multimodal instruction tuning on Vision-Flan (Xu et al., 2024), a human-annotated multimodal instruction tuning dataset with 191 diverse tasks. Following common practice (Shen et al., 2024), we employ the scaled-down version containing up to 1,000 instances per task, resulting in a total of 191,105 instances. ... For the controllability experiment (Section 4.3), we trained our two sets of representation editors ψ1 and ψ2 on the CIFAR-10 (Krizhevsky et al., 2009) training dataset for 1 epoch and evaluated the control performance on the testing dataset. ... Specifically, we select 8,017 instances as the training set and 1,189 instances as the validation set on textual tokens beginning with "what is the n", where n represents an image attribute (e.g., name, color, brand). |
| Hardware Specification | Yes | Experiments are conducted on NVIDIA A100-40GB GPUs. |
| Software Dependencies | No | MRT is implemented in PyTorch (Paszke et al., 2019). The paper mentions PyTorch and its authors but does not provide a specific version number for the software. |
| Experiment Setup | Yes | Table S2: Hyperparameters and Configurations. Learning Rate: 6e-4; Batch Size: 128; Epochs: 3; LR Scheduler: linear; Warmup Ratio: 0.03; Activation Type: bfloat16; Optimizer: Adam |