A Comprehensive Overhaul of Multimodal Assistant with Small Language Models
Authors: Minjie Zhu, Yichen Zhu, Ning Liu, Xin Liu, Zhiyuan Xu, Chaomin Shen, Yaxin Peng
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our thorough empirical research leads us to several findings that diverge from the conventional wisdom established by prior studies on Multimodal Large Language Models. We evaluate our model variants on academic-task-oriented benchmarks (left) as well as instruction-following benchmarks (middle). Experimental Results on Visual Question Answering Benchmarks. We evaluate the visual question answering abilities on VQAv2 (Goyal et al. 2017), GQA (Hudson and Manning 2019), VizWiz (Gurari et al. 2018), SQA^I (Lu et al. 2022), and VQA^T (Singh et al. 2019). As shown in Table 2, Mipha-3B achieves the highest performance in 2 out of the 5 benchmarks. |
| Researcher Affiliation | Collaboration | Minjie Zhu^1*, Yichen Zhu^2*, Ning Liu^2, Xin Liu^1, Zhiyuan Xu^2, Chaomin Shen^1, Yaxin Peng^3. ^1 East China Normal University; ^2 Midea Group; ^3 Shanghai University. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code https://github.com/zhuyiche/llava-phi |
| Open Datasets | Yes | We conduct empirical studies on a collection of both academic-task-oriented benchmarks and recent benchmarks specifically proposed for instruction-following MLLMs, totaling 8 benchmarks. For academic-task-oriented benchmarks, VQAv2 (Goyal et al. 2017) and GQA (Hudson and Manning 2019) evaluate the model's visual perception capabilities on open-ended short answers. ScienceQA (Lu et al. 2022) is used to evaluate the zero-shot generalization on scientific question answering. TextVQA (Singh et al. 2019) contains text-rich visual question answering. We also employ recent benchmarks proposed for instruction-following MLLMs. POPE (Li et al. 2023b) evaluates MLLMs' degree of hallucination on three sampled subsets of COCO (Lin et al. 2014). The MME Benchmark (Fu et al. 2023) evaluates MLLMs' perception and cognition capabilities, and MMBench (Liu et al. 2023d) evaluates the model's answer robustness with all-around shuffling of multiple-choice answers. MM-Vet (Yu et al. 2023) evaluates MLLMs' capabilities in engaging in visual conversations on a diverse range of tasks. |
| Dataset Splits | No | The paper evaluates on various benchmarks like VQAv2, GQA, etc., but does not explicitly provide details on how the datasets were split into training, validation, and test sets within the text. It relies on the implied standard splits of these datasets without stating them explicitly. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers. |
| Experiment Setup | Yes | For the LoRA setup, we configure r to be 128 and α to be 256. |
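The reported LoRA setup (r = 128, α = 256) implies a scaling factor α/r = 2 on the low-rank update. The minimal NumPy sketch below illustrates that configuration; it is an assumption-laden illustration, not the authors' implementation, and the layer sizes (`d_in`, `d_out`) and initializations are hypothetical.

```python
import numpy as np

# LoRA hyperparameters as reported in the paper.
r, alpha = 128, 256
scaling = alpha / r  # = 2.0

# Hypothetical layer dimensions (not from the paper).
d_in, d_out = 512, 512
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01      # trainable down-projection
B = np.zeros((d_out, r))                       # trainable up-projection, zero-initialized

def lora_forward(x):
    # y = W x + (alpha / r) * B A x
    return W @ x + scaling * (B @ (A @ x))

x = rng.standard_normal(d_in)
y = lora_forward(x)
# With B zero-initialized, the LoRA branch contributes nothing before training,
# so the adapted layer initially matches the frozen base layer.
assert np.allclose(y, W @ x)
```

Zero-initializing `B` is the standard LoRA convention: the adapter starts as an identity perturbation, and only the rank-128 factors `A` and `B` receive gradient updates while `W` stays frozen.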