Vision-Language Instruction Tuning: A Review and Analysis

Authors: Chen Li, Yixiao Ge, Dian Li, Ying Shan

TMLR 2024

Reproducibility assessment (variable, result, and supporting LLM response):

- Research Type: Experimental. "By incorporating these characteristics as guiding principles into the existing VLIT data construction process, we conduct extensive experiments and verify their positive impact on the performance of tuned multi-modal LLMs. [...] Section 5 primarily includes the design, implementation, and discussion of the verification experiment."
- Researcher Affiliation: Industry. "Chen Li EMAIL, Yixiao Ge EMAIL, Dian Li EMAIL, Ying Shan EMAIL, ARC Lab, Tencent PCG; Foundation Technology Center, Tencent PCG."
- Pseudocode: No. The paper describes the proposed pipeline in prose and illustrates it with a flowchart in Figure 4, but it contains no formally structured pseudocode blocks or algorithms.
- Open Source Code: Yes. "The code and dataset related to this paper have been open-sourced at https://github.com/palchenli/VL-Instruction-Tuning."
- Open Datasets: Yes. "The code and dataset related to this paper have been open-sourced at https://github.com/palchenli/VL-Instruction-Tuning. [...] Specifically, in data collection, we first select COCO 2014 (Lin et al., 2014) as the image source, and {caption, object, attribute, OCR, visual QA} as the selected sources of annotation data (Antol et al., 2015; Patterson & Hays, 2016; Veit et al., 2016)."
- Dataset Splits: No. The paper states: "In the quality evaluation process, to ensure fairness, we use the smallest dataset size as the scale for all the test VLIT data and randomly sample VLIT datasets larger than this scale." This describes how the evaluation data size is selected, but it provides no explicit train/validation/test splits for the constructed VLIT data or for the existing VLIT datasets used for instruction tuning.
- Hardware Specification: Yes. "These MLLMs are all trained utilizing 8 Tesla V100 (32G) GPUs," and other detailed settings (e.g., hyperparameters) for the three models are given in Section A.3 of the Appendix.
- Software Dependencies: No. The paper mentions a "Python environment" and specific MLLM libraries such as LLaVA, LAVIS, and OpenFlamingo, but it does not provide version numbers for these software components or for the Python interpreter itself.
- Experiment Setup: Yes. "These MLLMs are all trained utilizing 8 Tesla V100 (32G) GPUs." [...] "Table 11, Table 12, and Table 13 respectively list all their hyperparameter settings during instruction tuning."
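The "smallest dataset size" sampling rule quoted under Dataset Splits can be sketched as follows. This is a minimal illustration of the stated procedure, not the authors' released code; the dataset names and `equalize_vlit_datasets` helper are hypothetical:

```python
import random

def equalize_vlit_datasets(datasets, seed=0):
    """Downsample every VLIT dataset to the size of the smallest one,
    so quality evaluation compares all datasets at a common scale."""
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    target = min(len(samples) for samples in datasets.values())
    return {
        name: list(samples) if len(samples) == target
        else rng.sample(samples, target)
        for name, samples in datasets.items()
    }

# Hypothetical toy datasets standing in for real VLIT data.
data = {
    "dataset_a": list(range(150)),
    "dataset_b": list(range(3)),
    "dataset_c": list(range(80)),
}
balanced = equalize_vlit_datasets(data)
# Every dataset is now cut to 3 samples, the smallest dataset's size.
```

The fixed seed makes the random subsampling repeatable, which matters when the same downsampled datasets must be reused across the tuned MLLMs being compared.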