Vision-Language Instruction Tuning: A Review and Analysis
Authors: Chen Li, Yixiao Ge, Dian Li, Ying Shan
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | By incorporating these characteristics as guiding principles into the existing VLIT data construction process, we conduct extensive experiments and verify their positive impact on the performance of tuned multi-modal LLMs. [...] Section 5 primarily includes the design, implementation, and discussion of the verification experiment. |
| Researcher Affiliation | Industry | Chen Li EMAIL, Yixiao Ge EMAIL, Dian Li EMAIL, Ying Shan EMAIL, ARC Lab, Tencent PCG Foundation Technology Center, Tencent PCG |
| Pseudocode | No | The paper describes the proposed pipeline in prose and illustrates it with a flowchart in Figure 4, but it does not contain any formally structured pseudocode blocks or algorithms. |
| Open Source Code | Yes | The code and dataset related to this paper have been open-sourced at https://github.com/palchenli/VL-Instruction-Tuning. |
| Open Datasets | Yes | The code and dataset related to this paper have been open-sourced at https://github.com/palchenli/VL-Instruction-Tuning. [...] Specifically, in data collection, we first select COCO 2014 (Lin et al., 2014) as the image source, and {caption, object, attribute, OCR, visual QA} as the selected sources of annotation data (Antol et al., 2015; Patterson & Hays, 2016; Veit et al., 2016). |
| Dataset Splits | No | The paper states: "In the quality evaluation process, to ensure fairness, we use the smallest dataset size as the scale for all the test VLIT data and randomly sample VLIT datasets larger than this scale." This describes a method for selecting evaluation data size but does not provide explicit train/validation/test splits for the constructed VLIT data or the existing VLIT datasets used for instruction tuning. |
| Hardware Specification | Yes | These MLLMs are all trained using 8 Tesla V100 (32GB) GPUs; the Python environment and other detailed settings (e.g., hyperparameters) of the three models can be found in Section A.3 of the Appendix. |
| Software Dependencies | No | The paper mentions a "Python environment" and specific MLLM libraries such as "LLaVA library", "LAVIS", and "Open Flamingo", but it does not provide version numbers for these software components or for the Python interpreter itself. |
| Experiment Setup | Yes | These MLLMs are all trained using 8 Tesla V100 (32GB) GPUs; the Python environment and other detailed settings (e.g., hyperparameters) of the three models can be found in Section A.3 of the Appendix. [...] Table 11, Table 12, and Table 13 respectively list all their hyperparameter settings during instruction tuning. |