LLaVA-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Authors: Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun MA, Chunyuan Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, LLaVA-Interleave achieves leading results in multi-image, video, and 3D benchmarks, while maintaining the performance of single-image tasks. Besides, our model also exhibits several emerging capabilities, e.g., transferring tasks across different settings and modalities. Code is available at https://github.com/LLaVA-VL/LLaVA-NeXT. |
| Researcher Affiliation | Collaboration | 1 ByteDance, 2 HKUST, 3 CUHK, 4 NTU (core contributor) |
| Pseudocode | No | The paper describes the approach and techniques used, but does not present any specific pseudocode or algorithm blocks. It focuses on experimental results and model capabilities. |
| Open Source Code | Yes | Code is available at https://github.com/LLaVA-VL/LLaVA-NeXT. |
| Open Datasets | Yes | We compile a high-quality training dataset, M4-Instruct, with 1177.6K samples to empower LMMs with the M4 capabilities, which spans 4 primary domains (multi-image, video, 3D, and single-image) with 14 tasks and 41 datasets. We also curate LLaVA-Interleave Bench, a diverse set of benchmarks to evaluate the multi-image performance, including 7 newly collected and 13 existing in/out-domain benchmarks. For multi-image evaluation, we adopt the proposed LLaVA-Interleave Bench covering comprehensive in-domain and out-domain tasks. For video evaluation, we utilize the existing NExT-QA (57), MVBench (30), Video Detailed Description (VDD) (67), and ActivityNet-QA (Act) (59). For 3D evaluation, we select ScanQA (3), two tasks from 3D-LLM (16), i.e., 3D-assisted Dialogue and Task Decomposition, and also curate two new test sets from nuScenes VQA (6) and ALFRED (48). Table 15: M4-Instruct detailed datasets. Table 16: LLaVA-Interleave Bench detailed datasets. |
| Dataset Splits | Yes | To empower all-round multi-image capabilities, we meticulously curate a comprehensive training dataset including 1177.6K instances, termed M4-Instruct, widely spanning multi-image, multi-frame, and multi-view scenarios with 14 tasks and 41 datasets, along with multi-patch data to preserve basic single-image performance. For the single-image data, we randomly sample 40% of the stage-2 fine-tuning data from LLaVA-NeXT (24), which aims to preserve the single-image capacity. We present a data overview of the benchmark in Figure 3, and the detailed data statistics in Table 16. In detail, we categorize multi-image tasks into two classes: In-domain Evaluation includes tasks that have been seen during our training, designed to verify the model performance within familiar scenarios. We adopt 5 newly curated multi-image tasks corresponding to training datasets, and 2 existing benchmarks, Q-Bench (56) and NLVR2 (50), with 12.9K in total. Out-domain Evaluation involves tasks that don't overlap with training scenarios, aiming to reveal the generalization capacity of LMMs. We construct 2 new tasks for multi-image mathematical (MathVerse (65)) and scientific (SciVerse (13)) comprehension, and utilize 3 existing benchmarks, Mantis-Eval (19), BLINK (10), and MMMU (60), with 4.1K in total. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types) used for running the experiments. It only mentions the architecture and LLM sizes. |
| Software Dependencies | Yes | Following the same architecture in LLaVA-NeXT (24), our LLaVA-Interleave adopts Qwen 1.5 (5) as the base LLM with 0.5B, 7B and 14B parameters, SigLIP-400M (62) (384×384) as the vision encoder, and a two-layer MLP as the projection layer. |
| Experiment Setup | Yes | In this section, we introduce several key techniques during the interleaved visual instruction tuning of LLaVA-Interleave. For architecture designs, we follow LLaVA-NeXT (24) to adopt the most general framework, i.e., a vision encoder (62), an intermediate projector, and an LLM (4). Then, we consider the following techniques to achieve improved multi-image performance. Technique 1: Continue training from single-image models. Technique 2: Mixed interleaved data formats during training. Technique 3: Combining different data scenarios improves individual task performance. Following the same architecture in LLaVA-NeXT (24), our LLaVA-Interleave adopts Qwen 1.5 (5) as the base LLM with 0.5B, 7B and 14B parameters, SigLIP-400M (62) (384×384) as the vision encoder, and a two-layer MLP as the projection layer. Similar to LLaVA-NeXT-Video, we adopt a pooling-to-1/4 strategy: we pool the width and height of each feature map to 1/2, thereby reducing the total number of visual tokens to 1/4. During training, we sample 10 frames per video. |
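
The architecture quoted in the setup rows (vision encoder, two-layer MLP projector, LLM) can be illustrated with a minimal NumPy sketch of the projector alone. The feature widths below (SigLIP output of 1152, an LLM hidden size of 1024) and the 729-token patch grid (27×27 patches for a 384×384 input) are illustrative assumptions, not values reported in the review above:

```python
import numpy as np

# Hypothetical dimensions for illustration only.
VISION_DIM = 1152   # assumed SigLIP feature width
LLM_DIM = 1024      # assumed LLM hidden size

rng = np.random.default_rng(0)
W1 = rng.standard_normal((VISION_DIM, LLM_DIM)) * 0.02
b1 = np.zeros(LLM_DIM)
W2 = rng.standard_normal((LLM_DIM, LLM_DIM)) * 0.02
b2 = np.zeros(LLM_DIM)

def project(tokens: np.ndarray) -> np.ndarray:
    """Two-layer MLP projector: Linear -> GELU (tanh approx) -> Linear."""
    h = tokens @ W1 + b1
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return h @ W2 + b2

# 27x27 patch grid -> 729 visual tokens, each mapped into the LLM space.
vision_tokens = rng.standard_normal((729, VISION_DIM))
llm_tokens = project(vision_tokens)
print(llm_tokens.shape)  # (729, 1024)
```

The projector is applied per visual token, so the token count is unchanged; only the embedding width is remapped to match the LLM.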
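
The pooling-to-1/4 strategy mentioned in the setup row can likewise be sketched: average-pool the height and width of a per-frame feature map by 2 each, so the token count drops to a quarter. The grid size (24×24) and channel width are illustrative assumptions chosen to be evenly divisible:

```python
import numpy as np

def pool_to_quarter(feat: np.ndarray) -> np.ndarray:
    """(H, W, C) patch features -> (H//2, W//2, C) via 2x2 mean pooling."""
    H, W, C = feat.shape
    return feat.reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))

frame = np.arange(24 * 24 * 8, dtype=float).reshape(24, 24, 8)
pooled = pool_to_quarter(frame)
# Tokens per frame: 24*24 = 576 before, 12*12 = 144 after (1/4).
print(frame.shape[0] * frame.shape[1], "->", pooled.shape[0] * pooled.shape[1])
```

With 10 sampled frames per video, this keeps the visual-token budget at a quarter of the unpooled cost while preserving the coarse spatial layout of each frame.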