LLaVA-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Authors: Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, Chunyuan Li

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments, LLaVA-Interleave achieves leading results in multi-image, video, and 3D benchmarks, while maintaining the performance of single-image tasks. Besides, our model also exhibits several emerging capabilities, e.g., transferring tasks across different settings and modalities. Code is available at https://github.com/LLaVA-VL/LLaVA-NeXT.
Researcher Affiliation | Collaboration | 1 ByteDance, 2 HKUST, 3 CUHK, 4 NTU
Pseudocode | No | The paper describes the approach and techniques used, but does not present any specific pseudocode or algorithm blocks. It focuses on experimental results and model capabilities.
Open Source Code | Yes | Code is available at https://github.com/LLaVA-VL/LLaVA-NeXT.
Open Datasets | Yes | We compile a high-quality training dataset, M4-Instruct, with 1177.6K samples to empower LMMs with the M4 capabilities, which spans 4 primary domains (multi-image, video, 3D, and single-image) with 14 tasks and 41 datasets. We also curate LLaVA-Interleave Bench, a diverse set of benchmarks to evaluate multi-image performance, including 7 newly collected and 13 existing in/out-domain benchmarks. For multi-image evaluation, we adopt the proposed LLaVA-Interleave Bench covering comprehensive in-domain and out-domain tasks. For video evaluation, we utilize the existing NExT-QA (57), MVBench (30), Video Detailed Description (VDD) (67), and ActivityNet-QA (Act) (59). For 3D evaluation, we select ScanQA (3), two tasks from 3D-LLM (16), i.e., 3D-assisted Dialogue and Task Decomposition, and also curate two new test sets from nuScenes VQA (6) and ALFRED (48). Table 15: M4-Instruct detailed datasets. Table 16: LLaVA-Interleave Bench detailed datasets.
Dataset Splits | Yes | To empower all-round multi-image capabilities, we meticulously curate a comprehensive training dataset of 1177.6K instances, termed M4-Instruct, widely spanning multi-image, multi-frame, and multi-view scenarios with 14 tasks and 41 datasets, along with multi-patch data to preserve basic single-image performance. For the single-image data, we randomly sample 40% of the stage-2 fine-tuning data from LLaVA-NeXT (24), which aims to preserve the single-image capacity. We present a data overview of the benchmark in Figure 3, and the detailed data statistics in Table 16. In detail, we categorize multi-image tasks into two classes: In-domain Evaluation includes tasks that have been seen during our training, designed to verify model performance within familiar scenarios. We adopt 5 newly curated multi-image tasks corresponding to training datasets, and 2 existing benchmarks, Q-Bench (56) and NLVR2 (50), with 12.9K samples in total. Out-domain Evaluation involves tasks that don't overlap with training scenarios, aiming to reveal the generalization capacity of LMMs. We construct 2 new tasks for multi-image mathematical (MathVerse (65)) and scientific (SciVerse (13)) comprehension, and utilize 3 existing benchmarks, Mantis-Eval (19), BLINK (10), and MMMU (60), with 4.1K samples in total.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types) used for running the experiments. It only mentions the architecture and LLM sizes.
Software Dependencies | Yes | Following the same architecture as LLaVA-NeXT (24), our LLaVA-Interleave adopts Qwen1.5 (5) as the base LLM with 0.5B, 7B, and 14B parameters, SigLIP-400M (62) (384×384) as the vision encoder, and a two-layer MLP as the projection layer.
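The projector described above is simple enough to sketch. Below is a minimal, hedged illustration (not the authors' code) of a two-layer MLP that maps vision-encoder patch features into the LLM embedding space; the hidden sizes (1152 for SigLIP, 4096 for the LLM) and the GELU activation are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def gelu(x):
    # Tanh approximation of GELU, a common choice between MLP layers.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

class MLPProjector:
    """Two-layer MLP projecting vision features into the LLM embedding space.

    vision_dim and llm_dim are illustrative guesses, not the paper's values.
    """
    def __init__(self, vision_dim=1152, llm_dim=4096):
        self.w1 = rng.normal(0, 0.02, (vision_dim, llm_dim))
        self.b1 = np.zeros(llm_dim)
        self.w2 = rng.normal(0, 0.02, (llm_dim, llm_dim))
        self.b2 = np.zeros(llm_dim)

    def __call__(self, x):
        # x: (num_patches, vision_dim) patch features for one image
        return gelu(x @ self.w1 + self.b1) @ self.w2 + self.b2

# A 384x384 SigLIP input with 14x14 patches yields 27x27 = 729 patch tokens.
tokens = MLPProjector()(np.zeros((729, 1152)))
print(tokens.shape)  # (729, 4096): one LLM-space embedding per patch
```

The projected tokens are then interleaved with text-token embeddings in the LLM input sequence.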
Experiment Setup | Yes | In this section, we introduce several key techniques during the interleaved visual instruction tuning of LLaVA-Interleave. For architecture design, we follow LLaVA-NeXT (24) to adopt the most general framework, i.e., a vision encoder (62), an intermediate projector, and an LLM (4). Then, we consider the following techniques to achieve improved multi-image performance. Technique 1: Continue training from single-image models. Technique 2: Mix interleaved data formats during training. Technique 3: Combining different data scenarios improves individual task performance. Following the same architecture as LLaVA-NeXT (24), our LLaVA-Interleave adopts Qwen1.5 (5) as the base LLM with 0.5B, 7B, and 14B parameters, SigLIP-400M (62) (384×384) as the vision encoder, and a two-layer MLP as the projection layer. Similar to LLaVA-NeXT-Video, we adopt a pooling-to-1/4 strategy: we pool the width and height of feature maps to 1/2 each, thereby reducing the total number of visual tokens to 1/4. During training, we sample 10 frames per video.
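The two video-specific choices above can be sketched concretely: halving width and height via 2×2 average pooling leaves each frame with 1/4 of its visual tokens, and frames are sampled uniformly across the clip. This is a minimal sketch under assumed shapes (a 24×24 patch grid is chosen purely so it divides evenly; the real grid size depends on the encoder), not the paper's implementation.

```python
import numpy as np

def pool_to_quarter(feat):
    """2x2 average pooling over a (H, W, C) patch-feature grid.

    Halves both spatial dimensions, so the token count drops to 1/4.
    H and W are assumed even here for simplicity.
    """
    H, W, C = feat.shape
    blocks = feat.reshape(H // 2, 2, W // 2, 2, C)
    return blocks.mean(axis=(1, 3))  # (H/2, W/2, C)

def sample_frames(num_frames, k=10):
    """Uniformly pick k frame indices from a video of num_frames frames."""
    return np.linspace(0, num_frames - 1, k).round().astype(int)

feat = np.random.randn(24, 24, 8)       # illustrative 24x24 grid, 8 channels
pooled = pool_to_quarter(feat)
print(pooled.shape)                      # (12, 12, 8): 144 tokens, was 576
print(sample_frames(100))                # 10 evenly spaced frame indices
```

With 10 frames per video, this pooling keeps the total visual-token budget per clip at roughly 2.5 frames' worth of unpooled tokens, which is presumably what makes multi-frame training tractable alongside the multi-image and 3D data.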