LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
Authors: Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across 11 image-based and 7 video-based benchmarks demonstrate that LLaVA-Mini outperforms LLaVA-v1.5 with just 1 vision token instead of 576. Efficiency analyses reveal that LLaVA-Mini can reduce FLOPs by 77%, deliver low-latency responses within 40 milliseconds, and process over 10,000 frames of video on GPU hardware with 24GB of memory. |
| Researcher Affiliation | Academia | Shaolei Zhang (1,3), Qingkai Fang (1,3), Zhe Yang (1,3), Yang Feng (1,2,3); (1) Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS); (2) Key Laboratory of AI Safety, Chinese Academy of Sciences; (3) University of Chinese Academy of Sciences, Beijing, China. EMAIL, EMAIL |
| Pseudocode | No | The paper includes mathematical equations (e.g., Equations 2 and 3) and architectural diagrams (Figure 6), but it does not present any structured pseudocode blocks or algorithms. |
| Open Source Code | Yes | Code: https://github.com/ictnlp/LLaVA-Mini; Model: https://huggingface.co/ICTNLP/llava-mini-llama-3.1-8b |
| Open Datasets | Yes | Benchmarks: We evaluate LLaVA-Mini on image and video understanding tasks. Experiments are conducted on 11 image benchmarks and 7 video benchmarks. Refer to Appendix C for details. Following the LLaVA framework (Liu et al., 2023b), we conduct experiments on 11 widely adopted benchmarks, including VQA-v2 (VQAv2) (Goyal et al., 2017), GQA (Hudson & Manning, 2019), VizWiz (Gurari et al., 2018), ScienceQA-IMG (SciQA) (Lu et al., 2022), TextVQA (VQAT) (Singh et al., 2019), POPE (Li et al., 2023c), MME (Fu et al., 2024), MMBench (MMB) (Liu et al., 2024c), SEED-Bench (SEED) (Li et al., 2024b), LLaVA-Bench-in-the-Wild (LLaVAW) (Liu et al., 2023a), and MM-Vet (Yu et al., 2023), which cover a diverse range of visual tasks. |
| Dataset Splits | Yes | LLaVA-Mini uses the same training data as LLaVA-v1.5 (Liu et al., 2023b), using 558K caption data for pretraining and 665K instruction data for instruction tuning. The high-resolution version with 672×672 pixels (refer to Sec. 4.2) is denoted as LLaVA-Mini-HD. To capture more visual details, the compression hyperparameter C of LLaVA-Mini-HD is set to 8, i.e., compressing to 64 vision tokens. For video processing, LLaVA-Mini extracts 1 frame per second (1 fps) from the video and sets C = 1 to represent each frame with one vision token. To further explore the potential of LLaVA-Mini, we introduce a variant that uses the CLIP ViT-L/336px (Radford et al., 2021) as vision encoder and the advanced LLaMA-3.1-8B-Instruct (Dubey et al., 2024) as LLM backbone. During instruction tuning, we combine 665K image instruction data from LLaVA (Liu et al., 2023b), 100K video instruction data from Video-ChatGPT (Maaz et al., 2024), and part of open-source data (Li et al., 2024a), resulting in 3 million training samples. |
| Hardware Specification | Yes | As a result, LLaVA-Mini decreases inference latency of image understanding from 100 ms to 40 ms and also enables the processing of long videos exceeding 10,000 frames (over 3 hours) on an NVIDIA RTX 3090 with 24GB of memory. LLaVA-Mini is trained using 8 NVIDIA A800 GPUs. Latency is tested on the A100 without any engineering acceleration techniques. To demonstrate the scalability of model efficiency across different hardware platforms, we compute the inference latency of LLaVA-Mini on three hardware platforms: RTX 3090, A100, and A800. |
| Software Dependencies | No | The paper mentions 'calflops (Ye, 2023)' for calculating FLOPs, but it does not specify version numbers for other core software components like programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or CUDA versions used for the experiments. |
| Experiment Setup | Yes | Configuration: For a fair comparison, LLaVA-Mini employs the same configurations as LLaVA-v1.5 (Liu et al., 2023b), using the CLIP ViT-L/336px (Radford et al., 2021) as the vision encoder and Vicuna-v1.5-7B (Chiang et al., 2023) as the LLM backbone. The compression hyperparameter C is set to 1, meaning vision tokens are compressed to one token. The number of modality pre-fusion layers Nfusion is set to 4. LLaVA-Mini uses the same training data as LLaVA-v1.5 (Liu et al., 2023b), using 558K caption data for pretraining and 665K instruction data for instruction tuning. The high-resolution version with 672×672 pixels (refer to Sec. 4.2) is denoted as LLaVA-Mini-HD. To capture more visual details, the compression hyperparameter C of LLaVA-Mini-HD is set to 8, i.e., compressing to 64 vision tokens. For video processing, LLaVA-Mini extracts 1 frame per second (1 fps) from the video and sets C = 1 to represent each frame with one vision token. Training details are provided in Appendix B, Table 9: Stage 1 (Vision-Language Pretraining) / Stage 2 (Instruction Tuning): Vision Encoder frozen / frozen; Projection trainable / trainable; Large Language Model frozen / trainable; Compression N/A / trainable; Modality Pre-fusion N/A / trainable; Batch Size 256 / 256; Learning Rate 1e-4; MM Learning Rate 1e-3 / 1e-5; Schedule cosine decay; Warmup Ratio 0.03; Optimizer AdamW; Epochs 1 / 2. |
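The reported 77% FLOPs reduction can be sanity-checked with back-of-envelope arithmetic: decoder compute scales roughly linearly with layers × tokens, and LLaVA-Mini trades 576 vision tokens through all 32 LLM layers for 4 modality pre-fusion layers over the full sequence plus a single vision token thereafter. A minimal sketch, assuming a hypothetical 60-token text prompt and ignoring attention's quadratic term (the paper's actual figure comes from calflops, not this estimate):

```python
# Rough FLOPs comparison between LLaVA-v1.5 and LLaVA-Mini.
# Assumptions (not from the paper): per-layer cost is proportional to token
# count; the text prompt is ~60 tokens; attention's quadratic term is ignored.
TEXT_TOKENS = 60                 # hypothetical prompt length
LLM_LAYERS = 32                  # Vicuna-v1.5-7B decoder layers
PREFUSION_LAYERS = 4             # modality pre-fusion layers in LLaVA-Mini
VISION_FULL, VISION_MINI = 576, 1

# LLaVA-v1.5: all 576 vision tokens pass through every LLM layer.
llava_cost = LLM_LAYERS * (VISION_FULL + TEXT_TOKENS)
# LLaVA-Mini: full sequence only in the pre-fusion layers, then 1 vision token.
mini_cost = (PREFUSION_LAYERS * (VISION_FULL + TEXT_TOKENS)
             + LLM_LAYERS * (VISION_MINI + TEXT_TOKENS))

reduction = 1 - mini_cost / llava_cost
print(f"estimated FLOPs reduction: {reduction:.0%}")  # ~78%, near the reported 77%
```

Under these assumptions the estimate lands within a point of the paper's 77%, which suggests the saving is dominated by token count rather than any per-layer change.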
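The compression hyperparameter C determines how many vision tokens survive: the encoder's 576 patch tokens are pooled down to C×C tokens (1 for LLaVA-Mini, 64 for the HD variant). A minimal NumPy sketch of query-based cross-attention pooling; the single-head attention, hidden size 64, and random "learnable" queries are illustrative assumptions, not the paper's exact compression module:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compress_vision_tokens(patch_tokens, queries):
    """Pool patch tokens into len(queries) vision tokens via cross-attention."""
    d = patch_tokens.shape[-1]
    attn = softmax(queries @ patch_tokens.T / np.sqrt(d))  # (C*C, 576) weights
    return attn @ patch_tokens                             # (C*C, d) pooled tokens

rng = np.random.default_rng(0)
patches = rng.standard_normal((576, 64))  # 24x24 patch grid from CLIP ViT-L/336px
queries = rng.standard_normal((1, 64))    # C = 1 -> one query, one vision token
token = compress_vision_tokens(patches, queries)
print(token.shape)  # (1, 64)
```

With C = 8 the same function would take 64 queries and emit 64 tokens, matching LLaVA-Mini-HD's configuration.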
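The 10,000-frame claim on a 24GB GPU also follows from the token budget: at 1 fps and C = 1, each frame contributes one token to the LLM's KV cache instead of 576. A rough sketch of that cache arithmetic for a Vicuna-7B-style backbone; the 32 layers, 4096 hidden dim, and fp16 storage are standard 7B-model dimensions but are my assumptions here, and real memory use adds weights and activations on top:

```python
# Per-token KV-cache bytes: layers x (K and V) x hidden dim x 2 bytes (fp16).
KV_BYTES_PER_TOKEN = 32 * 2 * 4096 * 2   # = 0.5 MiB per cached token (assumed dims)
FRAMES = 10_000

mini_gib = FRAMES * 1 * KV_BYTES_PER_TOKEN / 2**30    # 1 vision token per frame
full_gib = FRAMES * 576 * KV_BYTES_PER_TOKEN / 2**30  # 576 tokens per frame

print(f"LLaVA-Mini KV cache:  {mini_gib:.1f} GiB")   # ~4.9 GiB, leaves room for weights
print(f"576-token baseline:   {full_gib:.0f} GiB")   # thousands of GiB, infeasible
```

Even before counting model weights, the 576-token-per-frame baseline overflows any single GPU at this frame count, while the one-token variant stays comfortably under 24GB.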