mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

Authors: Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct evaluations on 21 benchmarks that cover single/multi-image and short/long video understanding. mPLUG-Owl3 achieves competitive performance with state-of-the-art methods while reducing inference time and memory usage by 87.8% and 48.5% on average. Moreover, we propose a Distractor Resistance evaluation to assess the ability of models to maintain focus amidst distractions.
Researcher Affiliation Industry Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou (Alibaba Group)
Pseudocode No The paper includes architectural diagrams and workflow illustrations, but it does not contain any sections explicitly labeled as "Pseudocode" or "Algorithm", nor does it present structured code-like procedural steps.
Open Source Code Yes https://github.com/X-PLUG/mPLUG-Owl
Open Datasets Yes We follow mPLUG-Owl2 (Ye et al., 2024) to collect the pre-training datasets and randomly sample a subset consisting of 41 million image-text pairs for pre-training. In the multi-image pre-training stage, we collected three types of data to enhance the model's multi-image understanding capabilities: (1) Interleaved data. We utilize sources such as MMDU (Liu et al., 2024c) and M4-Instruction (Li et al., 2024) for multi-image data. ... (3) Video data. We adopt annotated data from ShareGPTVideo (Zhang et al., 2024b), which includes 900K caption entries and 240K question-answering instances. We also incorporate Chinese and English video caption data from VATEX (Wang et al., 2019).
Dataset Splits Yes We adopt a three-stage training approach for mPLUG-Owl3. Initially, we pre-train mPLUG-Owl3 using image-text pairs to achieve robust multimodal alignment. In the second stage, we leverage diverse datasets that include image and video captions to enhance the model's ability to understand multiple images. Finally, we fine-tune mPLUG-Owl3 using a mixture of supervised data, encompassing tasks involving both single and multiple images, to ensure comprehensive performance. The statistics of the datasets, the training settings, and data processing details can be found in Appendix A. We train three sizes of mPLUG-Owl3, based on Qwen2 at sizes of 0.5B, 1.5B, and 7B. All three models share the same visual encoder. ... Specifically, we take samples from the MMBench dev set. For each test sample, we randomly select N-1 images from the original MMBench dev set as distractors and construct the model input in the format Image 1: <|image|> Image 2: <|image|> ... Image N: <|image|>. In Image X, {question}, where N = 1, 5, 10, 20, 50, 100, 200, 400 and X denotes the index of the image corresponding to the question. We use CircularEval to measure accuracy scores.
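The Distractor Resistance prompt format quoted above can be sketched as a small helper. This is a minimal illustration of the template described in the paper, not the authors' code; the function name and whitespace joining are assumptions.

```python
def build_distractor_input(question: str, target_index: int, n_images: int) -> str:
    """Build a Distractor Resistance prompt: the question refers to one image
    (target_index, 1-based) among n_images, the rest acting as distractors.
    Follows the template "Image 1: <|image|> ... Image N: <|image|>. In Image X, {question}".
    """
    assert 1 <= target_index <= n_images
    # One <|image|> placeholder slot per input image.
    image_slots = " ".join(f"Image {i}: <|image|>" for i in range(1, n_images + 1))
    return f"{image_slots}. In Image {target_index}, {question}"
```

For N = 1 the prompt degenerates to the original single-image MMBench question; larger N values (up to 400) stress the model's ability to attend to the correct image.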
Hardware Specification Yes For LLaVA-Next-Interleave, we input 8 frames, while for mPLUG-Owl3, we input 128 frames; these are the maximum numbers of images the two models can accommodate on a V100-32G.
Software Dependencies No The paper mentions using "Siglip-400m (Zhai et al., 2023) as the visual encoder and Qwen2 (Yang et al., 2024) as the language model." However, it does not provide specific version numbers for these models or any other software libraries or frameworks used in the implementation.
Experiment Setup Yes We adopt a three-stage training approach for mPLUG-Owl3. ... The statistics of the datasets, the training settings, and data processing details can be found in Appendix A. ... Table 8: Training settings across the three stages (Pre-training / Multi-Image Training / Supervised Fine-tuning):
- Learning Rate (Max, Min): (1e-3, 1e-5) / (2e-5, 1e-7) / (2e-5, 1e-7)
- Global Batch Size: 2048 / 1024 / 1024
- Training Steps: 20K / 3K / 11K
- Warmup Ratio: 0.03 (all stages)
- Trainable Modules: Linear Projection, Visual KV Projection, Adaptive Gate / Linear Projection, Full Language Model / Linear Projection, Full Language Model
- Model Resolution: 384^2 / up to 384^2 x 6 / up to 384^2 x 6
- Sequence Length: 768 / 4096 / 4096
- Accelerating Precision: Mixed-precision FP16/BF16 (all stages)
- ZeRO Optimization: ZeRO-1 (all stages)
- Gradient Checkpointing: No / Yes / Yes
- Model Parallel: TP=1 / TP=4 / TP=4
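The Table 8 settings above can be captured as plain configuration dicts, which makes the per-stage differences easy to diff programmatically. This is a hedged sketch: the key names and stage labels are illustrative, while the values are transcribed from the quoted table.

```python
# Per-stage training settings from Table 8 (values as reported in the paper;
# dict/key names are illustrative, not the authors' config schema).
TRAINING_STAGES = {
    "pretraining": {
        "lr_max": 1e-3, "lr_min": 1e-5,
        "global_batch_size": 2048, "steps": 20_000,
        "trainable": ["linear_projection", "visual_kv_projection", "adaptive_gate"],
        "resolution": "384^2", "seq_len": 768,
        "gradient_checkpointing": False, "tensor_parallel": 1,
    },
    "multi_image": {
        "lr_max": 2e-5, "lr_min": 1e-7,
        "global_batch_size": 1024, "steps": 3_000,
        "trainable": ["linear_projection", "full_language_model"],
        "resolution": "up to 384^2 x 6", "seq_len": 4096,
        "gradient_checkpointing": True, "tensor_parallel": 4,
    },
    "supervised_finetuning": {
        "lr_max": 2e-5, "lr_min": 1e-7,
        "global_batch_size": 1024, "steps": 11_000,
        "trainable": ["linear_projection", "full_language_model"],
        "resolution": "up to 384^2 x 6", "seq_len": 4096,
        "gradient_checkpointing": True, "tensor_parallel": 4,
    },
}
# Shared across all stages: warmup ratio 0.03, mixed-precision FP16/BF16, ZeRO-1.
```

Note the pattern: only the first stage trains at a high learning rate with a frozen language model; the later stages unfreeze the full language model and compensate with a much lower learning rate and tensor parallelism (TP=4).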