mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
Authors: Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct evaluations on 21 benchmarks that cover single/multi-image and short/long video understanding. mPLUG-Owl3 achieves competitive performance with state-of-the-art methods while reducing inference time and memory usage by 87.8% and 48.5% on average. Moreover, we propose a Distractor Resistance evaluation to assess the ability of models to maintain focus amidst distractions. |
| Researcher Affiliation | Industry | Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou (Alibaba Group) |
| Pseudocode | No | The paper includes architectural diagrams and workflow illustrations, but it does not contain any sections explicitly labeled as "Pseudocode" or "Algorithm", nor does it present structured code-like procedural steps. |
| Open Source Code | Yes | https://github.com/X-PLUG/mPLUG-Owl |
| Open Datasets | Yes | We follow mPLUG-Owl2 (Ye et al., 2024) to collect the pre-training datasets and randomly sample a subset consisting of 41 million image-text pairs for pre-training. In the multi-image pre-training stage, we collected three types of data to enhance the model's multi-image understanding capabilities: (1) Interleaved data. We utilize sources such as MMDU (Liu et al., 2024c) and M4-Instruction (Li et al., 2024) for multi-image data. ... (3) Video data. We adopt annotated data from ShareGPTVideo (Zhang et al., 2024b), which includes 900K caption entries and 240K question-answering instances. We also incorporate Chinese and English video caption data from VATEX (Wang et al., 2019). |
| Dataset Splits | Yes | We adopt a three-stage training approach for mPLUG-Owl3. Initially, we pre-train mPLUG-Owl3 using image-text pairs to achieve robust multimodal alignment. In the second stage, we leverage diverse datasets that include image and video captions to enhance the model's ability to understand multiple images. Finally, we fine-tune mPLUG-Owl3 using a mixture of supervised data, encompassing tasks involving both single and multiple images, to ensure comprehensive performance. The statistics of the datasets, the training settings and data processing details can be found in Appendix A. We train three sizes of mPLUG-Owl3, based on Qwen2 with sizes of 0.5B, 1.5B, and 7B. All three models share the same visual encoder. ... Specifically, we take samples from the MMBench dev set. For each test sample, we randomly select N − 1 images from the original MMBench dev set as distractors and construct the model input in the format of Image 1: <\|image\|> Image 2: <\|image\|> ... Image N: <\|image\|>. In Image X, {question}, where N = 1, 5, 10, 20, 50, 100, 200, 400 and X denotes the index of the image corresponding to the question. We use CircularEval to measure the accuracy scores. |
| Hardware Specification | Yes | For LLaVA-Next-Interleave, we input 8 frames, while for mPLUG-Owl3, we input 128 frames, which are the maximum numbers of images that can be accommodated by the two models on a V100-32G. |
| Software Dependencies | No | The paper mentions using "Siglip-400m (Zhai et al., 2023) as the visual encoder and Qwen2 (Yang et al., 2024) as the language model." However, it does not provide specific version numbers for these models or any other software libraries or frameworks used in the implementation. |
| Experiment Setup | Yes | We adopt a three-stage training approach for mPLUG-Owl3. ... The statistics of the datasets, the training settings and data processing details can be found in Appendix A. ... Table 8: Training settings across the three stages (Pretraining / Multi-Image Training / Supervised Finetuning): Learning Rate (Max, Min): (1e-3, 1e-5) / (2e-5, 1e-7) / (2e-5, 1e-7); Global Batch Size: 2048 / 1024 / 1024; Training Steps: 20K / 3K / 11K; Warmup Ratio: 0.03; Trainable Modules: Linear Projection / Visual KV Projection, Adaptive Gate, Linear Projection, Full Language Model / Linear Projection, Full Language Model; Model Resolution: 384^2 / up to 384^2 × 6 / up to 384^2 × 6; Sequence Length: 768 / 4096 / 4096; Precision: Mixed-precision FP16/BF16; ZeRO Optimization: ZeRO-1; Gradient Checkpointing: No / Yes / Yes; Model Parallel: TP=1 / TP=4 / TP=4. |
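The Distractor Resistance input construction quoted under "Dataset Splits" can be sketched in a few lines. This is a minimal illustration of the prompt format described in the paper, not the authors' code: the function name `build_distractor_prompt`, the random target-index choice, and the `<|image|>` placeholder handling are assumptions for the sketch.

```python
import random

IMAGE_TOKEN = "<|image|>"  # multimodal image placeholder, as in the paper's format string

def build_distractor_prompt(question: str, n_images: int, rng: random.Random):
    """Build a prompt with N image slots (one target among N-1 distractors),
    formatted as in the paper:
    'Image 1: <|image|> Image 2: <|image|> ... Image N: <|image|>. In Image X, {question}'
    Returns the prompt string and the 1-based index X of the target image."""
    x = rng.randint(1, n_images)  # position of the real (non-distractor) image; assumed random
    header = " ".join(f"Image {i}: {IMAGE_TOKEN}" for i in range(1, n_images + 1))
    return f"{header}. In Image {x}, {question}", x

# Example with N = 5 (the paper sweeps N over 1, 5, 10, 20, 50, 100, 200, 400)
prompt, target = build_distractor_prompt("what color is the bus?", 5, random.Random(0))
```

The N − 1 distractor images themselves would be sampled from the MMBench dev set and supplied alongside the prompt in slot order, with the target image placed at slot X.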