Long Context Transfer from Language to Vision
Authors: Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, Ziwei Liu
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To effectively measure LMMs' ability to generalize to long contexts in the vision modality, we develop V-NIAH (Visual Needle-In-A-Haystack), a purely synthetic long vision benchmark inspired by the language model's NIAH test. Our proposed Long Video Assistant (LongVA) can process 2000 frames or over 200K visual tokens without additional complexities. With its extended context length, LongVA achieves state-of-the-art performance on Video-MME and MLVU among 7B-scale models by densely sampling more input frames. 5 Experiments: We primarily assess the long visual capability of LongVA on two benchmarks: V-NIAH (Section 5.1) and Video-MME (Fu et al., 2024a) (Section 5.2). |
| Researcher Affiliation | Academia | Peiyuan Zhang EMAIL LMMs-Lab & S-Lab, Nanyang Technological University Kaichen Zhang EMAIL LMMs-Lab & S-Lab, Nanyang Technological University Bo Li EMAIL LMMs-Lab & S-Lab, Nanyang Technological University Guangtao Zeng EMAIL Singapore University of Technology and Design Jingkang Yang EMAIL LMMs-Lab & S-Lab, Nanyang Technological University Yuanhan Zhang EMAIL LMMs-Lab & S-Lab, Nanyang Technological University Ziyue Wang EMAIL S-Lab, Nanyang Technological University Haoran Tan EMAIL S-Lab, Nanyang Technological University Chunyuan Li EMAIL LMMs-Lab Ziwei Liu EMAIL LMMs-Lab & S-Lab, Nanyang Technological University |
| Pseudocode | No | The paper describes methods and processes verbally and through conceptual diagrams (e.g., Figure 1, Figure 2, Figure 8) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Additionally, we commit to releasing all datasets and code with appropriate licenses and clear usage guidelines to discourage unethical applications. |
| Open Datasets | Yes | To effectively measure LMMs' ability to generalize to long contexts in the vision modality, we develop V-NIAH (Visual Needle-In-A-Haystack), a purely synthetic long vision benchmark inspired by the language model's NIAH test. Our model, Long Video Assistant (LongVA), is capable of accurately retrieving visual information from 2000 frames or more than 200K visual tokens. Experiments show that additional frames during inference lead to improved performance on long video question-answering benchmarks, and LongVA achieves state-of-the-art performance among 7B models on the Video-MME (Fu et al., 2024a) and MLVU (Zhou et al., 2024a) datasets. We follow Xiong et al. (2023) to increase the RoPE (Su et al., 2023) base frequency during the continued pretraining and specifically set it to 1B. A constant learning rate of 1e-5 is maintained for a batch size of one million tokens across 1,000 training steps. Following Fu et al. (2024b), we construct the dataset used for long context training from SlimPajama (Cerebras, 2023) by upsampling documents longer than 4096 tokens and keeping the domain mixture ratio unchanged. We trained our model using the same data recipe and two-stage training approach as LLaVA-1.6. We perform a lightweight Direct Preference Optimization (DPO) on the LLaVA-Hound-DPO (Zhang et al., 2024a) dataset. |
| Dataset Splits | No | The paper mentions training data and testing on benchmarks like Video-MME and MLVU, but it does not explicitly provide specific details about the training/test/validation splits for these datasets (e.g., exact percentages, sample counts, or citations to specific split methodologies beyond using a 'data recipe'). For example: 'we adopt a train-short, test-long protocol where we only use image-text data during training, but test on long videos.' and 'We trained our model using the same data recipe and two-stage training approach as LLaVA-1.6.' |
| Hardware Specification | Yes | 224K is the maximum we can fit with 8 A100-80G for Qwen-2-7B. The long context training can finish in 2 days with 8 A100 GPUs. |
| Software Dependencies | No | The paper mentions specific models and techniques (e.g., Qwen2-7B-Instruct, Flash Attention-2, Ring Attention, RoPE) but does not provide specific version numbers for software libraries or environments required for replication (e.g., Python version, PyTorch version, CUDA version). |
| Experiment Setup | Yes | We use Qwen2-7B-Instruct (Team, 2024) as the backbone language model and perform continued pretraining with a context length of 224K over a total of 900M tokens. We follow Xiong et al. (2023) to increase the RoPE (Su et al., 2023) base frequency during the continued pretraining and specifically set it to 1B. A constant learning rate of 1e-5 is maintained for a batch size of one million tokens across 1,000 training steps. |
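The RoPE base-frequency change quoted above (raising the base from the usual 1e4 to 1e9) is the core of the long-context recipe. A minimal sketch of why this works, assuming the standard RoPE inverse-frequency formula (the dimension of 128 and the printed comparison are illustrative, not from the paper):

```python
import math

def rope_inv_freq(dim: int, base: float) -> list[float]:
    """Standard RoPE inverse frequencies: base^(-2i/dim) for each rotary pair."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

# Default base (1e4) vs. the long-context base the paper reports (1B = 1e9).
default_freqs = rope_inv_freq(128, 10_000.0)
long_freqs = rope_inv_freq(128, 1_000_000_000.0)

# A larger base shrinks every inverse frequency, so each rotary pair rotates
# more slowly and its wavelength (positions per full rotation) grows; the
# lowest-frequency band then spans far more positions before phases wrap.
wavelength_default = 2 * math.pi / default_freqs[-1]
wavelength_long = 2 * math.pi / long_freqs[-1]
print(wavelength_long > wavelength_default)
```

This is only a sketch of the frequency scaling itself; the paper's actual training applies it inside Qwen2-7B-Instruct's attention layers during continued pretraining at a 224K context length.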