TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
Authors: Xiangyu Zeng, Kunchang Li, Chenting Wang, Xinhao Li, Tianxiang Jiang, Ziang Yan, Songze Li, Yansong Shi, Zhengrong Yue, Yi Wang, Yali Wang, Yu Qiao, Limin Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that our TimeSuite provides a successful solution to enhance the long video understanding capability of short-form MLLM, achieving improvement of 5.6% and 6.8% on the benchmarks of EgoSchema and VideoMME, respectively. In addition, VideoChat-T exhibits robust zero-shot temporal grounding capabilities, significantly outperforming the existing state-of-the-art MLLMs. |
| Researcher Affiliation | Academia | 1 Nanjing University; 2 Shanghai AI Laboratory; 3 SIAT, Chinese Academy of Sciences; 4 Zhejiang University; 5 University of Science and Technology of China; 6 Shanghai Jiao Tong University; 7 Fudan University; 8 Shanghai Innovation Institute. The email EMAIL further indicates an academic affiliation. |
| Pseudocode | Yes | Algorithm 1 PyTorch snippet of TAPE. |
| Open Source Code | Yes | Our code and dataset are available at https://github.com/OpenGVLab/TimeSuite. |
| Open Datasets | Yes | Finally, we collect a comprehensive grounding-centric instruction tuning dataset for post-training our designed MLLMs, which is composed of 349K high-quality annotations covering 9 tasks. Based on this new dataset, we are able to perform grounded tuning with detailed captions on our proposed MLLMs (coined as VideoChat-T). We collect and clean several existing high-quality grounding-centric datasets (Ren et al., 2024; Huang et al., 2024a;b), and create two new datasets, resulting in the TimePro. Temporal Video Grounding involves identifying the start and end times of video content based on a natural language query (Anne Hendricks et al., 2017; Oncescu et al., 2021; Zala et al., 2023). Dense Video Captioning requires detecting events within a video and providing corresponding timestamps and descriptions (Krishna et al., 2017; Huang et al., 2020; Zhou et al., 2018). Video Summarization focuses on determining key frames or clips in the form of timestamps rather than semantic summaries (Song et al., 2015; Gygli et al., 2014). Step Localization aims to segment and describe important steps in a long video (Tang et al., 2019; Zala et al., 2023). Transcribed Speech Generation predicts speech content and its timestamps from visual signals (Zellers et al., 2022). Reasoning Temporal Localization combines timestamps with explanatory answers (Huang et al., 2024b). Multi-format Temporal Grounding includes single-turn and multi-turn dialogues with diverse question types (Huang et al., 2024a). Highlight Detection identifies the most significant moments in a video based on a query (Lei et al., 2021a). |
| Dataset Splits | Yes | We fine-tune the model for 3 epochs using the TimePro with 349K instances and a general QA task dataset with 82K instances. We use EgoSchema (Mangalam et al., 2023) and VideoMME (Fu et al., 2024) to evaluate the long video capabilities of VideoChat-T. In conjunction with our proposed architectural improvements, we incrementally fine-tune VideoChat2 using only 432K data points. We use MVBench (Li et al., 2024b) to evaluate the general short video understanding capabilities of VideoChat-T. This task aims to identify the start and end timestamps of the video content described by the query sentence, using Charades-STA as the evaluation benchmark. We use QVHighlights as the evaluation benchmark. |
| Hardware Specification | Yes | All experiments are conducted on 16 A100 GPUs. |
| Software Dependencies | No | The paper mentions using UMT-L (Li et al., 2023c) and Mistral-7B (Jiang et al., 2023) as the video encoder and LLM, respectively, and refers to a 'PyTorch snippet' for TAPE, but specific version numbers for these software components are not provided. |
| Experiment Setup | Yes | Table 8 lists the hyperparameters used during different epochs of the training process. In the first epoch, we used a larger number of input frames and froze the TAPE. At the beginning of the second epoch, we unfroze the TAPE and fixed the model's input frames to 128. Following the settings of VideoChat2, we integrated the LoRA module into the LLM and applied flash attention to accelerate the training process. Table 8 (epoch 1 / epochs 2&3): input frames 192 / 128; max text length 1536 / 1024; freeze TAPE True / False; learning rate 2e-5 / 1.5e-5. Shared across epochs: input resolution 224; clip frames 8; merge length 4; QFormer tokens per clip 96; LoRA rank 16; LoRA alpha 32; LoRA dropout 0.1; batch size per GPU 2; optimizer AdamW (momentum 0.9, 0.999); weight decay 0.02; learning rate schedule cosine decay. |
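To make the flattened Table 8 easier to check against a reimplementation, the per-epoch settings can be expressed in machine-readable form. This is a reconstruction of the reported hyperparameters only, not the authors' training code; the function name `timesuite_hparams` and key names are our own, and the epoch split (1 vs. 2&3) follows the table's two columns.

```python
def timesuite_hparams(epoch: int) -> dict:
    """Hedged reconstruction of Table 8 from the TimeSuite paper:
    per-epoch fine-tuning hyperparameters for VideoChat-T.
    Not the authors' code; key names are our own choice."""
    # Settings shared by all three epochs.
    hparams = {
        "input_resolution": 224,
        "clip_frames": 8,
        "merge_length": 4,
        "qformer_tokens_per_clip": 96,
        "lora_rank": 16,
        "lora_alpha": 32,
        "lora_dropout": 0.1,
        "batch_size_per_gpu": 2,
        "optimizer": "AdamW",
        "optimizer_momentum": (0.9, 0.999),
        "weight_decay": 0.02,
        "lr_schedule": "cosine decay",
    }
    if epoch == 1:
        # Epoch 1: more frames, longer text, TAPE frozen.
        hparams.update(input_frames=192, max_text_length=1536,
                       freeze_tape=True, learning_rate=2e-5)
    else:
        # Epochs 2 and 3: TAPE unfrozen, 128 input frames.
        hparams.update(input_frames=128, max_text_length=1024,
                       freeze_tape=False, learning_rate=1.5e-5)
    return hparams
```

For example, `timesuite_hparams(1)["freeze_tape"]` is `True`, matching the paper's description of freezing TAPE during the first epoch.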