Temporal Reasoning Transfer from Text to Video
Authors: Lei Li, Yuanxin Liu, Linli Yao, Peiyuan Zhang, Chenxin An, Lean Wang, Xu Sun, Lingpeng Kong, Qi Liu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our paper takes a different approach by decomposing Video LLMs into two parts and asking a fundamental question: What is the bottleneck of this limitation? Is it due to limitations in the vision encoder, or, surprisingly, shortcomings in the LLM itself? We conduct probing experiments using synthesized videos for basic temporal-related video question-answering (QA) tasks, allowing full control over temporal aspects. ... Our experiments reveal a striking contrast in temporal reasoning capabilities between different components of Video LLMs. Probe classifiers trained on video embeddings achieve near-perfect accuracy (> 90% in most cases)... To establish a connection between textual temporal reasoning ability and video comprehension, we adopt the fine-grained temporal understanding benchmark TempCompass (Liu et al., 2024b) (Multiple-Choice subset)... For a comprehensive assessment of long-form video understanding capabilities, we choose two challenging benchmarks: MLVU (Zhou et al., 2024) and Video-MME (Fu et al., 2024). Tables 2, 3, and 4 present evaluation results... |
| Researcher Affiliation | Academia | 1 The University of Hong Kong; 2 Peking University; 3 University of California, San Diego |
| Pseudocode | Yes | Algorithm 1: Textual temporal QA generation for the Order aspect using GPT-4-turbo. Algorithm 2: Textual temporal QA generation for the Order aspect using templates. Algorithm 3: Textual temporal QA generation for the Attribute aspect. Algorithm 4: Textual temporal QA generation for the Temporal Referring aspect. Algorithm 5: Textual temporal QA generation for the Temporal Grounding aspect. |
| Open Source Code | No | The paper includes a project page link (https://video-t3.github.io) in a footnote, but it does not explicitly state that the source code for the methodology described in the paper is available at this link or in supplementary materials. The criteria for 'Yes' require a specific repository link or an explicit statement of code release for *their* work, which is not present. |
| Open Datasets | Yes | The contextual information of our textual QA data is sourced from the detailed image captions in the LLaVA-ReCap-558K dataset. These captions are generated by the LLaVA-Next-34B model (Liu et al., 2024a) based on images from LCS-558K (a subset of the LAION/CC/SBU dataset). We only retain the first sentence of the detailed image captions to form the caption pool Cpool. ... For Brightness, we collected static images from the COCO dataset (Lin et al., 2014) and adjusted pixel values to synthesize videos with brightness variations. |
| Dataset Splits | Yes | To create distinct training and test sets for the classifier probe, we collected videos of each category with both black and white backgrounds. The black background videos were used for training the probe, while the white background videos were reserved for evaluation. ... For each task, we also create a validation set consisting of 500 samples for later verification. ... According to our textual validation accuracy, we set the ratio of textual temporal QA and original data to 1:2 and the total samples to 200k. |
| Hardware Specification | Yes | The model is trained on the corresponding dataset for one epoch, a process that can be completed within 5 hours using 8 H100 GPUs. |
| Software Dependencies | No | The paper mentions using Adam as the optimizer and fine-tuning with the ms-swift framework (Zhao et al., 2024) with the default LoRA setup, but it does not provide specific version numbers for these or any other key software components like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | This involves using Adam (Kingma & Ba, 2015) as the optimizer, with learning rates of 2e-6 for the visual encoder and 1e-5 for the rest of the model. For the exploration of textual temporal reasoning transfer, we use 22k samples for fine-tuning across different augmentation datasets. ... We set the ratio of textual temporal QA and original data to 1:2 and the total samples to 200k. ... The model is trained on the corresponding dataset for one epoch... Table 13 (training hyper-parameters for the classifier probe; all aspects use Adam, learning rate 5e-5, batch size 64): Order (Two Events), 15 epochs; Order (Three Events), 120 epochs; Attribute, 80 epochs; Temporal Referring, 120 epochs; Temporal Grounding, 120 epochs. |
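The classifier-probe recipe quoted in the last two rows (a probe trained with Adam, learning rate 5e-5, batch size 64, on frozen video embeddings) can be sketched as plain logistic regression. This is a minimal illustrative sketch, not the paper's code: the embedding dimension, the synthetic Gaussian "embeddings" standing in for the two temporal classes, and the function names are all assumptions.

```python
import numpy as np

def train_linear_probe(X, y, lr=5e-5, batch_size=64, epochs=15, seed=0):
    """Logistic-regression probe trained with Adam, mirroring the reported
    probe hyper-parameters (lr 5e-5, batch 64; epochs vary by aspect).
    X: (n, d) frozen embeddings; y: (n,) binary labels."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    # Adam moment estimates for w and b
    mw, vw, mb, vb = np.zeros(d), np.zeros(d), 0.0, 0.0
    beta1, beta2, eps, t = 0.9, 0.999, 1e-8, 0
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            bi = order[start:start + batch_size]
            p = 1.0 / (1.0 + np.exp(-(X[bi] @ w + b)))   # sigmoid probs
            gz = (p - y[bi]) / len(bi)                   # dBCE/dlogits
            gw, gb = X[bi].T @ gz, gz.sum()
            t += 1
            mw = beta1 * mw + (1 - beta1) * gw
            vw = beta2 * vw + (1 - beta2) * gw ** 2
            mb = beta1 * mb + (1 - beta1) * gb
            vb = beta2 * vb + (1 - beta2) * gb ** 2
            # bias-corrected Adam update
            w -= lr * (mw / (1 - beta1 ** t)) / (np.sqrt(vw / (1 - beta2 ** t)) + eps)
            b -= lr * (mb / (1 - beta1 ** t)) / (np.sqrt(vb / (1 - beta2 ** t)) + eps)
    return w, b

def bce_loss(X, y, w, b):
    """Numerically stable binary cross-entropy."""
    z = X @ w + b
    return np.mean(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z))))
```

With well-separated synthetic embeddings (e.g. two 32-d Gaussian clusters), even 15 epochs at this small learning rate moves the probe in the right direction, which matches the paper's observation that video embeddings are close to linearly separable for these temporal tasks. The black-background/white-background split quoted under Dataset Splits would correspond to training and evaluating this probe on two disjoint embedding sets.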