Wolf: Dense Video Captioning with a World Summarization Framework
Authors: Boyi Li, Ligeng Zhu, Ran Tian, Shuhan Tan, Yuxiao Chen, Yao Lu, Yin Cui, Sushant Veer, Max Ehrlich, Jonah Philion, Xinshuo Weng, Fuzhao Xue, Linxi Fan, Yuke Zhu, Jan Kautz, Andrew Tao, Ming-Yu Liu, Sanja Fidler, Boris Ivanovic, Trevor Darrell, Jitendra Malik, Song Han, Marco Pavone
TMLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate caption quality, we introduce CapScore, an LLM-based metric to assess the similarity and quality of generated captions compared to the ground-truth captions. We further build four human-annotated datasets in three domains: autonomous driving, general scenes, and robotics, to facilitate comprehensive comparisons. We show that Wolf achieves superior captioning performance compared to state-of-the-art approaches from the research community (VILA-1.5, CogAgent) and commercial solutions (Gemini-Pro-1.5, GPT-4V). For instance, in comparison with GPT-4V, Wolf improves CapScore both quality-wise (by 55.6%) and similarity-wise (by 77.4%) on challenging driving videos. Finally, we establish a benchmark for video captioning and introduce a leaderboard, aiming to accelerate advancements in video understanding, captioning, and data alignment. |
| Researcher Affiliation | Collaboration | 1NVIDIA 2UC Berkeley 3MIT 4UT Austin 5University of Toronto 6Stanford University |
| Pseudocode | No | The paper describes the 'Wolf Framework' in Section 3 and provides an overview in Figure 1. It details the steps involved in cascading visual summarization and LLM-based video summarization in paragraph form, but it does not present any formal pseudocode blocks or algorithms. |
| Open Source Code | Yes | The code, data, and benchmark have been released at https://wolfv0.github.io/. Continuous efforts and improvements will be made to refine the Wolf Dataset, codebase, and CapScore. |
| Open Datasets | Yes | We introduce four benchmark datasets. These datasets include autonomous driving, general scenes from Pexels, and robotics videos, along with human-annotated captions, referred to as the Wolf Dataset. ... These include two autonomous driving video captioning datasets based on the open-sourced NuScenes (Caesar et al., 2019) dataset ... a general daily video captioning dataset from Pexels, and a robot manipulation video captioning dataset from an open-source robot learning dataset (Padalkar et al., 2023). ... The code, data, and benchmark have been released at https://wolfv0.github.io/. |
| Dataset Splits | Yes | To further verify the effectiveness of Wolf, we finetune VILA-1.5-7B based on Wolf's captions on 4,785 normal NuScenes videos and evaluate it on 500 highly interactive NuScenes videos, which have much more difficult captions and complex scenarios. |
| Hardware Specification | Yes | The training is performed on 8x A100 GPUs with batch size 8. |
| Software Dependencies | No | The paper mentions several models (e.g., GPT-4, Llama 3.2, VILA-1.5-7B, Gemini-Pro-1.5, Cog Agent) and general frameworks (PyTorch, HuggingFace pipeline) but does not provide specific version numbers for these software components to ensure reproducibility. For example, it cites 'PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation (Ansel et al., 2024)' but does not specify a PyTorch version used in their experiments. |
| Experiment Setup | Yes | Data Setup. We use four sets of data to evaluate the validity of Wolf: 1) 500 NuScenes interactive videos; 2) 4,785 NuScenes normal videos; 3) 473 general videos; and 4) 100 robotics videos. We extract 2 frames per second for autonomous driving videos. For robotics videos, we extract 1 frame per second. For short videos that would otherwise sample fewer frames, we increase the fps to capture more details. ... As the prompt for each captioning model, we use "elaborate on the visual and narrative elements of the video in detail, particularly the motion behavior". ... The training is performed on 8x A100 GPUs with batch size 8. We set the learning rate to 10^-4 with a warmup strategy. No weight decay is applied. |