Video Action Differencing
Authors: James Burgess, Xiaohan Wang, Yuhui Zhang, Anita Rau, Alejandro Lozano, Lisa Dunlap, Trevor Darrell, Serena Yeung-Levy
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that VidDiffBench poses a significant challenge for state-of-the-art large multimodal models (LMMs), such as GPT-4o and Qwen2-VL. By analyzing failure cases of LMMs on VidDiffBench, we highlight two key challenges for this task: localizing relevant sub-actions over two videos and fine-grained frame comparison. To overcome these, we propose the VidDiff method, an agentic workflow that breaks the task into three stages: action difference proposal, keyframe localization, and frame differencing, each stage utilizing specialized foundation models. To encourage future research in this new task, we release the benchmark and code. |
| Researcher Affiliation | Academia | James Burgess¹, Xiaohan Wang¹, Yuhui Zhang¹, Anita Rau¹, Alejandro Lozano¹, Lisa Dunlap², Trevor Darrell², Serena Yeung-Levy¹ (¹Stanford, ²UC Berkeley) |
| Pseudocode | No | The paper describes methods in prose and lists steps for its workflow but does not include any explicitly labeled pseudocode or algorithm blocks with structured, code-like formatting. |
| Open Source Code | Yes | To encourage future research in this new task, we release the benchmark and code. ... Project page: http://jmhb0.github.io/viddiff ... The code for the VidDiff method is available at the GitHub repo https://github.com/jmhb0/viddiff. |
| Open Datasets | Yes | To enable development on this new task, we first create VidDiffBench, a benchmark dataset containing 549 video pairs, with human annotations of 4,469 fine-grained action differences and 2,075 localization timestamps indicating where these differences occur. ... Benchmark: https://huggingface.co/datasets/jmhb/VidDiffBench ... Our benchmark consists of publicly available videos and our human-created annotations are freely available on Hugging Face Hub. |
| Dataset Splits | Yes | To account for varying levels of difficulty in VidDiffBench, we categorize actions into easy, medium, and hard splits. GPT-4o was used to assign actions to these splits based on descriptions, difference lists, and video lengths. The easy split includes simple movements like Fitness exercises, while medium and hard splits contain more complex actions like Ballsports, Diving, Music, and Surgery. This ensures that models are challenged across a range of difficulties, from basic movements to subtle, fine-grained comparisons. |
| Hardware Specification | Yes | Our method's runtime is less than one minute per video pair using an A6000 GPU for running CLIP inference (Radford et al., 2021). |
| Software Dependencies | Yes | For our experiments, we benchmark large multimodal models (LMMs) that have demonstrated strong performance in video tasks. Specifically, we use top models from the Video-MME benchmark (Fu et al., 2024): GPT-4o (Achiam et al., 2023), Gemini-1.5-Pro (Reid et al., 2024), Claude 3.5 Sonnet (Anthropic, 2024), and the leading open-source models, Qwen2-VL-7B (Wang et al., 2024; Bai et al., 2023) and LLaVA-Video (Zhang et al., 2024). ... Our method, VidDiff, is evaluated alongside these baselines, where the proposer LLM is gpt-4o-2024-08-06, the localizer embedding model is CLIP ViT-bigG-14, and the frame differencer VLM is gpt-4o-2024-08-06. |
| Experiment Setup | Yes | For categories with shorter, fine-grained actions (e.g., Fitness, Ballsports, and Diving), we sample frames at 4-6 fps, while for longer actions (e.g., Music and Surgery), we sample at 2 fps. ... Our method, VidDiff, is evaluated alongside these baselines, where the proposer LLM is gpt-4o-2024-08-06, the localizer embedding model is CLIP ViT-bigG-14, and the frame differencer VLM is gpt-4o-2024-08-06. |
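The category-dependent frame sampling quoted in the Experiment Setup row can be sketched as a small helper. This is our illustrative reconstruction, not code from the released repo: `sample_frame_indices` is a hypothetical name, and the exact per-category rates within the quoted 4-6 fps range are our assumption.

```python
# Target sampling rates per benchmark category, mirroring the quoted setup:
# short fine-grained actions at 4-6 fps, longer actions at 2 fps.
CATEGORY_FPS = {
    "Fitness": 4,
    "Ballsports": 5,
    "Diving": 6,
    "Music": 2,
    "Surgery": 2,
}

def sample_frame_indices(category: str, native_fps: float, n_frames: int) -> list[int]:
    """Return indices of frames sampled at the category's target rate."""
    target_fps = CATEGORY_FPS.get(category, 2)  # default to the conservative rate
    step = max(1, round(native_fps / target_fps))
    return list(range(0, n_frames, step))
```

For a 30 fps Music clip this keeps every 15th frame, while a 30 fps Diving clip keeps every 5th, so longer recordings stay within the models' frame budgets.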
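The second VidDiff stage, keyframe localization, matches each proposed sub-action against per-frame embeddings (CLIP ViT-bigG-14 in the paper). A minimal cosine-similarity sketch under our own assumptions follows: embeddings are assumed precomputed, and `localize_keyframe` is an illustrative function name, not the authors' API.

```python
import numpy as np

def localize_keyframe(text_emb: np.ndarray, frame_embs: np.ndarray) -> int:
    """Return the index of the frame whose embedding best matches the sub-action text.

    text_emb: (d,) embedding of the sub-action description.
    frame_embs: (n_frames, d) per-frame image embeddings.
    """
    # L2-normalize so dot products become cosine similarities.
    text = text_emb / np.linalg.norm(text_emb)
    frames = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    return int(np.argmax(frames @ text))
```

In the full workflow this localization would run per sub-action on both videos, so the frame-differencer VLM only compares temporally aligned frame pairs.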