Video Action Differencing

Authors: James Burgess, Xiaohan Wang, Yuhui Zhang, Anita Rau, Alejandro Lozano, Lisa Dunlap, Trevor Darrell, Serena Yeung-Levy

ICLR 2025

Each entry below lists a reproducibility variable, the assessed result, and the supporting LLM response quoted from the paper.
Research Type: Experimental
LLM response: "Our experiments demonstrate that VidDiffBench poses a significant challenge for state-of-the-art large multimodal models (LMMs), such as GPT-4o and Qwen2-VL. By analyzing failure cases of LMMs on VidDiffBench, we highlight two key challenges for this task: localizing relevant sub-actions over two videos and fine-grained frame comparison. To overcome these, we propose the VidDiff method, an agentic workflow that breaks the task into three stages: action difference proposal, keyframe localization, and frame differencing, each stage utilizing specialized foundation models. To encourage future research in this new task, we release the benchmark and code."
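The three-stage workflow quoted above (difference proposal, keyframe localization, frame differencing) can be sketched as a simple pipeline. This is a minimal illustration, not the paper's implementation: the helper functions below (`propose_differences`, `localize_keyframes`, `compare_frames`) are hypothetical stand-ins for the specialized foundation models, and the "videos" are toy lists of frame values.

```python
# Minimal sketch of a VidDiff-style three-stage workflow.
# All helpers are hypothetical stubs standing in for foundation models.

def propose_differences(action_description):
    # Stage 1: an LLM proposes candidate fine-grained differences for the action.
    return [f"difference in {step}" for step in ("setup", "execution", "follow-through")]

def localize_keyframes(video, candidate):
    # Stage 2: an embedding model (e.g. CLIP) would retrieve the frame most
    # relevant to the candidate difference; faked here with the middle frame.
    return len(video) // 2

def compare_frames(frame_a, frame_b, candidate):
    # Stage 3: a VLM would judge which video exhibits the difference; stubbed
    # here with a trivial numeric comparison.
    return "A" if frame_a < frame_b else "B"

def viddiff_pipeline(video_a, video_b, action_description):
    results = {}
    for candidate in propose_differences(action_description):
        idx_a = localize_keyframes(video_a, candidate)
        idx_b = localize_keyframes(video_b, candidate)
        results[candidate] = compare_frames(video_a[idx_a], video_b[idx_b], candidate)
    return results

# Toy "videos" as lists of frame intensities.
print(viddiff_pipeline([1, 2, 3, 4], [5, 6, 7, 8], "weighted squat"))
```

The point of the decomposition is that each stage uses the model class best suited to it (LLM for proposal, embedding model for retrieval, VLM for comparison) rather than asking one model to do everything end to end.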
Researcher Affiliation: Academia
LLM response: "James Burgess¹, Xiaohan Wang¹, Yuhui Zhang¹, Anita Rau¹, Alejandro Lozano¹, Lisa Dunlap², Trevor Darrell², Serena Yeung-Levy¹; ¹Stanford, ²UC Berkeley"
Pseudocode: No
LLM response: "The paper describes methods in prose and lists steps for its workflow but does not include any explicitly labeled pseudocode or algorithm blocks with structured, code-like formatting."
Open Source Code: Yes
LLM response: "To encourage future research in this new task, we release the benchmark and code. ... Project page: http://jmhb0.github.io/viddiff ... The code for the VidDiff method is available at this GitHub repo: https://github.com/jmhb0/viddiff."
Open Datasets: Yes
LLM response: "To enable development on this new task, we first create VidDiffBench, a benchmark dataset containing 549 video pairs, with human annotations of 4,469 fine-grained action differences and 2,075 localization timestamps indicating where these differences occur. ... Benchmark: https://huggingface.co/datasets/jmhb/VidDiffBench ... Our benchmark consists of publicly available videos and our human-created annotations are freely available on Hugging Face Hub."
Dataset Splits: Yes
LLM response: "To account for varying levels of difficulty in VidDiffBench, we categorize actions into easy, medium, and hard splits. GPT-4o was used to assign actions to these splits based on descriptions, difference lists, and video lengths. The easy split includes simple movements like Fitness exercises, while medium and hard splits contain more complex actions like Ballsports, Diving, Music, and Surgery. This ensures that models are challenged across a range of difficulties, from basic movements to subtle, fine-grained comparisons."
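Filtering the benchmark by difficulty split is straightforward if, as described above, each annotation record carries a split label. The records below are illustrative examples, not actual VidDiffBench entries, and the `split` field name is an assumption.

```python
# Hedged sketch: selecting benchmark samples by difficulty split.
# The records and the "split" field name are illustrative assumptions.
samples = [
    {"action": "squat", "category": "Fitness", "split": "easy"},
    {"action": "basketball free throw", "category": "Ballsports", "split": "medium"},
    {"action": "knot tying", "category": "Surgery", "split": "hard"},
]

def get_split(records, split):
    # Keep only the records assigned to the requested difficulty split.
    return [r for r in records if r["split"] == split]

print([r["action"] for r in get_split(samples, "easy")])  # ['squat']
```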
Hardware Specification: Yes
LLM response: "Our method's runtime is less than one minute per video pair using an A6000 GPU for running CLIP inference (Radford et al., 2021)."
Software Dependencies: Yes
LLM response: "For our experiments, we benchmark large multimodal models (LMMs) that have demonstrated strong performance in video tasks. Specifically, we use top models from the Video-MME benchmark (Fu et al., 2024): GPT-4o (Achiam et al., 2023), Gemini-1.5-Pro (Reid et al., 2024), Claude 3.5 Sonnet (Anthropic, 2024), and the leading open-source models, Qwen2-VL-7B (Wang et al., 2024; Bai et al., 2023) and LLaVA-Video (Zhang et al., 2024). ... Our method, VidDiff, is evaluated alongside these baselines, where the proposer LLM is gpt-4o-2024-08-06, the localizer embedding model is CLIP ViT-bigG-14, and the frame differencer VLM is gpt-4o-2024-08-06."
Experiment Setup: Yes
LLM response: "For categories with shorter, fine-grained actions (e.g., Fitness, Ballsports, and Diving), we sample frames at 4-6 fps, while for longer actions (e.g., Music and Surgery), we sample at 2 fps. ... Our method, VidDiff, is evaluated alongside these baselines, where the proposer LLM is gpt-4o-2024-08-06, the localizer embedding model is CLIP ViT-bigG-14, and the frame differencer VLM is gpt-4o-2024-08-06."
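The fps-based frame sampling quoted above amounts to subsampling a video's native frame rate down to a target rate. A minimal sketch, assuming uniform frame spacing (the function name and rounding behavior are illustrative choices, not the paper's code):

```python
# Sketch of fps-based frame subsampling: pick indices from a video with a
# native frame rate so that the kept frames approximate a target fps
# (e.g. 4-6 fps for short actions, 2 fps for long ones, as quoted above).

def sample_frame_indices(n_frames, native_fps, target_fps):
    # Step between kept frames; at least 1 so we never skip everything.
    step = max(1, round(native_fps / target_fps))
    return list(range(0, n_frames, step))

# A 30 fps, 3-second clip (90 frames) sampled at 2 fps -> every 15th frame.
print(sample_frame_indices(90, 30, 2))  # [0, 15, 30, 45, 60, 75]
```

Sampling fewer frames for long videos keeps the number of images passed to the LMM within its context budget while still covering the whole action.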