Video Action Differencing

Authors: James Burgess, Xiaohan Wang, Yuhui Zhang, Anita Rau, Alejandro Lozano, Lisa Dunlap, Trevor Darrell, Serena Yeung-Levy

ICLR 2025

Each entry below lists a reproducibility variable, the assessed result, and the supporting LLM response quoted from the paper.
Research Type: Experimental
LLM response: "Our experiments demonstrate that VidDiffBench poses a significant challenge for state-of-the-art large multimodal models (LMMs), such as GPT-4o and Qwen2-VL. By analyzing failure cases of LMMs on VidDiffBench, we highlight two key challenges for this task: localizing relevant sub-actions over two videos and fine-grained frame comparison. To overcome these, we propose the VidDiff method, an agentic workflow that breaks the task into three stages: action difference proposal, keyframe localization, and frame differencing, each stage utilizing specialized foundation models. To encourage future research in this new task, we release the benchmark and code."
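The three-stage workflow quoted above (difference proposal, keyframe localization, frame differencing) can be sketched as a simple pipeline. This is a minimal illustration, not the paper's implementation: the helper functions below (`propose_differences`, `localize_keyframes`, `compare_frames`) are hypothetical stand-ins for the specialized foundation models, and the "videos" are toy lists of frame values.

```python
# Minimal sketch of a VidDiff-style three-stage workflow.
# All helpers are hypothetical stubs standing in for foundation models.

def propose_differences(action_description):
    # Stage 1: an LLM proposes candidate fine-grained differences for the action.
    return [f"difference in {step}" for step in ("setup", "execution", "follow-through")]

def localize_keyframes(video, candidate):
    # Stage 2: an embedding model (e.g. CLIP) would retrieve the frame most
    # relevant to the candidate difference; faked here with the middle frame.
    return len(video) // 2

def compare_frames(frame_a, frame_b, candidate):
    # Stage 3: a VLM would judge which video exhibits the difference; stubbed
    # here with a trivial numeric comparison.
    return "A" if frame_a < frame_b else "B"

def viddiff_pipeline(video_a, video_b, action_description):
    results = {}
    for candidate in propose_differences(action_description):
        idx_a = localize_keyframes(video_a, candidate)
        idx_b = localize_keyframes(video_b, candidate)
        results[candidate] = compare_frames(video_a[idx_a], video_b[idx_b], candidate)
    return results

# Toy "videos" as lists of frame intensities.
print(viddiff_pipeline([1, 2, 3, 4], [5, 6, 7, 8], "weighted squat"))
```

The point of the decomposition is that each stage uses the model class best suited to it (LLM for proposal, embedding model for retrieval, VLM for comparison) rather than asking one model to do everything end to end.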
Researcher Affiliation: Academia
LLM response: "James Burgess¹, Xiaohan Wang¹, Yuhui Zhang¹, Anita Rau¹, Alejandro Lozano¹, Lisa Dunlap², Trevor Darrell², Serena Yeung-Levy¹; ¹Stanford, ²UC Berkeley"
Pseudocode: No
LLM response: "The paper describes methods in prose and lists steps for its workflow but does not include any explicitly labeled pseudocode or algorithm blocks with structured, code-like formatting."
Open Source Code: Yes
LLM response: "To encourage future research in this new task, we release the benchmark and code. ... Project page: http://jmhb0.github.io/viddiff ... The code for the VidDiff method is available at this GitHub repo: https://github.com/jmhb0/viddiff."
Open Datasets: Yes
LLM response: "To enable development on this new task, we first create VidDiffBench, a benchmark dataset containing 549 video pairs, with human annotations of 4,469 fine-grained action differences and 2,075 localization timestamps indicating where these differences occur. ... Benchmark: https://huggingface.co/datasets/jmhb/VidDiffBench ... Our benchmark consists of publicly available videos and our human-created annotations are freely available on Hugging Face Hub."
Dataset Splits: Yes
LLM response: "To account for varying levels of difficulty in VidDiffBench, we categorize actions into easy, medium, and hard splits. GPT-4o was used to assign actions to these splits based on descriptions, difference lists, and video lengths. The easy split includes simple movements like Fitness exercises, while medium and hard splits contain more complex actions like Ballsports, Diving, Music, and Surgery. This ensures that models are challenged across a range of difficulties, from basic movements to subtle, fine-grained comparisons."
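Filtering the benchmark by difficulty split is straightforward if, as described above, each annotation record carries a split label. The records below are illustrative examples, not actual VidDiffBench entries, and the `split` field name is an assumption.

```python
# Hedged sketch: selecting benchmark samples by difficulty split.
# The records and the "split" field name are illustrative assumptions.
samples = [
    {"action": "squat", "category": "Fitness", "split": "easy"},
    {"action": "basketball free throw", "category": "Ballsports", "split": "medium"},
    {"action": "knot tying", "category": "Surgery", "split": "hard"},
]

def get_split(records, split):
    # Keep only the records assigned to the requested difficulty split.
    return [r for r in records if r["split"] == split]

print([r["action"] for r in get_split(samples, "easy")])  # ['squat']
```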
Hardware Specification: Yes
LLM response: "Our method's runtime is less than one minute per video pair using an A6000 GPU for running CLIP inference (Radford et al., 2021)."
Software Dependencies: Yes
LLM response: "For our experiments, we benchmark large multimodal models (LMMs) that have demonstrated strong performance in video tasks. Specifically, we use top models from the Video-MME benchmark (Fu et al., 2024): GPT-4o (Achiam et al., 2023), Gemini-1.5-Pro (Reid et al., 2024), Claude 3.5 Sonnet (Anthropic, 2024), and the leading open-source models, Qwen2-VL-7B (Wang et al., 2024; Bai et al., 2023) and LLaVA-Video (Zhang et al., 2024). ... Our method, VidDiff, is evaluated alongside these baselines, where the proposer LLM is gpt-4o-2024-08-06, the localizer embedding model is CLIP ViT-bigG-14, and the frame differencer VLM is gpt-4o-2024-08-06."
Experiment Setup: Yes
LLM response: "For categories with shorter, fine-grained actions (e.g., Fitness, Ballsports, and Diving), we sample frames at 4-6 fps, while for longer actions (e.g., Music and Surgery), we sample at 2 fps. ... Our method, VidDiff, is evaluated alongside these baselines, where the proposer LLM is gpt-4o-2024-08-06, the localizer embedding model is CLIP ViT-bigG-14, and the frame differencer VLM is gpt-4o-2024-08-06."
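The fps-based frame sampling quoted above amounts to subsampling a video's native frame rate down to a target rate. A minimal sketch, assuming uniform frame spacing (the function name and rounding behavior are illustrative choices, not the paper's code):

```python
# Sketch of fps-based frame subsampling: pick indices from a video with a
# native frame rate so that the kept frames approximate a target fps
# (e.g. 4-6 fps for short actions, 2 fps for long ones, as quoted above).

def sample_frame_indices(n_frames, native_fps, target_fps):
    # Step between kept frames; at least 1 so we never skip everything.
    step = max(1, round(native_fps / target_fps))
    return list(range(0, n_frames, step))

# A 30 fps, 3-second clip (90 frames) sampled at 2 fps -> every 15th frame.
print(sample_frame_indices(90, 30, 2))  # [0, 15, 30, 45, 60, 75]
```

Sampling fewer frames for long videos keeps the number of images passed to the LMM within its context budget while still covering the whole action.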