ShotVL: Human-Centric Highlight Frame Retrieval via Language Queries
Authors: Wangyu Xue, Chen Qian, Jiayi Wu, Yang Zhou, Wentao Liu, Ju Ren, Siming Fan, Yaoxue Zhang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We highlight the impressive zero-shot capabilities of our model and offer comparative analyses with existing state-of-the-art (SOTA) models. ShotVL demonstrates a significant 64% improvement over InternVL on the BestShot Benchmark and a notable 68% improvement on the THUMOS14 Benchmark, while maintaining SOTA performance in general image classification and retrieval. We provide a comprehensive evaluation of existing methods and propose a robust baseline, ShotVL. Ablation Study: ablation experiments are conducted in four groups (Tab. 4). |
| Researcher Affiliation | Collaboration | Wangyu Xue*1, Chen Qian*1,2, Jiayi Wu2, Yang Zhou2, Wentao Liu2, Ju Ren1, Siming Fan2, Yaoxue Zhang1 — 1Department of Computer Science and Technology, Tsinghua University; 2SenseTime Research |
| Pseudocode | No | The paper includes figures describing pipelines (e.g., Figure 3: Annotation Pipeline of ShotGPT4o; Figure 5: Training and inference pipeline of ShotVL) but does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/ShotVL/ShotVL |
| Open Datasets | Yes | We have collected two distinct datasets: (i) ShotGPT4o Dataset... and (ii) Image-SMPLText Dataset... The images are sampled from the LAION-400M dataset and videos are sampled from the K700 dataset... Image-SMPLText Dataset, which adapts PoseScript for real-world videos to re-annotate over 13 public video datasets (Yi et al. 2023; von Marcard et al. 2018; Ionescu et al. 2014; Andriluka et al. 2018; Lin et al. 2023; Huang et al. 2022; Zhang et al. 2022; Cai et al. 2021; Kanazawa et al. 2019; Yang et al. 2023; Cheng et al. 2023). We added the COCO dataset (Chen et al. 2015), a high-quality human-written image caption dataset. |
| Dataset Splits | No | The paper mentions dividing queries for the BestShot Benchmark (6,000 queries into Content, Action, and Pose categories, 2,000 each) and mixing training datasets at specific ratios (SMPLText, ShotGPT4o, and General, at a ratio of 1:5:5). However, it does not explicitly provide train/validation/test splits for the primary training data or for reproducing the model training itself. |
| Hardware Specification | Yes | The ShotVL model was trained for 2,000 iterations, with batch size 1,536 and a learning rate of 1e-5, on 24 A100 GPUs for 20 hours. |
| Software Dependencies | No | The paper states that ShotVL follows the fine-tuning pipeline of InternVL and uses InternVL 14B as the base model, but it does not specify version numbers for underlying software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | The datasets are divided into 3 parts: SMPLText, ShotGPT4o, and General, with a ratio of 1:5:5, which ensures a stable balance between the BestShot task and general retrieval/classification tasks. The ShotVL model was trained for 2,000 iterations, with batch size 1,536 and a learning rate of 1e-5, on 24 A100 GPUs for 20 hours. |
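The reported setup (global batch size 1,536 across 24 A100 GPUs, datasets mixed at a 1:5:5 ratio) can be sketched as a simple weighted mixture sampler. This is an illustrative reconstruction only: the function and dataset labels below are assumptions, not taken from the paper's released code.

```python
import random

# Hyperparameters reported in the paper.
TOTAL_BATCH = 1536
NUM_GPUS = 24
PER_GPU_BATCH = TOTAL_BATCH // NUM_GPUS  # 64 samples per GPU per step

# Dataset mixing ratio SMPLText : ShotGPT4o : General = 1 : 5 : 5.
MIX_RATIO = {"SMPLText": 1, "ShotGPT4o": 5, "General": 5}

def sample_batch_sources(batch_size, ratio, rng=random):
    """Draw a source dataset for each sample in a batch,
    weighted by the given mixing ratio (hypothetical sampler)."""
    names = list(ratio)
    weights = [ratio[n] for n in names]
    return rng.choices(names, weights=weights, k=batch_size)

if __name__ == "__main__":
    random.seed(0)
    batch = sample_batch_sources(TOTAL_BATCH, MIX_RATIO)
    counts = {n: batch.count(n) for n in MIX_RATIO}
    # Expect roughly 1/11 SMPLText and 5/11 each for the other two.
    print(PER_GPU_BATCH, counts)
```

In expectation, each 1,536-sample batch contains about 140 SMPLText and about 698 each of ShotGPT4o and General samples; whether the paper sampled per-batch or pre-mixed the corpus at this ratio is not specified.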