ShotVL: Human-Centric Highlight Frame Retrieval via Language Queries
Authors: Wangyu Xue, Chen Qian, Jiayi Wu, Yang Zhou, Wentao Liu, Ju Ren, Siming Fan, Yaoxue Zhang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We highlight the impressive zero-shot capabilities of our model and offer comparative analyses with existing state-of-the-art (SOTA) models. ShotVL demonstrates a significant 64% improvement over InternVL on the BestShot Benchmark and a notable 68% improvement on the THUMOS14 Benchmark, while maintaining SOTA performance in general image classification and retrieval. We provide a comprehensive evaluation of existing methods and propose a robust baseline, ShotVL. Ablation Study: ablation experiments are conducted in four groups (Tab. 4). |
| Researcher Affiliation | Collaboration | Wangyu Xue*1, Chen Qian*1,2, Jiayi Wu2, Yang Zhou2, Wentao Liu2, Ju Ren1, Siming Fan2, Yaoxue Zhang1 — 1Department of Computer Science and Technology, Tsinghua University; 2SenseTime Research |
| Pseudocode | No | The paper includes figures describing pipelines (e.g., Figure 3: Annotation Pipeline of ShotGPT4o; Figure 5: Training and inference pipeline of ShotVL) but does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/ShotVL/ShotVL |
| Open Datasets | Yes | We have collected two distinct datasets: (i) ShotGPT4o Dataset... and (ii) Image-SMPLText Dataset... The images are sampled from the LAION-400M dataset and videos are sampled from the K700 dataset... Image-SMPLText Dataset, which adapts PoseScript for real-world videos to re-annotate over 13 public video datasets (Yi et al. 2023; von Marcard et al. 2018; Ionescu et al. 2014; Andriluka et al. 2018; Lin et al. 2023; Huang et al. 2022; Zhang et al. 2022; Cai et al. 2021; Kanazawa et al. 2019; Yang et al. 2023; Cheng et al. 2023). We added the COCO dataset (Chen et al. 2015), a high-quality human-written image caption dataset. |
| Dataset Splits | No | The paper mentions dividing queries for the BestShot Benchmark (6,000 queries into Content, Action, and Pose categories, 2,000 each) and mixing training datasets at specific ratios (SMPLText, ShotGPT4o, and General, at a ratio of 1:5:5). However, it does not explicitly provide train/validation/test splits for the primary training data or for reproducing the model training itself. |
| Hardware Specification | Yes | The ShotVL model was trained for 2,000 iterations, with batch size 1,536 and a learning rate of 1e-5, on 24 A100 GPUs for 20 hours. |
| Software Dependencies | No | The paper states that ShotVL follows the fine-tuning pipeline of InternVL and uses InternVL 14B as the base model, but it does not specify version numbers for underlying software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | The datasets are divided into 3 parts: SMPLText, ShotGPT4o, and General, with a ratio of 1:5:5, which ensures a stable balance between the BestShot task and general retrieval/classification tasks. The ShotVL model was trained for 2,000 iterations, with batch size 1,536 and a learning rate of 1e-5, on 24 A100 GPUs for 20 hours. |
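The reported setup (global batch size 1,536 across 24 A100 GPUs, datasets mixed at a 1:5:5 ratio) can be sketched as a simple weighted mixture sampler. This is an illustrative reconstruction only: the function and dataset labels below are assumptions, not taken from the paper's released code.

```python
import random

# Hyperparameters reported in the paper.
TOTAL_BATCH = 1536
NUM_GPUS = 24
PER_GPU_BATCH = TOTAL_BATCH // NUM_GPUS  # 64 samples per GPU per step

# Dataset mixing ratio SMPLText : ShotGPT4o : General = 1 : 5 : 5.
MIX_RATIO = {"SMPLText": 1, "ShotGPT4o": 5, "General": 5}

def sample_batch_sources(batch_size, ratio, rng=random):
    """Draw a source dataset for each sample in a batch,
    weighted by the given mixing ratio (hypothetical sampler)."""
    names = list(ratio)
    weights = [ratio[n] for n in names]
    return rng.choices(names, weights=weights, k=batch_size)

if __name__ == "__main__":
    random.seed(0)
    batch = sample_batch_sources(TOTAL_BATCH, MIX_RATIO)
    counts = {n: batch.count(n) for n in MIX_RATIO}
    # Expect roughly 1/11 SMPLText and 5/11 each for the other two.
    print(PER_GPU_BATCH, counts)
```

In expectation, each 1,536-sample batch contains about 140 SMPLText and about 698 each of ShotGPT4o and General samples; whether the paper sampled per-batch or pre-mixed the corpus at this ratio is not specified.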