Zero-shot Video Moment Retrieval via Off-the-shelf Multimodal Large Language Models

Authors: Yifang Xu, Yunzhuo Sun, Benxiang Zhai, Ming Li, Wenxin Liang, Yang Li, Sidan Du

AAAI 2025

Reproducibility

Variable — Result: LLM Response
Research Type — Experimental: "Extensive experimental results demonstrate that our method outperforms the SOTA MLLM-based and zero-shot models on several public datasets, including QVHighlights, ActivityNet-Captions, and Charades-STA."
Researcher Affiliation — Academia: Yifang Xu (1), Yunzhuo Sun (2), Benxiang Zhai (1), Ming Li (1), Wenxin Liang (2), Yang Li (1), Sidan Du (1); (1) Nanjing University, (2) Dalian University of Technology. EMAIL, EMAIL, EMAIL
Pseudocode — No: The paper describes query debiasing, span generation, and span selection in paragraph text only; for example, Section 3.3 describes the span generator as: "Firstly, we compute the inverse cumulative histogram of S^f_i, with η bins. We then traverse these bins in reverse order to find the first bin containing at least κ moments, using its left endpoint value as the adaptive threshold γ. Next, we iterate through S^f_i in temporal order. If S^f_{i,j} exceeds γ, the corresponding moment is marked as the starting moment. When the similarities of τ consecutive moments all fall below γ, we mark the final moment with a similarity exceeding γ as the ending moment. Finally, we repeat the above process to generate a set of candidate spans T^p from S^f."
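Since the paper gives this procedure only in prose, the quoted steps can be sketched in code. The sketch below is a reconstruction from that paragraph, not the authors' implementation; the function name, the use of `np.histogram`, and the interpretation of "inverse cumulative histogram" as accumulating bin counts from the highest bin downward are all assumptions.

```python
import numpy as np

def generate_spans(sims, n_bins=10, kappa=7, tau=5):
    """Sketch of the span generator described in Section 3.3.

    sims: 1-D array of moment-query similarities S^f_i.
    Returns a list of (start, end) moment indices (inclusive).
    """
    sims = np.asarray(sims, dtype=float)
    # Histogram the similarities, then traverse bins from high to low,
    # accumulating counts; the left edge of the first bin where the
    # accumulated count reaches kappa is the adaptive threshold gamma.
    counts, edges = np.histogram(sims, bins=n_bins)
    cum, gamma = 0, edges[0]
    for b in range(n_bins - 1, -1, -1):
        cum += counts[b]
        if cum >= kappa:
            gamma = edges[b]
            break
    # Scan moments in temporal order: open a span when the similarity
    # exceeds gamma; close it (at the last above-threshold moment) once
    # tau consecutive moments fall below gamma.
    spans, start, below, last_above = [], None, 0, None
    for j, s in enumerate(sims):
        if s > gamma:
            if start is None:
                start = j
            last_above, below = j, 0
        elif start is not None:
            below += 1
            if below >= tau:
                spans.append((start, last_above))
                start, below = None, 0
    if start is not None:  # span still open at the end of the video
        spans.append((start, last_above))
    return spans
```

For instance, a similarity sequence with two high-scoring runs separated by more than τ low-scoring moments yields two candidate spans.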
Open Source Code — No: The paper neither states that source code for Moment-GPT will be released nor links to a code repository for its method. It mentions GitHub only in the context of MiniGPT-v2, not the present work.
Open Datasets — Yes: "To evaluate our proposed method, we conduct experiments on three datasets with different topics: QVHighlights (Lei et al. 2021), Charades-STA (Gao et al. 2017), ActivityNet-Captions (Krishna et al. 2017)."
Dataset Splits — Yes: "We conduct experiments on three datasets with different topics: QVHighlights (Lei et al. 2021), Charades-STA (Gao et al. 2017), ActivityNet-Captions (Krishna et al. 2017)." Table 1 reports QVHighlights metrics on both the 'test' and 'val' sets, indicating the use of predefined splits for this benchmark.
Hardware Specification — Yes: "All experiments are conducted on 1 NVIDIA A100 GPU."
Software Dependencies — No: The paper names the MLLMs used (LLaMA-3-8B, MiniGPT-v2-7B, and Video-ChatGPT based on Vicuna-7B-v1.1) but gives no ancillary software details such as programming-language, library, or solver versions.
Experiment Setup — Yes: "Following previous works (Huang et al. 2023a; Lei et al. 2021), we set the frame rates of videos from Charades-STA, ActivityNet-Captions, and QVHighlights to 1, 1, and 0.5, respectively. The employed MLLM models include LLaMA-3-8B, MiniGPT-v2-7B, and Video-ChatGPT based on Vicuna-7B-v1.1 (Zheng et al. 2024). To reduce the randomness of results, we configure the temperatures of LLaMA-3, MiniGPT-v2, and Video-ChatGPT to 0.3, 0.2, and 0.2, respectively. The number of histogram bins η is empirically fixed to 10. The hidden dimension d of LLaMA-3 is 4096. We set the number of debiased queries N_d to 3, the counting threshold κ to 7, the number of consecutive moments τ to 5, the distance coefficient λ to 0.2, and the IoU threshold σ in NMS to 0.9."
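The hyperparameters quoted above can be collected into a single configuration for reproduction attempts. The values below are taken verbatim from the paper's setup; the dictionary structure and key names are illustrative, not from the paper.

```python
# Hyperparameters reported in the paper's experiment setup.
# The dict layout and key names are assumptions for illustration.
MOMENT_GPT_CONFIG = {
    "frame_rate": {
        "Charades-STA": 1,
        "ActivityNet-Captions": 1,
        "QVHighlights": 0.5,
    },
    "temperature": {
        "LLaMA-3": 0.3,
        "MiniGPT-v2": 0.2,
        "Video-ChatGPT": 0.2,
    },
    "n_bins": 10,               # histogram bins (eta)
    "hidden_dim": 4096,         # LLaMA-3 hidden dimension (d)
    "n_debiased_queries": 3,    # N_d
    "count_threshold": 7,       # kappa
    "consecutive_moments": 5,   # tau
    "distance_coeff": 0.2,      # lambda
    "nms_iou_threshold": 0.9,   # sigma in NMS
}
```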