Zero-shot Video Moment Retrieval via Off-the-shelf Multimodal Large Language Models
Authors: Yifang Xu, Yunzhuo Sun, Benxiang Zhai, Ming Li, Wenxin Liang, Yang Li, Sidan Du
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results demonstrate that our method outperforms the SOTA MLLM-based and zero-shot models on several public datasets, including QVHighlights, ActivityNet-Captions, and Charades-STA. |
| Researcher Affiliation | Academia | Yifang Xu¹, Yunzhuo Sun², Benxiang Zhai¹, Ming Li¹, Wenxin Liang², Yang Li¹, Sidan Du¹ — ¹Nanjing University, ²Dalian University of Technology. EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes steps for query debiasing, span generation, and span selection in paragraph text, for example in Section 3.3 describing the span generator: 'Firstly, we compute the inverse cumulative histogram of S^f_i, with η bins. We then traverse these bins in reverse order to find the first bin containing at least κ moments, using its left endpoint value as the adaptive threshold γ. Next, we iterate through S^f_i in temporal order. If S^f_{i,j} exceeds γ, the corresponding moment is marked as the starting moment. When the similarities of τ consecutive moments all fall below γ, we mark the final moment with a similarity exceeding γ as the ending moment. Finally, we repeat the above process to generate a set of candidate spans T^p from S^f.' |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for Moment-GPT, nor does it provide a direct link to a code repository for its methodology. It mentions 'github' in the context of MiniGPT-v2 but not for the current work. |
| Open Datasets | Yes | To evaluate our proposed method, we conduct experiments on three datasets with different topics: QVHighlights (Lei et al. 2021), Charades-STA (Gao et al. 2017), ActivityNet-Captions (Krishna et al. 2017). |
| Dataset Splits | Yes | We conduct experiments on three datasets with different topics: QVHighlights (Lei et al. 2021), Charades-STA (Gao et al. 2017), ActivityNet-Captions (Krishna et al. 2017). Table 1 presents performance metrics for QVHighlights on both 'test' and 'val' sets, indicating the use of predefined splits for this benchmark dataset. |
| Hardware Specification | Yes | All experiments are conducted on 1 NVIDIA A100 GPU. |
| Software Dependencies | No | The paper mentions several MLLM models used (LLaMA-3-8B, MiniGPT-v2-7B, Video-ChatGPT based on Vicuna-7B-v1.1) but does not provide specific ancillary software details such as programming language versions, library versions, or specific solver versions. |
| Experiment Setup | Yes | Following previous works (Huang et al. 2023a; Lei et al. 2021), we set the frame rates of videos from Charades-STA, ActivityNet-Captions, and QVHighlights to 1, 1, and 0.5, respectively. The employed MLLM models include LLaMA-3-8B, MiniGPT-v2-7B, and Video-ChatGPT based on Vicuna-7B-v1.1 (Zheng et al. 2024). To reduce the randomness of results, we configure the temperatures of LLaMA-3, MiniGPT-v2, and Video-ChatGPT to 0.3, 0.2, and 0.2, respectively. The number of histogram bins η is empirically fixed to 10. The hidden dimension d of LLaMA-3 is 4096. We set the number of debiased queries Nd to 3, the counting threshold κ to 7, the number of consecutive moments τ to 5, the distance coefficient λ to 0.2, and the IoU threshold σ in NMS to 0.9. |
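The span-generation procedure quoted in the Pseudocode row (adaptive threshold γ from an inverse cumulative histogram, then scanning moments in temporal order) can be sketched as follows. This is a minimal reading of the quoted text, not the authors' released code: the function name `generate_spans` and all variable names are ours, and ties in the histogram binning may differ from the paper's exact implementation.

```python
import numpy as np

def generate_spans(sims, n_bins=10, kappa=7, tau=5):
    """Sketch of the quoted Section 3.3 span generator.

    sims: per-moment similarity scores S^f_i (one value per moment).
    Returns a list of (start_idx, end_idx) candidate spans.
    """
    sims = np.asarray(sims, dtype=float)
    # Histogram of similarities; inverse cumulative counts from the top bin down,
    # so inv_cum[b] = number of moments with similarity >= left edge of bin b.
    counts, edges = np.histogram(sims, bins=n_bins)
    inv_cum = np.cumsum(counts[::-1])[::-1]
    # Traverse bins in reverse order; the first bin whose tail count reaches
    # kappa gives the adaptive threshold gamma (its left endpoint).
    gamma = edges[0]
    for b in range(n_bins - 1, -1, -1):
        if inv_cum[b] >= kappa:
            gamma = edges[b]
            break
    # Scan moments in temporal order: open a span when similarity exceeds gamma,
    # close it once tau consecutive moments fall below gamma.
    spans, start, below, last_above = [], None, 0, None
    for j, s in enumerate(sims):
        if s > gamma:
            if start is None:
                start = j
            last_above, below = j, 0
        elif start is not None:
            below += 1
            if below >= tau:
                spans.append((start, last_above))
                start, below, last_above = None, 0, None
    if start is not None:
        spans.append((start, last_above))
    return spans
```

With the paper's defaults (η = 10, κ = 7, τ = 5), a score sequence with two high-similarity runs separated by a long low run yields two candidate spans.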
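The setup row fixes the NMS IoU threshold σ at 0.9 for span selection but the paper's selection code is not released. A standard temporal (1-D) NMS over candidate spans, which is presumably what is meant, looks like the sketch below; the function name `nms_1d` and its signature are our assumption.

```python
def nms_1d(spans, scores, iou_thr=0.9):
    """Temporal non-maximum suppression over (start, end) spans.

    Keeps spans in descending score order, dropping any span whose
    temporal IoU with an already-kept span exceeds iou_thr.
    """
    order = sorted(range(len(spans)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        s_i, e_i = spans[i]
        suppressed = False
        for j in keep:
            s_j, e_j = spans[j]
            inter = max(0.0, min(e_i, e_j) - max(s_i, s_j))
            union = (e_i - s_i) + (e_j - s_j) - inter
            if union > 0 and inter / union > iou_thr:
                suppressed = True
                break
        if not suppressed:
            keep.append(i)
    return [spans[i] for i in keep]
```

At σ = 0.9 the suppression is very mild: only near-duplicate spans are removed, so most distinct candidates survive to the selection stage.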