Generative Video Diffusion for Unseen Novel Semantic Video Moment Retrieval

Authors: Dezhao Luo, Shaogang Gong, Jiabo Huang, Hailin Jin, Yang Liu

AAAI 2025

Reproducibility assessment — each item below gives the variable, the result, and the LLM response:
Research Type: Experimental
"To evaluate the effectiveness of our Fine-grained Video Editing framework (FVE), we validate on both video moment retrieval and video action editing tasks. ... Experiments on three datasets demonstrate the effectiveness of FVE to unseen novel semantic video moment retrieval tasks."
Researcher Affiliation: Collaboration
(1) Queen Mary University of London; (2) Sony AI; (3) Adobe Research; (4) WICT, Peking University; (5) State Key Laboratory of General Artificial Intelligence, Peking University. EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode: No
The paper describes its mathematical formulations and processes using equations and structured text, but it does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code: No
The paper neither states that its own source code is released nor links to a repository for the described methodology. The sentence "Symbol indicates our implementation with the author-released code." refers to external baseline methods, not to the authors' own FVE code.
Open Datasets: Yes
"To assess FVE for novel semantic VMR, we employed the novel-word split (Li et al. 2022) on Charades-STA (Gao et al. 2017). For QVHighlights (Lei, Berg, and Bansal 2021) and TACoS (Regneri et al. 2013), we sample sentences from the standard training split and exclude them from the training set."
Dataset Splits: Yes
"To assess FVE for novel semantic VMR, we employed the novel-word split (Li et al. 2022) on Charades-STA (Gao et al. 2017). For QVHighlights (Lei, Berg, and Bansal 2021) and TACoS (Regneri et al. 2013), we sample sentences from the standard training split and exclude them from the training set. In our implementation, we selected 50/300/300 sentences separately from each dataset for data generation."
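Since the authors' code is not released, the held-out sampling the quote describes can only be sketched. The following minimal sketch draws a fixed number of sentences from a training split for data generation and removes them from the remaining training set; the function name, seed, and use of `random.sample` are assumptions, not the paper's implementation.

```python
import random

def split_for_generation(train_sentences, n_select, seed=0):
    """Hold out n_select sentences for data generation and drop them
    from the training set. Illustrative only; the paper does not
    specify its sampling procedure beyond the counts (50/300/300)."""
    rng = random.Random(seed)
    selected = rng.sample(train_sentences, n_select)
    held_out = set(selected)
    remaining = [s for s in train_sentences if s not in held_out]
    return selected, remaining

# Example with dummy sentences; 50 corresponds to the Charades-STA count,
# 300 would be used for QVHighlights and TACoS.
sents = [f"sentence {i}" for i in range(1000)]
gen, train = split_for_generation(sents, 50)
print(len(gen), len(train))
```

Because the selected sentences are excluded from training, any model evaluated on them is tested on semantics it never saw, which is the "unseen novel semantic" condition the paper targets.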
Hardware Specification: No
The acknowledgements mention support from "Queen Mary University of London's Apocrita HPC facility from QMUL RESEARCHIT", but the paper does not specify any GPU models, CPU models, or other hardware details used for the experiments.
Software Dependencies: No
The paper mentions tools such as CLIP (Radford et al. 2021), DINO (Caron et al. 2021), and the DreamBooth strategy (Ruiz et al. 2023), but it provides no version numbers for any software dependency, library, or programming language used in the implementation.
Experiment Setup: Yes
"For hybrid selection, we used CLIP (Radford et al. 2021) to compute the cross-modal relevance score and DINO (Caron et al. 2021) for the uni-modal structure score. We set k to 500, 1500 and 1500 respectively for the three datasets. For the model performance disparity metric, we set l to be 100, 500 and 500 respectively for each dataset. ... We observe the best combination is k=500 and l=100."
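The quoted setup combines a CLIP cross-modal relevance score with a DINO uni-modal structure score and keeps the top-k candidates. A minimal sketch of such a hybrid top-k selection is below; it stands in for CLIP/DINO with precomputed embedding arrays, and the cosine-similarity scoring and equal weighting of the two scores are assumptions, since the paper's exact combination rule is not reproduced here.

```python
import numpy as np

def hybrid_select(text_emb, video_embs, dino_embs, ref_emb, k=500):
    """Rank candidates by a cross-modal relevance score (text vs. video
    embeddings, CLIP-style) plus a uni-modal structure score (candidate
    vs. reference embeddings, DINO-style); return the top-k indices.
    Equal weighting of the two scores is an assumption."""
    eps = 1e-8
    # Cross-modal relevance: cosine similarity between text and each video.
    rel = (video_embs @ text_emb) / (
        np.linalg.norm(video_embs, axis=1) * np.linalg.norm(text_emb) + eps)
    # Uni-modal structure: cosine similarity to a reference clip embedding.
    struct = (dino_embs @ ref_emb) / (
        np.linalg.norm(dino_embs, axis=1) * np.linalg.norm(ref_emb) + eps)
    score = rel + struct
    return np.argsort(-score)[:k]

# Dummy embeddings standing in for CLIP (dim 64) and DINO (dim 32) features.
rng = np.random.default_rng(0)
idx = hybrid_select(rng.normal(size=64),
                    rng.normal(size=(1000, 64)),
                    rng.normal(size=(1000, 32)),
                    rng.normal(size=32),
                    k=5)
print(idx)
```

In the paper's setting, k would be 500, 1500, or 1500 depending on the dataset; l, the size used for the model performance disparity metric, would be a further sub-selection from these top-k candidates.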