Generative Video Diffusion for Unseen Novel Semantic Video Moment Retrieval
Authors: Dezhao Luo, Shaogang Gong, Jiabo Huang, Hailin Jin, Yang Liu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate the effectiveness of our Fine-grained Video Editing framework (FVE), we validate on both video moment retrieval and video action editing tasks. ... Experiments on three datasets demonstrate the effectiveness of FVE to unseen novel semantic video moment retrieval tasks. |
| Researcher Affiliation | Collaboration | 1Queen Mary University of London 2Sony AI 3Adobe Research 4WICT, Peking University 5State Key Laboratory of General Artificial Intelligence, Peking University EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes mathematical formulations and processes using equations and structured text, but it does not contain a clearly labeled pseudocode block or algorithm block. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code, nor does it include a direct link to a code repository for the methodology described. The text "Symbol indicates our implementation with the author-released code." refers to external methods, not the authors' own FVE code. |
| Open Datasets | Yes | To assess FVE for novel semantic VMR, we employed the novel-word split (Li et al. 2022) on Charades-STA (Gao et al. 2017). For QVHighlights (Lei, Berg, and Bansal 2021) and TACoS (Regneri et al. 2013), we sample sentences from the standard training split and exclude them from the training set. |
| Dataset Splits | Yes | To assess FVE for novel semantic VMR, we employed the novel-word split (Li et al. 2022) on Charades-STA (Gao et al. 2017). For QVHighlights (Lei, Berg, and Bansal 2021) and TACoS (Regneri et al. 2013), we sample sentences from the standard training split and exclude them from the training set. In our implementation, we selected 50/300/300 sentences separately from each dataset for data generation. |
| Hardware Specification | No | The acknowledgements section mentions "Queen Mary University of London's Apocrita HPC facility from QMUL RESEARCHIT" as support, but it does not specify any particular GPU models, CPU models, or other detailed hardware specifications used for experiments. |
| Software Dependencies | No | The paper mentions using tools like CLIP (Radford et al. 2021), DINO (Caron et al. 2021), and the Dreambooth strategy (Ruiz et al. 2023), but it does not provide specific version numbers for any software dependencies, libraries, or programming languages used in their implementation. |
| Experiment Setup | Yes | For hybrid selection, we used CLIP (Radford et al. 2021) to compute the cross-modal relevance score and DINO (Caron et al. 2021) for the uni-modal structure score. We set k to 500, 1500 and 1500 respectively for the three datasets. For the model performance disparity metric, we set l to be 100, 500 and 500 respectively for each dataset. ... We observe the best combination is k=500 and l=100. |
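The hybrid selection described in the Experiment Setup row (a CLIP-based cross-modal relevance score combined with a DINO-based uni-modal structure score, followed by top-k selection) can be sketched as below. This is a minimal illustration, not the authors' released implementation: the embeddings are placeholders for CLIP/DINO features, and the weighted-sum combination (`alpha`) and the `hybrid_select` function name are assumptions, since the paper as quoted does not specify how the two scores are fused.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def hybrid_select(text_emb, video_embs, ref_emb, k, alpha=0.5):
    """Rank candidate generated videos by a hybrid score and keep the top-k.

    text_emb  : query-text embedding (stand-in for a CLIP text feature)
    video_embs: list of candidate-video embeddings
    ref_emb   : reference-video embedding (stand-in for a DINO feature)
    alpha     : fusion weight between the two scores (assumed; not in the paper)
    """
    scores = []
    for v in video_embs:
        relevance = cosine(text_emb, v)   # cross-modal relevance (CLIP-style)
        structure = cosine(ref_emb, v)    # uni-modal structure score (DINO-style)
        scores.append(alpha * relevance + (1 - alpha) * structure)
    order = np.argsort(scores)[::-1]      # highest hybrid score first
    return list(order[:k])

# Toy usage: candidate 0 matches both the text and the reference exactly,
# so it should rank first; k=2 then also admits the partial match.
text = np.array([1.0, 0.0])
ref = np.array([1.0, 0.0])
candidates = [np.array([1.0, 0.0]),   # strong match
              np.array([0.0, 1.0]),   # no match
              np.array([0.5, 0.5])]   # partial match
top2 = hybrid_select(text, candidates, ref, k=2)
```

In the paper's setting k is set per dataset (500, 1500, 1500), i.e. selection is over a large pool of generated samples rather than the three toy candidates shown here.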