Learning Fine-Grained Representations through Textual Token Disentanglement in Composed Video Retrieval

Authors: Yue Wu, Zhaobo Qi, Yiling Wu, Junshu Sun, Yaowei Wang, Shuhui Wang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on the FineCVR-1M dataset demonstrate the superior performance of FDCA. Our code and dataset are available at: https://may2333.github.io/FineCVR/. ... We compare our method with several composed retrieval methods on the FineCVR-1M dataset. Experimental results demonstrate that our method outperforms existing methods by a clear margin. ... 5 EXPERIMENTS
Researcher Affiliation | Academia | Yue Wu1,2,4, Zhaobo Qi3, Yiling Wu2, Junshu Sun1, Yaowei Wang2,5, Shuhui Wang1,2; 1 Key Lab of Intell. Info. Process., Inst. of Comput. Tech., CAS; 2 Pengcheng Laboratory; 3 Harbin Institute of Technology, Weihai; 4 University of Chinese Academy of Sciences; 5 Harbin Institute of Technology, Shenzhen
Pseudocode | No | The paper describes the methodology using textual explanations and diagrams (Figure 2 and Figure 3) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code and dataset are available at: https://may2333.github.io/FineCVR/. ... The source code has been submitted as supplementary materials.
Open Datasets | Yes | To overcome these challenges, we first introduce FineCVR-1M, a fine-grained composed video retrieval dataset containing 1,010,071 video-text triplets with detailed textual descriptions. ... Our code and dataset are available at: https://may2333.github.io/FineCVR/. ... We construct a benchmark FineCVR-1M with 1M+ triplets. This benchmark supports the combined query with both reference videos and modification text for fine-grained video retrieval and is publicly available for download. ... We use four datasets, Action Genome (Ji et al., 2020), ActivityNet (Fabian Caba Heilbron & Niebles, 2015), HVU (Diba et al., 2020), and MSRVTT (Xu et al., 2016) as our video source.
Dataset Splits | Yes | FineCVR-1M consists of 1,000,028 triplets for training and 10,043 triplets for testing. ... We then use 606 samples as a training set and 394 as a validation set to fine-tune the open-source model LLaMA 2 (Touvron et al., 2023b).
Hardware Specification | Yes | All experiments are conducted on an NVIDIA RTX 3090 using PyTorch.
Software Dependencies | No | The paper mentions software like PyTorch, CLIP, BLIP, LLaMA 2, and ChatGPT, but does not specify version numbers for the key software dependencies used in their implementation.
Experiment Setup | Yes | For our proposed method FDCA, we utilize the frozen CLIP Res50x4 (d = 640) (Radford et al., 2021) or BLIP-large (d = 256) (Li et al., 2022) as our video encoder. The model is optimized with Adam with an initial learning rate of 1e-4. We set the batch size to 1024 to maintain the performance. To avoid overfitting, we train our FDCA for 30 epochs. The m in the negation semantic regularization term LN is 0.2, while the weight λ of the negation semantic regularization term LN is set as 5. We implement the Cross-Modality Feature Alignment (CMFA) and Cross-Modality Feature Fusion (CMFF) modules using six Transformer Encoder layers.
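The setup row above lists concrete hyperparameters, which can be collected into a minimal PyTorch configuration sketch. This is not the authors' released code: the function name `make_cross_modality_module`, the `nhead=8` attention-head count, and the decision to share one optimizer across both modules are illustrative assumptions; only the numeric values are taken from the quoted setup.

```python
import torch
import torch.nn as nn

# Hyperparameters quoted in the experiment-setup row above.
D = 640           # feature dim of the frozen CLIP Res50x4 video encoder
NUM_LAYERS = 6    # CMFA and CMFF each use six Transformer Encoder layers
LR = 1e-4         # initial Adam learning rate
BATCH_SIZE = 1024
EPOCHS = 30
MARGIN_M = 0.2    # margin m in the negation semantic regularization term L_N
LAMBDA_N = 5.0    # weight lambda of L_N in the total loss

def make_cross_modality_module(d_model: int = D,
                               num_layers: int = NUM_LAYERS,
                               nhead: int = 8) -> nn.TransformerEncoder:
    """Stack of Transformer encoder layers, as described for CMFA/CMFF.

    nhead=8 is an assumption; the paper does not state the head count.
    """
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                       batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

cmfa = make_cross_modality_module()  # Cross-Modality Feature Alignment
cmff = make_cross_modality_module()  # Cross-Modality Feature Fusion

# A single Adam optimizer over both trainable modules (the video encoder
# is frozen per the paper, so it contributes no parameters here).
optimizer = torch.optim.Adam(
    list(cmfa.parameters()) + list(cmff.parameters()), lr=LR
)
```

A smoke test such as `cmfa(torch.randn(2, 10, D))` should return a tensor of the same shape, confirming the module dimensions are consistent.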