Learning Fine-Grained Representations through Textual Token Disentanglement in Composed Video Retrieval
Authors: Yue Wu, Zhaobo Qi, Yiling Wu, Junshu Sun, Yaowei Wang, Shuhui Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on FineCVR-1M dataset demonstrate the superior performance of FDCA. Our code and dataset are available at: https://may2333.github.io/FineCVR/. ... We compare our method with several composed retrieval methods on the FineCVR-1M dataset. Experimental results demonstrate that our method outperforms existing methods by a clear margin. ... 5 EXPERIMENTS |
| Researcher Affiliation | Academia | Yue Wu1,2,4, Zhaobo Qi3, Yiling Wu2, Junshu Sun1, Yaowei Wang2,5, Shuhui Wang1,2 1 Key Lab of Intell. Info. Process., Inst. of Comput. Tech., CAS 2 Pengcheng Laboratory 3 Harbin Institute of Technology, Weihai 4 University of Chinese Academy of Sciences 5 Harbin Institute of Technology, Shenzhen |
| Pseudocode | No | The paper describes the methodology using textual explanations and diagrams (Figure 2 and Figure 3) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and dataset are available at: https://may2333.github.io/FineCVR/. ... The source code has been submitted as supplementary materials. |
| Open Datasets | Yes | To overcome these challenges, we first introduce FineCVR-1M, a fine-grained composed video retrieval dataset containing 1,010,071 video-text triplets with detailed textual descriptions. ... Our code and dataset are available at: https://may2333.github.io/FineCVR/. ... We construct a benchmark FineCVR-1M with 1M+ triplets. This benchmark supports the combined query with both reference videos and modification text for fine-grained video retrieval and is publicly available for download. ... We use four datasets, Action Genome (Ji et al., 2020), ActivityNet (Fabian Caba Heilbron & Niebles, 2015), HVU (Diba et al., 2020), and MSRVTT (Xu et al., 2016) as our video source. |
| Dataset Splits | Yes | FineCVR-1M consists of 1,000,028 triplets for training and 10,043 triplets for testing. ... We then use 606 samples as a training set and 394 as a validation set to fine-tune the open-source model LLaMA 2 (Touvron et al., 2023b). |
| Hardware Specification | Yes | All experiments are conducted on the NVIDIA RTX 3090 using PyTorch. |
| Software Dependencies | No | The paper mentions software like PyTorch, CLIP, BLIP, LLaMA 2, and ChatGPT, but does not specify version numbers for the key software dependencies used in their implementation. |
| Experiment Setup | Yes | For our proposed method FDCA, we utilize the frozen CLIP Res50x4 (d = 640) (Radford et al., 2021) or BLIP large (d = 256) (Li et al., 2022) as our video encoder. The model is optimized with Adam with an initial learning rate of 1e-4. We set the batch size to 1024 to maintain the performance. To avoid overfitting, we train our FDCA for 30 epochs. The m in the negation semantic regularization term LN is 0.2, while the weight λ of the negation semantic regularization term LN is set as 5. We implement the Cross-Modality Feature Alignment (CMFA) and Cross-Modality Feature Fusion (CMFF) modules using six Transformer Encoder layers. |
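The reported setup (frozen CLIP Res50x4 encoder with d = 640, Adam at lr 1e-4, batch size 1024, 30 epochs, margin m = 0.2 and weight λ = 5 for the negation regularizer, six Transformer encoder layers for CMFA/CMFF) can be collected into a minimal configuration sketch. This is not the authors' code: the module construction and the hinge form of the negation term L_N are assumptions for illustration; only the hyperparameter values come from the paper.

```python
import torch
import torch.nn as nn

# Hyperparameters as reported in the Experiment Setup row; module names
# and the exact form of L_N are hypothetical.
D = 640          # CLIP Res50x4 feature dimension
NUM_LAYERS = 6   # CMFA / CMFF each use six Transformer encoder layers
LR = 1e-4        # initial learning rate for Adam
MARGIN_M = 0.2   # margin m in the negation semantic regularization L_N
LAMBDA_N = 5.0   # weight lambda of the L_N term in the total loss

def make_fusion_module(d: int = D, num_layers: int = NUM_LAYERS) -> nn.TransformerEncoder:
    """Six-layer Transformer encoder, as stated for the CMFA/CMFF modules."""
    layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

def negation_margin_loss(pos_sim: torch.Tensor, neg_sim: torch.Tensor,
                         m: float = MARGIN_M) -> torch.Tensor:
    """Assumed hinge form of L_N: penalize negated-semantic similarity
    that comes within margin m of the positive similarity."""
    return torch.clamp(neg_sim - pos_sim + m, min=0.0).mean()

cmff = make_fusion_module()
optimizer = torch.optim.Adam(cmff.parameters(), lr=LR)
# Per the paper, the full objective would combine a retrieval loss with
# the weighted regularizer: loss = retrieval_loss + LAMBDA_N * L_N.
```

The batch size of 1024 and 30-epoch schedule would then be applied in the training loop; with a frozen video encoder, only the fusion/alignment modules receive gradients, which is consistent with training on a single RTX 3090.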