MUSE: Mamba Is Efficient Multi-scale Learner for Text-video Retrieval
Authors: Haoran Tang, Meng Cao, Jinfa Huang, Ruyang Liu, Peng Jin, Ge Li, Xiaodan Liang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive results on three popular benchmarks have validated the superiority of MUSE. |
| Researcher Affiliation | Academia | (1) Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University; (2) Peng Cheng Laboratory; (3) Sun Yat-sen University |
| Pseudocode | No | The paper describes the methodology using textual explanations and figures (Figure 2), but no dedicated pseudocode or algorithm blocks are provided. |
| Open Source Code | Yes | Code https://github.com/hrtang22/MUSE |
| Open Datasets | Yes | To validate the effectiveness of our proposed model (MUSE), we test our model on three benchmarked datasets: MSR-VTT (Xu et al. 2016) contains 10,000 YouTube videos, and each video is associated with 20 textual descriptions. ActivityNet (Krishna et al. 2017) comprises 20,000 untrimmed videos of complex human activities with an average duration of two minutes. DiDeMo (Anne Hendricks et al. 2017) consists of 10,464 unedited, personal videos in diverse visual settings annotated with 40,543 text descriptions. |
| Dataset Splits | Yes | MSR-VTT (Xu et al. 2016) ... We follow the 1k-A split (Yu, Kim, and Kim 2018) where 9,000 videos are used for training and 1,000 videos for testing. ActivityNet (Krishna et al. 2017) ... We report results on the val1 split (including 10,009 training videos and 4,917 testing videos) following (Gabeur et al. 2020). DiDeMo (Anne Hendricks et al. 2017) ... We follow the training and evaluation protocol in (Luo et al. 2022). |
| Hardware Specification | Yes | All experiments were carried out on 8 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions using CLIP (Radford et al. 2021) and ViT (Dosovitskiy et al. 2020) as foundational models but does not specify version numbers for any software libraries or dependencies used in the implementation. |
| Experiment Setup | Yes | We set the input frame length to 12, 64, and 64 and the caption token length to 32, 64, and 64 for MSR-VTT, DiDeMo, and ActivityNet, respectively. For fine-tuning, we keep the training hyperparameters and settings of the base model unchanged and train MUSE with a learning rate 10 times higher (e.g., 1e-4 for CLIP4Clip and 1e-3 for MUSE). The layer number of ResMamba is set to 4, and the selected scales are {1, 3, 7, 14}. |
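The hyperparameters reported in the Experiment Setup row can be collected into a small configuration sketch. The class and field names below are my own invention for illustration; only the numeric values come from the paper.

```python
from dataclasses import dataclass

@dataclass
class MuseConfig:
    """Hypothetical per-dataset training config mirroring the reported setup."""
    num_frames: int          # input frame length per video
    caption_tokens: int      # max caption token length
    base_lr: float = 1e-4    # base-model (CLIP4Clip) fine-tuning LR, per the paper
    muse_lr: float = 1e-3    # MUSE modules trained at 10x the base LR
    res_mamba_layers: int = 4            # ResMamba layer count
    scales: tuple = (1, 3, 7, 14)        # selected multi-scale set

# Per-dataset values as reported: frames 12/64/64, caption tokens 32/64/64
CONFIGS = {
    "MSR-VTT":     MuseConfig(num_frames=12, caption_tokens=32),
    "DiDeMo":      MuseConfig(num_frames=64, caption_tokens=64),
    "ActivityNet": MuseConfig(num_frames=64, caption_tokens=64),
}
```

This is only a convenient restatement of the table's numbers, not the authors' actual training script; the released code at https://github.com/hrtang22/MUSE is the authoritative source.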