MUSE: Mamba Is Efficient Multi-scale Learner for Text-video Retrieval

Authors: Haoran Tang, Meng Cao, Jinfa Huang, Ruyang Liu, Peng Jin, Ge Li, Xiaodan Liang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive results on three popular benchmarks have validated the superiority of MUSE.
Researcher Affiliation | Academia | 1 Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University; 2 Peng Cheng Laboratory; 3 Sun Yat-sen University
Pseudocode | No | The paper describes the methodology using textual explanations and figures (Figure 2), but no dedicated pseudocode or algorithm blocks are provided.
Open Source Code | Yes | Code: https://github.com/hrtang22/MUSE
Open Datasets | Yes | To validate the effectiveness of the proposed model (MUSE), the authors test it on three benchmark datasets: MSR-VTT (Xu et al. 2016) contains 10,000 YouTube videos, each associated with 20 textual descriptions. ActivityNet (Krishna et al. 2017) comprises 20,000 untrimmed videos of complex human activities with an average duration of two minutes. DiDeMo (Anne Hendricks et al. 2017) consists of 10,464 unedited, personal videos in diverse visual settings annotated with 40,543 text descriptions.
Dataset Splits | Yes | MSR-VTT (Xu et al. 2016): the paper follows the 1k-A split (Yu, Kim, and Kim 2018), where 9,000 videos are used for training and 1,000 for testing. ActivityNet (Krishna et al. 2017): results are reported on the val1 split (10,009 training videos and 4,917 testing videos) following (Gabeur et al. 2020). DiDeMo (Anne Hendricks et al. 2017): the training and evaluation protocol follows (Luo et al. 2022).
Hardware Specification | Yes | All experiments were carried out on 8 NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions using CLIP (Radford et al. 2021) and ViT (Dosovitskiy et al. 2020) as foundational models but does not specify version numbers for any software libraries or dependencies used in the implementation.
Experiment Setup | Yes | The input frame length is set to 12, 64, and 64, and the caption token length to 32, 64, and 64 for MSR-VTT, DiDeMo, and ActivityNet, respectively. For fine-tuning, the training hyperparameters and settings of the base model are kept unchanged, and MUSE is trained with a learning rate 10 times higher (e.g., 1e-4 for CLIP4Clip and 1e-3 for MUSE). The layer number of ResMamba is set to 4, and the selected scales are {1, 3, 7, 14}.
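To make the reported scale setting {1, 3, 7, 14} concrete, the sketch below shows one plausible reading of multi-scale temporal aggregation: per-frame features are mean-pooled into 1, 3, 7, and 14 contiguous segments. The function name, the even-partition scheme, and mean pooling are illustrative assumptions, not the authors' exact implementation.

```python
# Hedged sketch of multi-scale temporal pooling over per-frame features.
# Scales {1, 3, 7, 14} come from the paper; the pooling-by-mean choice
# and even partitioning are assumptions for illustration only.

def multiscale_pool(frame_feats, scales=(1, 3, 7, 14)):
    """Pool per-frame feature vectors to each temporal scale.

    frame_feats: list of equal-length feature vectors (lists of floats).
    Returns a dict {scale: list of pooled vectors}, one vector per segment.
    """
    n = len(frame_feats)
    dim = len(frame_feats[0])
    pooled = {}
    for s in scales:
        segments = []
        for i in range(s):
            # Evenly partition the n frames into s contiguous segments.
            start, end = i * n // s, (i + 1) * n // s
            chunk = frame_feats[start:end]
            mean = [sum(v[d] for v in chunk) / len(chunk) for d in range(dim)]
            segments.append(mean)
        pooled[s] = segments
    return pooled


# Example: 14 frames with 1-D features 0..13.
feats = [[float(i)] for i in range(14)]
out = multiscale_pool(feats)
# Scale 1 yields a single global vector; scale 14 keeps every frame.
```

At scale 1 the segment covers all frames, so its mean over features 0..13 is 6.5; at scale 14 each frame becomes its own segment, preserving full temporal resolution.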