MUSE: Mamba Is Efficient Multi-scale Learner for Text-video Retrieval

Authors: Haoran Tang, Meng Cao, Jinfa Huang, Ruyang Liu, Peng Jin, Ge Li, Xiaodan Liang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive results on three popular benchmarks have validated the superiority of MUSE.
Researcher Affiliation | Academia | 1 Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University; 2 Peng Cheng Laboratory; 3 Sun Yat-sen University
Pseudocode | No | The paper describes the methodology using textual explanations and figures (Figure 2), but no dedicated pseudocode or algorithm blocks are provided.
Open Source Code | Yes | Code: https://github.com/hrtang22/MUSE
Open Datasets | Yes | To validate the effectiveness of the proposed model (MUSE), the authors test it on three benchmark datasets: MSR-VTT (Xu et al. 2016) contains 10,000 YouTube videos, each associated with 20 textual descriptions. ActivityNet (Krishna et al. 2017) comprises 20,000 untrimmed videos of complex human activities with an average duration of two minutes. DiDeMo (Anne Hendricks et al. 2017) consists of 10,464 unedited, personal videos in diverse visual settings annotated with 40,543 text descriptions.
Dataset Splits | Yes | MSR-VTT (Xu et al. 2016): the paper follows the 1k-A split (Yu, Kim, and Kim 2018), where 9,000 videos are used for training and 1,000 for testing. ActivityNet (Krishna et al. 2017): results are reported on the val1 split (10,009 training videos and 4,917 testing videos) following (Gabeur et al. 2020). DiDeMo (Anne Hendricks et al. 2017): the training and evaluation protocol follows (Luo et al. 2022).
Hardware Specification | Yes | All experiments were carried out on 8 NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions using CLIP (Radford et al. 2021) and ViT (Dosovitskiy et al. 2020) as foundational models but does not specify version numbers for any software libraries or dependencies used in the implementation.
Experiment Setup | Yes | The input frame length is set to 12, 64, and 64, and the caption token length to 32, 64, and 64 for MSR-VTT, DiDeMo, and ActivityNet, respectively. For fine-tuning, the training hyperparameters and settings of the base model are kept unchanged, and MUSE is trained with a learning rate 10 times higher (e.g., 1e-4 for CLIP4Clip and 1e-3 for MUSE). The layer number of ResMamba is set to 4, and the selected scales are {1, 3, 7, 14}.
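To make the reported scale setting {1, 3, 7, 14} concrete, the sketch below shows one plausible reading of multi-scale temporal aggregation: per-frame features are mean-pooled into 1, 3, 7, and 14 contiguous segments. The function name, the even-partition scheme, and mean pooling are illustrative assumptions, not the authors' exact implementation.

```python
# Hedged sketch of multi-scale temporal pooling over per-frame features.
# Scales {1, 3, 7, 14} come from the paper; the pooling-by-mean choice
# and even partitioning are assumptions for illustration only.

def multiscale_pool(frame_feats, scales=(1, 3, 7, 14)):
    """Pool per-frame feature vectors to each temporal scale.

    frame_feats: list of equal-length feature vectors (lists of floats).
    Returns a dict {scale: list of pooled vectors}, one vector per segment.
    """
    n = len(frame_feats)
    dim = len(frame_feats[0])
    pooled = {}
    for s in scales:
        segments = []
        for i in range(s):
            # Evenly partition the n frames into s contiguous segments.
            start, end = i * n // s, (i + 1) * n // s
            chunk = frame_feats[start:end]
            mean = [sum(v[d] for v in chunk) / len(chunk) for d in range(dim)]
            segments.append(mean)
        pooled[s] = segments
    return pooled


# Example: 14 frames with 1-D features 0..13.
feats = [[float(i)] for i in range(14)]
out = multiscale_pool(feats)
# Scale 1 yields a single global vector; scale 14 keeps every frame.
```

At scale 1 the segment covers all frames, so its mean over features 0..13 is 6.5; at scale 14 each frame becomes its own segment, preserving full temporal resolution.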