GVMGen: A General Video-to-Music Generation Model with Hierarchical Attentions

Authors: Heda Zuo, Weitao You, Junxian Wu, Shihong Ren, Pei Chen, Mingxu Zhou, Yujia Lu, Lingyun Sun

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that GVMGen surpasses previous models in terms of music-video correspondence, generative diversity, and application universality. We conduct extensive experiments which show that our model can outperform state-of-the-art models significantly in terms of video-music correspondence, music diversity, and application universality.
Researcher Affiliation | Academia | 1. College of Computer Science and Technology, Zhejiang University; 2. Shanghai Key Laboratory for Music Acoustics, Shanghai Conservatory of Music; 3. School of Design and Fashion, Zhejiang University of Science and Technology. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the model architecture and processes using natural language and mathematical formulas, but it does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code & Cases: https://chouliuzuo.github.io/GVMGen/
Open Datasets | No | We have compiled a large-scale dataset comprising diverse types of video-music pairs. For collection, we sourced our dataset from free public platforms (Bilibili and YouTube). After manual filtering, the dataset is preprocessed by clipping for training.
Dataset Splits | Yes | The ratio of the training set to the test set is 0.85:0.15.
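The reported 0.85:0.15 split can be sketched as a simple shuffled partition of clip IDs. This is a hypothetical illustration, not the authors' code; the `seed` and the use of clip IDs are assumptions.

```python
import random

def split_dataset(clip_ids, train_ratio=0.85, seed=0):
    """Shuffle clip IDs and partition them into train/test sets (hypothetical sketch)."""
    ids = list(clip_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle before splitting
    cut = int(len(ids) * train_ratio)
    return ids[:cut], ids[cut:]

# e.g. 1000 clips -> 850 training, 150 test
train_ids, test_ids = split_dataset(range(1000))
```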
Hardware Specification | Yes | The training lasts for 150 epochs, taking 188 hours on a single NVIDIA A100 GPU.
Software Dependencies | No | The paper mentions several software components, such as ViT-L/14@336px, MusicGen, EnCodec, and the Adam optimizer, but it does not provide specific version numbers for these tools or for any programming languages used.
Experiment Setup | Yes | We adopt ViT-L/14@336px with 24 self-attention layers. The feature transformation module employs 16 queries, 6 self-attention layers, and 3 cross-attention layers. The temporal cross-attention uses 48 transformer layers with a hidden size of 1536, while the MusicGen decoder uses 4 codebooks of 2048 tokens. We use the Adam optimizer with a learning rate of 1e-5, weight decay of 0.01, a batch size of 6, and a video frame rate of 1 per second. A cosine learning-rate schedule with 4000 warmup steps and top-k sampling keeping the top 250 tokens are employed.
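Two of the reported settings can be made concrete with a short sketch: the cosine learning-rate schedule with 4000 warmup steps peaking at 1e-5, and top-k filtering that keeps the 250 largest logits before sampling. This is an illustrative reconstruction under assumptions, not the authors' implementation; in particular `total_steps` is hypothetical, since the paper reports 150 epochs but not a total optimizer-step count.

```python
import math

PEAK_LR = 1e-5        # reported learning rate
WARMUP_STEPS = 4000   # reported warmup length

def cosine_lr(step, total_steps):
    """Learning rate at a given step: linear warmup, then cosine decay to 0."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS  # linear warmup from 0 to PEAK_LR
    # cosine decay over the remaining steps
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

def top_k_filter(logits, k=250):
    """Mask all but the k largest logits to -inf, as done before top-k sampling."""
    if k >= len(logits):
        return list(logits)
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else float("-inf") for x in logits]
```

With a vocabulary of 2048 tokens per codebook, `top_k_filter(logits, k=250)` leaves exactly 250 candidate tokens for sampling.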