GVMGen: A General Video-to-Music Generation Model with Hierarchical Attentions

Authors: Heda Zuo, Weitao You, Junxian Wu, Shihong Ren, Pei Chen, Mingxu Zhou, Yujia Lu, Lingyun Sun

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that GVMGen surpasses previous models in terms of music-video correspondence, generative diversity, and application universality. We conduct extensive experiments which show that our model can outperform state-of-the-art models significantly in terms of video-music correspondence, music diversity, and application universality.
Researcher Affiliation | Academia | 1. College of Computer Science and Technology, Zhejiang University; 2. Shanghai Key Laboratory for Music Acoustics, Shanghai Conservatory of Music; 3. School of Design and Fashion, Zhejiang University of Science and Technology. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the model architecture and processes using natural language and mathematical formulas, but it does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code & Cases: https://chouliuzuo.github.io/GVMGen/
Open Datasets | No | We have compiled a large-scale dataset comprising diverse types of video-music pairs. For collection, we sourced our dataset from free public platforms (Bilibili and YouTube). After manual filtering, the dataset is preprocessed by clipping for training.
Dataset Splits | Yes | The ratio of the training set to the test set is 0.85:0.15.
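The reported 0.85:0.15 split can be sketched as a simple shuffled partition of clip IDs. This is a hypothetical illustration, not the authors' code; the `seed` and the use of clip IDs are assumptions.

```python
import random

def split_dataset(clip_ids, train_ratio=0.85, seed=0):
    """Shuffle clip IDs and partition them into train/test sets (hypothetical sketch)."""
    ids = list(clip_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle before splitting
    cut = int(len(ids) * train_ratio)
    return ids[:cut], ids[cut:]

# e.g. 1000 clips -> 850 training, 150 test
train_ids, test_ids = split_dataset(range(1000))
```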
Hardware Specification | Yes | The training lasts for 150 epochs, taking 188 hours on a single NVIDIA A100 GPU.
Software Dependencies | No | The paper mentions several software components, such as ViT-L/14@336px, MusicGen, EnCodec, and the Adam optimizer, but it does not provide specific version numbers for these tools or for any programming languages used.
Experiment Setup | Yes | We adopt ViT-L/14@336px with 24 self-attention layers. The feature transformation module employs 16 queries, 6 self-attention layers, and 3 cross-attention layers. The temporal cross-attention uses 48 transformer layers with a hidden size of 1536, while the MusicGen decoder uses 4 codebooks of 2048 tokens. We use the Adam optimizer with a learning rate of 1e-5, weight decay of 0.01, a batch size of 6, and a video frame rate of 1 per second. A cosine learning-rate schedule with 4000 warmup steps and top-k sampling keeping the top 250 tokens are employed.
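Two of the reported settings can be made concrete with a short sketch: the cosine learning-rate schedule with 4000 warmup steps peaking at 1e-5, and top-k filtering that keeps the 250 largest logits before sampling. This is an illustrative reconstruction under assumptions, not the authors' implementation; in particular `total_steps` is hypothetical, since the paper reports 150 epochs but not a total optimizer-step count.

```python
import math

PEAK_LR = 1e-5        # reported learning rate
WARMUP_STEPS = 4000   # reported warmup length

def cosine_lr(step, total_steps):
    """Learning rate at a given step: linear warmup, then cosine decay to 0."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS  # linear warmup from 0 to PEAK_LR
    # cosine decay over the remaining steps
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

def top_k_filter(logits, k=250):
    """Mask all but the k largest logits to -inf, as done before top-k sampling."""
    if k >= len(logits):
        return list(logits)
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else float("-inf") for x in logits]
```

With a vocabulary of 2048 tokens per codebook, `top_k_filter(logits, k=250)` leaves exactly 250 candidate tokens for sampling.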