Tri-Ergon: Fine-Grained Video-to-Audio Generation with Multi-Modal Conditions and LUFS Control

Authors: Bingliang Li, Fengyu Yang, Yuxin Mao, Qingwen Ye, Hongkai Chen, Yiran Zhong

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our framework outperforms existing state-of-the-art methods, as demonstrated by both qualitative and quantitative results." Supporting sections: Experiments; Model Configuration and Architecture Details; Evaluation Metrics; Video-to-Audio Generation Results; Ablation Study.
Researcher Affiliation | Collaboration | Bingliang Li (1,2*), Fengyu Yang (2*), Yuxin Mao (3), Qingwen Ye (1), Hongkai Chen (1), Yiran Zhong (4). Affiliations: 1 vivo Mobile Communication Co., Ltd; 2 The Chinese University of Hong Kong, Shenzhen; 3 Northwestern Polytechnical University; 4 OpenNLPLab.
Pseudocode | No | The paper describes the methodology using mathematical formulations and textual descriptions, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Project website: https://tri-ergon.github.io/Tri-Ergon/
Open Datasets | Yes | "We introduce MM-V2A, a comprehensive V2A dataset with high-fidelity, long-duration, open-vocabulary multimodal labeling for training our proposed Tri-Ergon." ... "The Tri-Ergon-S model is trained and evaluated exclusively on the VGGSound dataset (Chen et al. 2020)." MM-V2A aggregates 69,749 videos from VIDAL-10M, 44,637 videos from HD-VILA-100M, 128,079 videos from AudioSet, and the complete sets of VGGSound, VALOR-32K, BBC Sound Effects, and FAVDBench. For the VAE model, the authors collect several datasets across domains to achieve optimal reconstruction results: general audio (AudioSet, FSD50K, FAVDBench, UrbanSound8K (Salamon, Jacoby, and Bello 2014), VGGSound, VIDAL-10M, HD-VILA-100M), sound effects (BBC Sound Effects Library), music (MTG-Jamendo (Bogdanov et al. 2019)), and human voice (Common Voice Corpus 1, English).
Dataset Splits | No | The paper mentions training on various datasets and evaluating on the VGGSound test set, but does not provide specific details on how these datasets were split into training, validation, or test sets (e.g., percentages, sample counts, or predefined splits) for reproduction.
Hardware Specification | Yes | "We first train the audio VAE using automatic mixed precision for 1M steps with an effective batch size of 128 on 8 A100 GPUs." ... "The training process was conducted on 32 A100 GPUs, with an effective batch size of 256." ... "Both modules are trained on 8 A100 GPUs with a batch size of 128."
Software Dependencies | No | The paper describes various models and frameworks used (e.g., DDPM, DiT, DINO-v2, VAE), but does not provide specific version numbers for any software libraries, programming languages, or other dependencies required for reproduction.
Experiment Setup | Yes | "We first train the audio VAE using automatic mixed precision for 1M steps with an effective batch size of 128 on 8 A100 GPUs." The losses are weighted as follows: 1.0 for spectral losses, 0.1 for adversarial losses, 5.0 for the feature-matching loss, and 1×10⁻¹ for the KL loss. For the Tri-Ergon-S model, a DiT architecture comprising 20 DiT blocks serves as the diffusion backbone, with 24 attention heads, an embedding dimension of 1536, and a total of 881M parameters; this model is trained for 7.2×10⁵ steps. Tri-Ergon-L uses 24 DiT blocks, contains 1.1B parameters, and is trained for 2.2×10⁵ steps. Both models are trained using the v-objective (Salimans and Ho 2022), with a cosine noise schedule and continuous denoising timesteps, on 32 A100 GPUs with an effective batch size of 256. For the conditioning set E_m, a 30% probability of independently dropping each of the embeddings e_L, e_A, and e_V is applied; additionally, a 10% dropout rate is applied to the overall conditioning set E to enable classifier-free guidance. The T_LUFS-S module is based on a transformer-encoder architecture with 12 layers, 8 attention heads, and an embedding dimension of 768, while T_LUFS-L features a more complex architecture with 14 layers and 12 attention heads at the same embedding dimension of 768. T_LUFS-S is trained on the VGGSound dataset for 10 epochs, whereas T_LUFS-L is trained on the proposed MM-V2A dataset for 30 epochs; both modules are trained on 8 A100 GPUs with a batch size of 128. The model achieves optimal performance with a CFG scale ω set to 7 and an inference step count of 100.
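The conditioning-dropout scheme quoted in the setup (a 30% independent drop of each of e_L, e_A, and e_V, plus a 10% drop of the entire conditioning set E for classifier-free guidance) can be sketched in plain Python. This is a minimal illustration of the stated probabilities only; the function name, the list-based embedding representation, and zeroing as the "drop" operation are assumptions, not the authors' implementation:

```python
import random

def drop_conditions(e_L, e_A, e_V, p_indep=0.3, p_all=0.1, rng=random):
    """Sketch of the conditioning dropout described in the setup.

    With probability p_all, the entire conditioning set E is dropped
    (every embedding zeroed) to enable classifier-free guidance.
    Otherwise, each embedding is independently dropped with
    probability p_indep. Names and defaults follow the quoted text.
    """
    embeddings = [e_L, e_A, e_V]
    if rng.random() < p_all:
        # Fully unconditional sample: zero out the whole set E.
        return [[0.0] * len(e) for e in embeddings]
    # Independent per-modality dropout within E_m.
    return [[0.0] * len(e) if rng.random() < p_indep else e
            for e in embeddings]
```

At inference, the unconditional branch trained this way is what the CFG scale ω = 7 interpolates against.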