Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation

Authors: Feizhen Huang, Yu Wu, Yutian Lin, Bo Du

IJCAI 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our method not only achieves impressive improvements under partial visibility across all evaluation metrics, but also enhances performance on the large-scale V2A dataset, VGGSound. ... We conduct our experiments using VGGSound [Chen et al., 2020a], a large-scale audio-visual dataset... Table 1: Evaluation results for Video-to-Audio generation across three test sets... Table 3: We explore the effect of different cinematic language variations f during training...
Researcher Affiliation | Academia | Feizhen Huang, Yu Wu, Yutian Lin and Bo Du, School of Computer Science, Wuhan University
Pseudocode | No | The paper describes methods and models using mathematical equations and textual descriptions, but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide a specific link to source code, nor does it contain an explicit statement about releasing its code in supplementary materials or otherwise. It mentions building upon 'the open-source Diff-Foley [Luo et al., 2024]', but this refers to a third-party tool, not the authors' own implementation code.
Open Datasets | Yes | We conduct our experiments using VGGSound [Chen et al., 2020a], a large-scale audio-visual dataset containing over 200,000 video clips across 309 distinct sound categories.
Dataset Splits | Yes | We follow the original VGGSound train/test split. ... To evaluate performance under partial visibility, we create two modified test sets by applying cinematic language variations to the VGGSound [Chen et al., 2020a] test set. (A hedged sketch of such a variation follows the table.)
Hardware Specification | Yes | The student model is trained for 25 epochs on 4 NVIDIA 4090 GPUs, using the AdamW optimizer with a learning rate of 5 × 10⁻⁴ and a total batch size of 32.
Software Dependencies | No | The paper mentions using a pre-trained video encoder from CAVP [Luo et al., 2024] and building upon Diff-Foley [Luo et al., 2024], but it does not specify version numbers for these or any other software components (e.g., programming languages, libraries, frameworks).
Experiment Setup | Yes | The input video clips are sampled at 4 frames per second (FPS)... For training, we only apply cinematic language variation f_cu on the VGGSound [Chen et al., 2020a] training set with k = 75%, where a1 = 0.4 and a2 = 0.6. The student model is trained for 25 epochs on 4 NVIDIA 4090 GPUs, using the AdamW optimizer with a learning rate of 5 × 10⁻⁴ and a total batch size of 32. ... we use only the CFG [Ho and Salimans, 2022] configuration in Diff-Foley, keeping all other experimental settings unchanged, including the DPM-Solver [Lu et al., 2022] sampler with 25 inference steps and a CFG scale of 4.5. (A hedged configuration sketch follows the table.)
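
The rows above quote the cinematic language variation f_cu with k = 75%, a1 = 0.4, and a2 = 0.6, but this extract does not reproduce its definition. The following is a minimal sketch of one plausible reading, assuming f_cu is a crop-style transform in which k sets the retained side length and a1/a2 bound the crop window's relative offset; the function name and all parameter interpretations are ours, not the authors'.

    import numpy as np

    def apply_cut_variation(frames, k=0.75, a1=0.4, a2=0.6, rng=None):
        """Hypothetical crop-style variation simulating a partially
        visible sound source.

        frames: (T, H, W, C) array of video frames (e.g., 4 FPS clips).
        k:      assumed fraction of each side retained after cropping.
        a1, a2: assumed bounds on the crop window's relative offset;
                the extract does not define them, so this is a guess.
        """
        if rng is None:
            rng = np.random.default_rng()
        T, H, W, C = frames.shape
        ch, cw = int(H * k), int(W * k)
        # One offset per clip keeps the crop temporally consistent.
        top = int(rng.uniform(a1, a2) * (H - ch))
        left = int(rng.uniform(a1, a2) * (W - cw))
        return frames[:, top:top + ch, left:left + cw, :]

Under this reading, applying the same transform to the test split would produce the modified evaluation sets described in the Dataset Splits row.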
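
For quick reference, the quoted training and inference hyperparameters can be collected into a single configuration. This is a summary of the numbers above, not the authors' code; the dictionary names are illustrative.

    # Training settings quoted in the Experiment Setup row.
    train_cfg = dict(
        video_fps=4,           # input clips sampled at 4 FPS
        variation="f_cu",      # only f_cu applied during training
        k=0.75,                # with a1 = 0.4 and a2 = 0.6
        a1=0.4,
        a2=0.6,
        epochs=25,
        num_gpus=4,            # NVIDIA 4090
        optimizer="AdamW",
        learning_rate=5e-4,
        total_batch_size=32,
    )

    # Inference settings, unchanged from Diff-Foley apart from CFG use.
    sampler_cfg = dict(
        sampler="DPM-Solver",  # [Lu et al., 2022]
        inference_steps=25,
        cfg_scale=4.5,         # classifier-free guidance [Ho and Salimans, 2022]
    )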