Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation
Authors: Feizhen Huang, Yu Wu, Yutian Lin, Bo Du
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method not only achieves impressive improvements under partial visibility across all evaluation metrics, but also enhances performance on the large-scale V2A dataset, VGGSound. ... We conduct our experiments using VGGSound [Chen et al., 2020a], a large-scale audio-visual dataset... Table 1: Evaluation results for Video-to-Audio generation across three test sets... Table 3: We explore the effect of different cinematic language variations f during training... |
| Researcher Affiliation | Academia | Feizhen Huang, Yu Wu, Yutian Lin and Bo Du, School of Computer Science, Wuhan University |
| Pseudocode | No | The paper describes methods and models using mathematical equations and textual descriptions, but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide a specific link to source code, nor does it contain an explicit statement about releasing its code in supplementary materials or otherwise. It mentions building upon "the open-source Diff-Foley [Luo et al., 2024]", but this refers to a third-party tool, not the authors' own implementation code. |
| Open Datasets | Yes | We conduct our experiments using VGGSound [Chen et al., 2020a], a large-scale audio-visual dataset containing over 200,000 video clips across 309 distinct sound categories. |
| Dataset Splits | Yes | We follow the original VGGSound train/test split. ... To evaluate performance under partial visibility, we create two modified test sets by applying cinematic language variations to the VGGSound [Chen et al., 2020a] test set. |
| Hardware Specification | Yes | The student model is trained for 25 epochs on 4 NVIDIA 4090 GPUs, using the AdamW optimizer with a learning rate of 5×10⁻⁴ and a total batch size of 32. |
| Software Dependencies | No | The paper mentions using a pre-trained video encoder from CAVP [Luo et al., 2024] and building upon Diff-Foley [Luo et al., 2024], but it does not specify version numbers for these or any other software components (e.g., programming languages, libraries, frameworks). |
| Experiment Setup | Yes | The input video clips are sampled at 4 frames per second (FPS)... For training, we only apply cinematic language variation f_cu on the VGGSound [Chen et al., 2020a] training set with k = 75%, where a₁ = 0.4 and a₂ = 0.6. The student model is trained for 25 epochs on 4 NVIDIA 4090 GPUs, using the AdamW optimizer with a learning rate of 5×10⁻⁴ and a total batch size of 32. ... we use only the CFG [Ho and Salimans, 2022] configuration in Diff-Foley, keeping all other experimental settings unchanged, including the DPM-Solver [Lu et al., 2022] sampler with 25 inference steps and a CFG scale of 4.5. |
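The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. This is a minimal illustration, not the authors' code: every field name below is hypothetical, and only the numeric values and component names come from the paper's quoted text.

```python
from dataclasses import dataclass

# Hypothetical field names; values are those quoted in the
# paper's Experiment Setup row (not the authors' actual code).

@dataclass
class TrainConfig:
    fps: int = 4                 # input video sampling rate
    variation: str = "f_cu"      # cinematic language variation used in training
    k: float = 0.75              # k = 75%
    a1: float = 0.4
    a2: float = 0.6
    epochs: int = 25
    num_gpus: int = 4            # NVIDIA 4090
    optimizer: str = "AdamW"
    learning_rate: float = 5e-4
    batch_size: int = 32         # total across GPUs

@dataclass
class InferenceConfig:
    sampler: str = "DPM-Solver"
    steps: int = 25
    cfg_scale: float = 4.5       # classifier-free guidance scale

train_cfg = TrainConfig()
infer_cfg = InferenceConfig()
print(train_cfg.learning_rate, train_cfg.batch_size, infer_cfg.cfg_scale)
```

Grouping the training-time and sampling-time settings into separate dataclasses mirrors the paper's split between the self-distillation training setup and the unchanged Diff-Foley inference settings.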