Read, Watch and Scream! Sound Generation from Text and Video

Authors: Yujin Jeong, Yunji Kim, Sanghyuk Chun, Jiyoung Lee

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that our method shows superiority in terms of quality, controllability, and training efficiency. Our method shows the best fidelity score (FD), structure prediction (energy MAE), and AV-alignment score on VGGSound. Moreover, we achieve the best AP and energy MAE on Greatest Hits without using reference audio samples, unlike CondFoleyGen (Du et al. 2023). As shown in the qualitative study, ReWaS can capture the challenging short transition in the video when the skateboarder jumps into the air and no skateboarding sound is present.
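The energy MAE metric mentioned above compares the temporal energy envelopes of generated and reference audio. A minimal sketch of such a metric, assuming per-frame RMS energy with illustrative frame sizes (the paper's exact framing and normalization are not specified here):

```python
import math

def rms_envelope(wav, frame_len=1024, hop=256):
    """Per-frame RMS energy of a mono waveform (sequence of floats)."""
    n_frames = 1 + (len(wav) - frame_len) // hop
    return [math.sqrt(sum(x * x for x in wav[i * hop:i * hop + frame_len]) / frame_len)
            for i in range(n_frames)]

def energy_mae(gen_wav, ref_wav):
    """Mean absolute error between two RMS energy envelopes.

    A sketch of an energy-MAE-style metric; frame length, hop size,
    and any normalization are assumptions, not the paper's values.
    """
    e_gen, e_ref = rms_envelope(gen_wav), rms_envelope(ref_wav)
    n = min(len(e_gen), len(e_ref))
    return sum(abs(a - b) for a, b in zip(e_gen[:n], e_ref[:n])) / n
```

Identical waveforms yield an energy MAE of zero; a lower score indicates better temporal alignment of loudness between generated and ground-truth audio.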
Researcher Affiliation | Industry | Yujin Jeong*, Yunji Kim, Sanghyuk Chun, and Jiyoung Lee are affiliated with NAVER AI Lab. Corresponding author: EMAIL
Pseudocode | No | The paper describes the method using natural language and mathematical equations, without including any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code and demo: https://naver-ai.github.io/rewas
Open Datasets | Yes | We compare our method and other state-of-the-art video-to-audio generation models (Du et al. 2023; Xing et al. 2024; Luo et al. 2024; Iashin and Rahtu 2021; Sheffer and Adi 2023) on two video-audio aligned datasets, VGGSound (Chen et al. 2020) and Greatest Hits (Owens et al. 2016).
Dataset Splits | Yes | We randomly sampled 3K videos to construct the VGGSound test subset. To evaluate temporal alignment accuracy, we use the Greatest Hits (Owens et al. 2016) test set, which includes videos of a drumstick hitting various materials. Since Greatest Hits samples have distinct audio properties compared to the other audio samples, we fine-tune ReWaS on the Greatest Hits training samples.
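The 3K-video test subset above is drawn by random sampling. A minimal sketch of such a selection step, assuming a fixed seed for reproducibility (the paper does not report its actual seed or selection procedure):

```python
import random

def sample_test_subset(video_ids, k=3000, seed=0):
    """Draw a fixed-size random evaluation subset without replacement.

    `seed` is an illustrative assumption: fixing it makes the subset
    reproducible across runs; sorting makes the output order stable.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(list(video_ids), k))
```

With the same seed and ID list, every evaluation run scores the same 3K videos, which is what makes the reported test-subset numbers comparable.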
Hardware Specification | Yes | Luo et al. (2024) use 8 A100 GPUs for 140 hours for feature alignment and LDM tuning, whereas we use 4 V100 GPUs for a total of 33 hours.
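The training-cost gap implied by the figures above can be made explicit in GPU-hours (noting that the GPU models differ, so raw GPU-hours actually understate the difference):

```python
def gpu_hours(num_gpus, hours):
    """Total GPU-hours for a training run."""
    return num_gpus * hours

# Figures quoted in the paper's hardware comparison.
baseline = gpu_hours(8, 140)  # Luo et al. (2024): A100 GPUs
rewas = gpu_hours(4, 33)      # ReWaS: V100 GPUs
# 1120 vs. 132 GPU-hours, roughly an 8.5x reduction.
```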
Software Dependencies | No | The paper mentions using the NAVER Smart Machine Learning (NSML) platform but does not provide specific version numbers for software dependencies such as Python, PyTorch, or other libraries used for implementation.
Experiment Setup | Yes | We train 22M parameters for video projection to audio conditional control, and 182M parameters for fine-tuning the AudioLDM with our energy adapter. During training, we randomly drop E_y with probability 0.3 for better control. We use DDIM (Song, Meng, and Ermon 2020) to generate sound from the noise.
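Randomly dropping the energy condition E_y with probability 0.3 is the classifier-free-guidance-style training trick. A framework-agnostic sketch of that step, where `null_embed` (a learned or fixed placeholder condition) and the replacement mechanism are assumptions, not the paper's exact implementation:

```python
import random

def maybe_drop_condition(batch_ey, null_embed, drop_prob=0.3, rng=None):
    """Replace the energy condition E_y with a null embedding for a
    random subset of the batch.

    Training with occasionally-nulled conditions lets the model also
    learn the unconditional distribution, enabling guidance-style
    control over how strongly E_y steers generation at sampling time.
    """
    rng = rng or random.Random()
    return [null_embed if rng.random() < drop_prob else ey
            for ey in batch_ey]
```

With `drop_prob=0.3`, roughly 30% of batch items are trained without the energy condition, matching the drop probability reported in the paper.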