Read, Watch and Scream! Sound Generation from Text and Video

Authors: Yujin Jeong, Yunji Kim, Sanghyuk Chun, Jiyoung Lee

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that our method shows superiority in terms of quality, controllability, and training efficiency. Our method shows the best fidelity score (FD), structure prediction (energy MAE), and AV-alignment score on VGGSound. Moreover, we achieve the best AP and energy MAE on Greatest Hits without using reference audio samples, unlike CondFoleyGen (Du et al. 2023). As shown in the qualitative study, ReWaS can capture the challenging short transition in the video when the skateboarder jumps into the air and no skateboarding sound is present.
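The energy MAE metric mentioned above compares the temporal energy envelopes of generated and reference audio. A minimal sketch of such a metric, assuming per-frame RMS energy with illustrative frame sizes (the paper's exact framing and normalization are not specified here):

```python
import math

def rms_envelope(wav, frame_len=1024, hop=256):
    """Per-frame RMS energy of a mono waveform (sequence of floats)."""
    n_frames = 1 + (len(wav) - frame_len) // hop
    return [math.sqrt(sum(x * x for x in wav[i * hop:i * hop + frame_len]) / frame_len)
            for i in range(n_frames)]

def energy_mae(gen_wav, ref_wav):
    """Mean absolute error between two RMS energy envelopes.

    A sketch of an energy-MAE-style metric; frame length, hop size,
    and any normalization are assumptions, not the paper's values.
    """
    e_gen, e_ref = rms_envelope(gen_wav), rms_envelope(ref_wav)
    n = min(len(e_gen), len(e_ref))
    return sum(abs(a - b) for a, b in zip(e_gen[:n], e_ref[:n])) / n
```

Identical waveforms yield an energy MAE of zero; a lower score indicates better temporal alignment of loudness between generated and ground-truth audio.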
Researcher Affiliation | Industry | Yujin Jeong*, Yunji Kim, Sanghyuk Chun, and Jiyoung Lee are affiliated with NAVER AI Lab. Corresponding author: EMAIL
Pseudocode | No | The paper describes the method using natural language and mathematical equations, without including any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code and demo: https://naver-ai.github.io/rewas
Open Datasets | Yes | We compare our method and other state-of-the-art video-to-audio generation models (Du et al. 2023; Xing et al. 2024; Luo et al. 2024; Iashin and Rahtu 2021; Sheffer and Adi 2023) on two video-audio aligned datasets, VGGSound (Chen et al. 2020) and Greatest Hits (Owens et al. 2016).
Dataset Splits | Yes | We randomly sampled 3K videos to construct the VGGSound test subset. To evaluate temporal alignment accuracy, we use the Greatest Hits (Owens et al. 2016) test set, which includes videos of a drumstick hitting various materials. Since Greatest Hits samples have distinct audio properties compared to the other audio samples, we fine-tune ReWaS on the Greatest Hits training samples.
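The 3K-video test subset above is drawn by random sampling. A minimal sketch of such a selection step, assuming a fixed seed for reproducibility (the paper does not report its actual seed or selection procedure):

```python
import random

def sample_test_subset(video_ids, k=3000, seed=0):
    """Draw a fixed-size random evaluation subset without replacement.

    `seed` is an illustrative assumption: fixing it makes the subset
    reproducible across runs; sorting makes the output order stable.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(list(video_ids), k))
```

With the same seed and ID list, every evaluation run scores the same 3K videos, which is what makes the reported test-subset numbers comparable.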
Hardware Specification | Yes | Luo et al. (2024) use 8 A100 GPUs for 140 hours for feature alignment and LDM tuning, whereas we use 4 V100 GPUs for a total of 33 hours.
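The training-cost gap implied by the figures above can be made explicit in GPU-hours (noting that the GPU models differ, so raw GPU-hours actually understate the difference):

```python
def gpu_hours(num_gpus, hours):
    """Total GPU-hours for a training run."""
    return num_gpus * hours

# Figures quoted in the paper's hardware comparison.
baseline = gpu_hours(8, 140)  # Luo et al. (2024): A100 GPUs
rewas = gpu_hours(4, 33)      # ReWaS: V100 GPUs
# 1120 vs. 132 GPU-hours, roughly an 8.5x reduction.
```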
Software Dependencies | No | The paper mentions using the NAVER Smart Machine Learning (NSML) platform but does not provide specific version numbers for software dependencies such as Python, PyTorch, or other libraries used for implementation.
Experiment Setup | Yes | We train 22M parameters for video projection to audio conditional control, and 182M parameters for fine-tuning the AudioLDM with our energy adapter. During training, we randomly drop E_y with probability 0.3 for better control. We use DDIM (Song, Meng, and Ermon 2020) to generate sound from the noise.
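Randomly dropping the energy condition E_y with probability 0.3 is the classifier-free-guidance-style training trick. A framework-agnostic sketch of that step, where `null_embed` (a learned or fixed placeholder condition) and the replacement mechanism are assumptions, not the paper's exact implementation:

```python
import random

def maybe_drop_condition(batch_ey, null_embed, drop_prob=0.3, rng=None):
    """Replace the energy condition E_y with a null embedding for a
    random subset of the batch.

    Training with occasionally-nulled conditions lets the model also
    learn the unconditional distribution, enabling guidance-style
    control over how strongly E_y steers generation at sampling time.
    """
    rng = rng or random.Random()
    return [null_embed if rng.random() < drop_prob else ey
            for ey in batch_ey]
```

With `drop_prob=0.3`, roughly 30% of batch items are trained without the energy condition, matching the drop probability reported in the paper.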