ViSAGe: Video-to-Spatial Audio Generation

Authors: Jaeyeon Kim, Heeseung Yun, Gunhee Kim

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental results demonstrate that ViSAGe produces plausible and coherent first-order ambisonics, outperforming two-stage approaches consisting of video-to-audio generation and audio spatialization. Qualitative examples further illustrate that ViSAGe generates temporally aligned high-quality spatial audio that adapts to viewpoint changes. ... Extensive experiments on YT-Ambigen show that ViSAGe outperforms two-stage approaches, which separately handle video-to-audio generation and audio spatialization, across all metrics."
Researcher Affiliation | Academia | Jaeyeon Kim, Heeseung Yun & Gunhee Kim, Seoul National University. EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the model architecture and code-generation pattern in Section 5 and Figure 2(b), but it does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code | No | Project page: https://jaeyeonkim99.github.io/visage. This is a link to a project page, not directly to a source-code repository for the methodology described in the paper.
Open Datasets | No | "To support this task, we introduce YT-Ambigen, a dataset comprising 102K 5-second YouTube video clips paired with corresponding first-order ambisonics." The paper introduces this new dataset but does not provide concrete access information (e.g., a specific link, DOI, or explicit repository) for public availability.
Dataset Splits | Yes | "YT-Ambigen comprises a total of 102,364 five-second FoV clips with corresponding FOA and camera direction (ϕ, θ), which is divided into 81,594 / 9,604 / 11,166 clips for training, validation, and test, respectively."
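The reported split sizes are internally consistent, which can be sanity-checked in a few lines of Python (the counts come from the paper; the variable names are illustrative only):

```python
# Reported YT-Ambigen split sizes (train / validation / test), from the paper.
splits = {"train": 81_594, "val": 9_604, "test": 11_166}

# The splits should sum to the reported total of 102,364 clips.
total = sum(splits.values())
assert total == 102_364

# Approximate share of each split: roughly an 80/10/10 partition.
fractions = {name: round(count / total, 3) for name, count in splits.items()}
print(fractions)  # → {'train': 0.797, 'val': 0.094, 'test': 0.109}
```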
Hardware Specification | Yes | "Training is conducted on 2 NVIDIA A6000 or A40 GPUs with a batch size of 64."
Software Dependencies | Yes | "We also utilize bfloat16 precision and FlashAttention-2 (Dao, 2024) to accelerate the training process. ... We use the audioldm_eval library (Liu et al., 2023) to compute all metrics."
Experiment Setup | Yes | "For pretraining on VGGSound, we use a constant learning rate of 1e-4 with 4000 warmup steps. For finetuning on YT-Ambigen, we apply a constant learning rate of 1e-4 without warmup. When training from scratch on YT-Ambigen, we use a constant learning rate of 2e-4 with 4000 warmup steps. The AdamW optimizer is adopted with a weight decay of 1e-2 and a gradient clipping norm of 1.0. Training is conducted on 2 NVIDIA A6000 or A40 GPUs with a batch size of 64. ... Based on a hyperparameter sweep, we use guidance scale ω = 2.5 throughout the experiments."
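The three learning-rate schedules quoted above (constant rate, with or without a warmup phase) can be sketched as one small helper. This is a minimal sketch, not the authors' code; in particular, a linear warmup shape is an assumption, since the paper states only the warmup step count:

```python
def lr_at_step(step: int, base_lr: float, warmup_steps: int) -> float:
    """Constant learning rate after an (assumed linear) warmup phase."""
    if warmup_steps == 0:
        return base_lr  # finetuning on YT-Ambigen: constant 1e-4, no warmup
    return base_lr * min(1.0, (step + 1) / warmup_steps)

# Pretraining on VGGSound: 1e-4 with 4000 warmup steps.
assert lr_at_step(3_999, base_lr=1e-4, warmup_steps=4_000) == 1e-4   # warmup done
assert lr_at_step(100_000, base_lr=1e-4, warmup_steps=4_000) == 1e-4 # stays constant

# Training from scratch on YT-Ambigen: 2e-4 with 4000 warmup steps.
assert lr_at_step(1_999, base_lr=2e-4, warmup_steps=4_000) == 1e-4   # halfway through warmup
```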