ViSAGe: Video-to-Spatial Audio Generation
Authors: Jaeyeon Kim, Heeseung Yun, Gunhee Kim
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that ViSAGe produces plausible and coherent first-order ambisonics, outperforming two-stage approaches consisting of video-to-audio generation and audio spatialization. Qualitative examples further illustrate that ViSAGe generates temporally aligned high-quality spatial audio that adapts to viewpoint changes. ... Extensive experiments on YT-Ambigen show that ViSAGe outperforms two-stage approaches, which separately handle video-to-audio generation and audio spatialization, across all metrics. |
| Researcher Affiliation | Academia | Jaeyeon Kim, Heeseung Yun & Gunhee Kim Seoul National University EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the model architecture and code generation pattern in Section 5 and Figure 2(b), but it does not contain a clearly labeled pseudocode or algorithm block. |
| Open Source Code | No | Project page: https://jaeyeonkim99.github.io/visage. This is a link to a project page, not directly to a source code repository for the methodology described in the paper. |
| Open Datasets | No | To support this task, we introduce YT-Ambigen, a dataset comprising 102K 5-second YouTube video clips paired with corresponding first-order ambisonics. The paper introduces this new dataset but does not provide concrete access information (e.g., a specific link, DOI, or explicit repository) for public availability. |
| Dataset Splits | Yes | YT-Ambigen comprises a total of 102,364 five-second FoV clips with corresponding FOA and camera direction (ϕ, θ), which is divided into 81,594 / 9,604 / 11,166 clips for training, validation, and test, respectively. |
| Hardware Specification | Yes | Training is conducted on 2 NVIDIA A6000 or A40 GPUs with a batch size of 64. |
| Software Dependencies | Yes | We also utilize bfloat16 precision and Flash Attention-2 (Dao, 2024) to accelerate the training process. ... We use the audioldm_eval library (Liu et al., 2023) to compute all metrics. |
| Experiment Setup | Yes | For pretraining on VGGSound, we use a constant learning rate of 1e-4 with 4000 warmup steps. For finetuning on YT-Ambigen, we apply a constant learning rate of 1e-4 without warmup. When training from scratch on YT-Ambigen, we use a constant learning rate of 2e-4 with 4000 warmup steps. The AdamW optimizer is adopted with a weight decay of 1e-2 and a gradient clipping norm of 1.0. Training is conducted on 2 NVIDIA A6000 or A40 GPUs with a batch size of 64. ... Based on a hyperparameter sweep, we use guidance scale ω = 2.5 throughout the experiments. |
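The learning-rate schedule quoted in the Experiment Setup row can be sketched in plain Python. This is a minimal sketch: the paper states only "constant learning rate with 4000 warmup steps", so the linear warmup shape and the function names (`lr_at_step`, `finetune_lr`) are assumptions, not the authors' code.

```python
def lr_at_step(step, base_lr=1e-4, warmup_steps=4000):
    """Constant learning rate after warmup (pretraining / from-scratch setting).

    Linear warmup is assumed; the paper only states a 4000-step warmup
    followed by a constant rate (1e-4 for pretraining, 2e-4 from scratch).
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


def finetune_lr(step, base_lr=1e-4):
    """Finetuning on YT-Ambigen: constant 1e-4 with no warmup."""
    return base_lr
```

The from-scratch setting would reuse `lr_at_step` with `base_lr=2e-4`; AdamW with weight decay 1e-2 and gradient clipping at norm 1.0 would then be applied per step as in the quoted setup.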
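The guidance scale ω = 2.5 presumably refers to classifier-free guidance for the generator. Assuming the standard formulation (the paper does not spell it out), the unconditional and conditional model predictions would be combined as follows; `cfg_combine` is a hypothetical helper name:

```python
def cfg_combine(eps_uncond, eps_cond, omega=2.5):
    """Classifier-free guidance mix (assumed standard form):
    eps_hat = eps_uncond + omega * (eps_cond - eps_uncond).

    omega = 2.5 is the value the paper reports from its hyperparameter sweep.
    """
    return [eu + omega * (ec - eu) for eu, ec in zip(eps_uncond, eps_cond)]
```

With ω > 1, the combination extrapolates past the conditional prediction, pushing generations to follow the video conditioning more strongly.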