OmniAudio: Generating Spatial Audio from 360-Degree Video
Authors: Huadai Liu, Tianyi Luo, Kaicheng Luo, Qikai Jiang, Peiwen Sun, Jialei Wang, Rongjie Huang, Qian Chen, Wen Wang, Xiangtai Li, Shiliang Zhang, Zhijie Yan, Zhou Zhao, Wei Xue
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that OmniAudio achieves state-of-the-art performance across both objective and subjective metrics on Sphere360. Code and datasets are available at github.com/liuhuadai/OmniAudio. The project website is available at OmniAudio-360V2SA.github.io. |
| Researcher Affiliation | Collaboration | 1Hong Kong University of Science and Technology, Hong Kong, China 2Tongyi Lab, Alibaba Group, Hangzhou, China 3Zhejiang University, Hangzhou, China 4FAIR, Meta, USA 5Nanyang Technological University, Singapore. |
| Pseudocode | No | The paper includes architectural diagrams (Figure 9 and Figure 10) but no explicitly labeled pseudocode or algorithm blocks describing the procedural steps of the methodology. |
| Open Source Code | Yes | Code and datasets are available at github.com/liuhuadai/OmniAudio. |
| Open Datasets | Yes | Code and datasets are available at github.com/liuhuadai/OmniAudio. We create and release Sphere360, a comprehensive dataset of 103,000 video clips with spatial audio, along with its semi-automated construction pipeline. |
| Dataset Splits | No | The paper states: "We partition the dataset into training and test sets based on video IDs, to maintain the integrity of audio event distributions for evaluation and prevent data leakage." and "We construct a benchmark encompassing about 180 distinct audio events using our semi-automated data-cleaning process." However, it does not specify the explicit percentages or counts of the training, validation, and test splits for the main Sphere360 dataset, which are necessary for reproduction. |
| Hardware Specification | Yes | For VAE training, we employ mixed precision training with a batch size of 144 across 24 A800 GPUs for 500,000 steps. Subsequently, following Evans et al. (2024), we freeze the VAE encoder and train the VAE decoder with a latent mask ratio of 0.1 for an additional 300,000 steps. In the self-supervised pre-training phase, we apply a mask with a conditioning probability p_cond of 0.1. We utilize exponential moving average and automatic mixed precision for 100,000 steps on 8 A100 GPUs, with an effective batch size of 256. For the Video-Guided fine-tuning stage, we similarly apply exponential moving average and automatic mixed precision for 50,000 steps on 8 A100 GPUs, maintaining an effective batch size of 256. |
| Software Dependencies | No | The paper mentions "AdamW (Loshchilov & Hutter, 2019) as the optimizer" but does not specify any software libraries, frameworks, or their version numbers (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | For VAE training, we employ mixed precision training with a batch size of 144 across 24 A800 GPUs for 500,000 steps. Subsequently, following Evans et al. (2024), we freeze the VAE encoder and train the VAE decoder with a latent mask ratio of 0.1 for an additional 300,000 steps. We use AdamW (Loshchilov & Hutter, 2019) as the optimizer, setting the generator learning rate to 3e-5 and the discriminator learning rate to 6e-5. In the self-supervised pre-training phase, we apply a mask with a conditioning probability p_cond of 0.1. We utilize exponential moving average and automatic mixed precision for 100,000 steps on 8 A100 GPUs, with an effective batch size of 256. For the Video-Guided fine-tuning stage, we similarly apply exponential moving average and automatic mixed precision for 50,000 steps on 8 A100 GPUs, maintaining an effective batch size of 256. AdamW remains our optimizer of choice, with a learning rate set at 5e-5. The flow component employs a Diffusion Transformer (DiT) with an embedding dimension of 1536. It comprises 24 layers and 24 attention heads, with local and global conditioning dimensions of 768 and 1536, respectively. The transformer operates by projecting condition tokens and adheres to a continuous transformer architecture. Table 9 also details model configurations at different scales (Large, Medium, Small). |
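The hyperparameters extracted above can be summarized in a single configuration sketch, which makes the three training stages easier to compare at a glance. This is an illustrative reconstruction, not a config file published by the authors: all class and field names below are hypothetical, and only the numeric values are taken from the paper's reported setup.

```python
from dataclasses import dataclass

@dataclass
class VAEStageConfig:
    """VAE training stage (24x A800 GPUs, mixed precision), per the paper."""
    batch_size: int = 144
    train_steps: int = 500_000
    decoder_finetune_steps: int = 300_000  # encoder frozen for this phase
    latent_mask_ratio: float = 0.1
    generator_lr: float = 3e-5             # AdamW
    discriminator_lr: float = 6e-5         # AdamW

@dataclass
class FlowModelConfig:
    """DiT flow backbone dimensions reported for the base model."""
    embed_dim: int = 1536
    num_layers: int = 24
    num_heads: int = 24
    local_cond_dim: int = 768
    global_cond_dim: int = 1536

@dataclass
class PretrainStageConfig:
    """Self-supervised pre-training (8x A100, EMA + AMP)."""
    train_steps: int = 100_000
    effective_batch_size: int = 256
    p_cond: float = 0.1  # conditioning mask probability

@dataclass
class FinetuneStageConfig:
    """Video-guided fine-tuning (8x A100, EMA + AMP)."""
    train_steps: int = 50_000
    effective_batch_size: int = 256
    learning_rate: float = 5e-5  # AdamW
```

For example, `FinetuneStageConfig().learning_rate` yields the 5e-5 fine-tuning rate, and the two stages sharing `effective_batch_size = 256` is visible directly from the field defaults. The paper's Table 9 additionally varies these dimensions for Large/Medium/Small variants, which this sketch does not reproduce.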