Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation
Authors: Peiwen Sun, Sitong Cheng, Xiangtai Li, Zhen Ye, Huadai Liu, Honggang Zhang, Wei Xue, Yike Guo
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, under fair settings, we conduct subjective and objective evaluations on simulated and real-world data to compare our approach with prevailing methods. The results demonstrate the effectiveness of our method, highlighting its capability to generate spatial audio that adheres to physical rules. |
| Researcher Affiliation | Academia | Peiwen Sun, Sitong Cheng, Xiangtai Li, Zhen Ye, Huadai Liu, Honggang Zhang, Wei Xue, Yike Guo. Hong Kong University of Science and Technology; Beijing University of Posts and Telecommunications; Nanyang Technological University; Zhejiang University |
| Pseudocode | No | The paper describes mathematical formulations and processes using equations (1) through (8) in Sections 4.3 and 4.4, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps formatted like code. |
| Open Source Code | Yes | Developing a semi-automated pipeline to create an open-source, large-scale, stereo audio dataset with spatial captions, BEWO-1M and supporting both large-scale training and precise evaluation. All evaluation codes will be publicly accessible. |
| Open Datasets | Yes | We propose BEWO-1M, a large-scale stereo audio dataset with spatial captions, as the first to the best of our knowledge. Developing a semi-automated pipeline to create an open-source, large-scale, stereo audio dataset with spatial captions, BEWO-1M... We have taken investigation into each previous dataset involved in our BEWO-1M. For the purpose of open access, we follow each dataset involved in BEWO-1M and apply the license including the Creative Commons Attribution (CC BY) license to any Author Accepted Manuscript version arising. |
| Dataset Splits | Yes | In summary, we constructed 2.8k hours of training audio with more than 1M audio-text pairs and approximately 17 hours of validation data with 6.2k pairs. The test and validation subsets from AudioCaps are used to construct our test set, while all other data is used for the training set. |
| Hardware Specification | Yes | For our main experiments, we train a text-conditional Diffusion Transformer (DiT) (Levy et al., 2023; Peebles & Xie, 2023), which is optimized using 8 NVIDIA RTX 4090 GPUs for 500K steps. We train this bidirectional multimodal encoder for 100 epochs on 1 NVIDIA RTX 4090. |
| Software Dependencies | No | The paper mentions various software components such as 'T5 encoder', 'DPMSolver++', 'AdamW optimizer', 'Pyroomacoustics', 'gpuRIR', 'GPT-4 and GPT-4o', 'CLIP', 'Sentence Transformer', 'FAISS library', 'Mask-RCNN', 'CNN14 from PANNs', and 'BERT'. However, it generally refers to these tools by name or by citing their original papers without providing specific version numbers for the software libraries or frameworks themselves (e.g., PyTorch, TensorFlow, CUDA versions are not mentioned). |
| Experiment Setup | Yes | The base learning rate is set to 2e-5 with a batch size of 128 and audio length of 10s. It integrates a conditioning mechanism consisting of multiple configurations: a text-based prompt processed by a T5 transformer model (T5-base with a maximum length of 128), and azimuth state encoding with an output dimension of 768. The conditioning dimension is set at 768. The diffusion component utilizes a Diffusion Transformer (DiT) with settings that include 64 input/output channels, an embedding dimension of 1536, 24 layers, 24 attention heads, and both local and global conditioning dimensions of 768 and 1536, respectively. Notably, the transformer operates with projecting condition tokens and adheres to a continuous transformer architecture. For training, an exponential moving average (EMA) is used alongside an AdamW optimizer with a learning rate of 2e-5, beta values of [0.9, 0.999], and a weight decay of 1e-3, complemented by an inverse LR scheduler that features an inv_gamma of 1e6, a power of 0.5, and a high warmup proportion of 0.99. This configuration underscores our commitment to refining audio quality and temporal alignment in generative tasks. During inference, we use the DPMSolver++ (Lu et al., 2022) for 100 steps with classifier-free guidance (scale of 6.0). |
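For readers attempting reproduction, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. The key names below are illustrative assumptions (the paper does not publish its config schema); only the values are taken from the paper's description.

```python
# Hedged sketch: hyperparameters reported for the BEWO DiT training run,
# gathered into a plain Python dict. Key names are illustrative assumptions;
# the values come from the paper's Experiment Setup description.
dit_training_config = {
    "model": {
        "architecture": "DiT",        # text-conditional Diffusion Transformer
        "io_channels": 64,            # input/output channels
        "embed_dim": 1536,
        "num_layers": 24,
        "num_heads": 24,
        "local_cond_dim": 768,
        "global_cond_dim": 1536,
    },
    "conditioning": {
        "text_encoder": "T5-base",
        "max_text_length": 128,
        "azimuth_state_dim": 768,     # azimuth state encoding output dim
        "cond_dim": 768,
    },
    "optimizer": {                    # AdamW, as reported
        "lr": 2e-5,
        "betas": (0.9, 0.999),
        "weight_decay": 1e-3,
    },
    "lr_scheduler": {                 # inverse LR schedule, as reported
        "inv_gamma": 1e6,
        "power": 0.5,
        "warmup": 0.99,
    },
    "training": {
        "batch_size": 128,
        "audio_length_s": 10,
        "steps": 500_000,
        "ema": True,
        "hardware": "8x NVIDIA RTX 4090",
    },
    "inference": {
        "sampler": "DPMSolver++",
        "steps": 100,
        "cfg_scale": 6.0,             # classifier-free guidance scale
    },
}
```

A config dict like this makes it easy to diff a reimplementation against the reported settings before launching a long training run.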