Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation
Authors: Peiwen Sun, Sitong Cheng, Xiangtai Li, Zhen Ye, Huadai Liu, Honggang Zhang, Wei Xue, Yike Guo
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, under fair settings, we conduct subjective and objective evaluations on simulated and real-world data to compare our approach with prevailing methods. The results demonstrate the effectiveness of our method, highlighting its capability to generate spatial audio that adheres to physical rules. |
| Researcher Affiliation | Academia | Peiwen Sun, Sitong Cheng, Xiangtai Li, Zhen Ye, Huadai Liu, Honggang Zhang, Wei Xue, Yike Guo. Hong Kong University of Science and Technology; Beijing University of Posts and Telecommunications; Nanyang Technological University; Zhejiang University |
| Pseudocode | No | The paper describes mathematical formulations and processes using equations (1) through (8) in Sections 4.3 and 4.4, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps formatted like code. |
| Open Source Code | Yes | Developing a semi-automated pipeline to create an open-source, large-scale, stereo audio dataset with spatial captions, BEWO-1M and supporting both large-scale training and precise evaluation. All evaluation codes will be publicly accessible. |
| Open Datasets | Yes | We propose BEWO-1M, a large-scale stereo audio dataset with spatial captions, as the first to the best of our knowledge. Developing a semi-automated pipeline to create an open-source, large-scale, stereo audio dataset with spatial captions, BEWO-1M... We have taken investigation into each previous dataset involved in our BEWO-1M. For the purpose of open access, we follow each dataset involved in BEWO-1M and apply the license including the Creative Commons Attribution (CC BY) license to any Author Accepted Manuscript version arising. |
| Dataset Splits | Yes | In summary, we constructed 2.8k hours of training audio with more than 1M audio-text pairs and approximately 17 hours of validation data with 6.2k pairs. The test and validation subsets from AudioCaps are used to construct our test set, while all other data is used for the training set. |
| Hardware Specification | Yes | For our main experiments, we train a text-conditional Diffusion Transformer (DiT) (Levy et al., 2023; Peebles & Xie, 2023), which is optimized using 8 NVIDIA RTX 4090 GPUs for 500K steps. We train this bidirectional multimodal encoder for 100 epochs on 1 NVIDIA RTX 4090. |
| Software Dependencies | No | The paper mentions various software components such as 'T5 encoder', 'DPMSolver++', 'AdamW optimizer', 'Pyroomacoustics', 'gpuRIR', 'GPT-4 and GPT-4o', 'CLIP', 'Sentence Transformer', 'FAISS library', 'Mask-RCNN', 'CNN14 from PANNs', and 'BERT'. However, it generally refers to these tools by name or by citing their original papers without providing specific version numbers for the software libraries or frameworks themselves (e.g., PyTorch, TensorFlow, CUDA versions are not mentioned). |
| Experiment Setup | Yes | The base learning rate is set to 2e-5 with a batch size of 128 and audio length of 10s. It integrates a conditioning mechanism consisting of multiple configurations: a text-based prompt processed by a T5 transformer model (T5-base with a maximum length of 128), and azimuth state encoding with an output dimension of 768. The conditioning dimension is set at 768. The diffusion component utilizes a Diffusion Transformer (DiT) with settings that include 64 input/output channels, an embedding dimension of 1536, 24 layers, 24 attention heads, and both local and global conditioning dimensions of 768 and 1536, respectively. Notably, the transformer operates with projecting condition tokens and adheres to a continuous transformer architecture. For training, an exponential moving average (EMA) is used alongside an AdamW optimizer with a learning rate of 2e-5, beta values of [0.9, 0.999], and a weight decay of 1e-3, complemented by an inverse LR scheduler that features an inv_gamma of 1e6, a power of 0.5, and a high warmup proportion of 0.99. This configuration underscores our commitment to refining audio quality and temporal alignment in generative tasks. During inference, we use the DPMSolver++ (Lu et al., 2022) for 100 steps with classifier-free guidance (scale of 6.0). |
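For readers attempting reproduction, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. The key names below are illustrative assumptions (the paper does not publish its config schema); only the values are taken from the paper's description.

```python
# Hedged sketch: hyperparameters reported for the BEWO DiT training run,
# gathered into a plain Python dict. Key names are illustrative assumptions;
# the values come from the paper's Experiment Setup description.
dit_training_config = {
    "model": {
        "architecture": "DiT",        # text-conditional Diffusion Transformer
        "io_channels": 64,            # input/output channels
        "embed_dim": 1536,
        "num_layers": 24,
        "num_heads": 24,
        "local_cond_dim": 768,
        "global_cond_dim": 1536,
    },
    "conditioning": {
        "text_encoder": "T5-base",
        "max_text_length": 128,
        "azimuth_state_dim": 768,     # azimuth state encoding output dim
        "cond_dim": 768,
    },
    "optimizer": {                    # AdamW, as reported
        "lr": 2e-5,
        "betas": (0.9, 0.999),
        "weight_decay": 1e-3,
    },
    "lr_scheduler": {                 # inverse LR schedule, as reported
        "inv_gamma": 1e6,
        "power": 0.5,
        "warmup": 0.99,
    },
    "training": {
        "batch_size": 128,
        "audio_length_s": 10,
        "steps": 500_000,
        "ema": True,
        "hardware": "8x NVIDIA RTX 4090",
    },
    "inference": {
        "sampler": "DPMSolver++",
        "steps": 100,
        "cfg_scale": 6.0,             # classifier-free guidance scale
    },
}
```

A config dict like this makes it easy to diff a reimplementation against the reported settings before launching a long training run.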