ArrayDPS: Unsupervised Blind Speech Separation with a Diffusion Prior

Authors: Zhongweiyang Xu, Xulin Fan, Zhong-Qiu Wang, Xilin Jiang, Romit Roy Choudhury

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluation results show that ArrayDPS outperforms all baseline unsupervised methods while being comparable to supervised methods in terms of SDR. Audio demos and codes are provided at: https://arraydps.github.io/ArrayDPSDemo/ and https://github.com/ArrayDPS/ArrayDPS. ... Extensive evaluation shows that ArrayDPS can achieve similar performance against recent supervised methods evaluated on ad-hoc microphone arrays, and performs the best among all unsupervised blind speech separation algorithms. ... 4. Experiments and Evaluation
Researcher Affiliation | Academia | 1Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Champaign, USA; 2Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, China; 3Columbia University, NYC, USA.
Pseudocode | Yes | Algorithm 1 ArrayDPS. Require: {N, σ_{i∈{0,...,N}}, γ_{i∈{0,...,N−1}}, S_noise} ... Algorithm 2 Posterior Score Approximation. Require: {D_θ, σ_{i∈{0,...,N}}, N_ref, N_fg, ξ_1(τ), ξ_2(τ), λ}
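The quoted "Posterior Score Approximation" builds on the standard diffusion-posterior-sampling (DPS) recipe: an unconditional score from the denoiser plus a likelihood gradient that pulls the sample toward the observed mixture. The sketch below is a generic DPS step, not the paper's exact Algorithm 2; the linear mixing operator `A`, the guidance weight `zeta`, and the identity-Jacobian shortcut are all simplifying assumptions for illustration.

```python
import numpy as np

def dps_posterior_score(denoise, x_t, sigma, y, A, zeta=2.0):
    """Generic DPS score approximation (sketch, not the paper's Algorithm 2).

    For an assumed linear mixing model y ~ A @ x0, the likelihood gradient of
    0.5 * ||y - A x0_hat||^2 w.r.t. x_t is computed under the common DPS
    shortcut d(x0_hat)/d(x_t) ~ I.
    """
    x0_hat = denoise(x_t, sigma)          # Tweedie estimate of the clean signal
    score = (x0_hat - x_t) / sigma**2     # unconditional score from the denoiser
    residual = y - A @ x0_hat             # observation mismatch
    grad = -(A.T @ residual)              # gradient of the data-fit term (Jacobian ~ I)
    return score - zeta * grad            # guided (posterior) score
```

In the paper's setting the unconditional prior is a speech diffusion model and the observation model involves the multichannel array mixing; here both are stand-ins (`denoise`, `A`).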
Open Source Code | Yes | Audio demos and codes are provided at: https://arraydps.github.io/ArrayDPSDemo/ and https://github.com/ArrayDPS/ArrayDPS. ... We have open sourced ArrayDPS in https://github.com/ArrayDPS/ArrayDPS.
Open Datasets | Yes | We train this unconditional speech diffusion model on a clean subset of speech corpus LibriTTS (Zen et al., 2019). ... For evaluation, we use the SMS-WSJ (Drude et al., 2019b) dataset for fixed microphone array evaluation and use the Spatialized WSJ0-2Mix dataset (Wang et al., 2018) for ad-hoc microphone array evaluation.
Dataset Splits | Yes | The dataset consists of 33,561 (~87.4 h), 982 (~2.5 h), and 1,332 (~3.4 h) train, validation, and test mixtures, respectively, all at 8 kHz sampling rate. ... In general, the Spatialized WSJ0-2Mix dataset contains 20,000 (~30 h), 5,000 (~10 h), and 3,000 (~5 h) utterances in training, validation, and testing, respectively.
Hardware Specification | Yes | These models are all trained on a single A100 GPU and converge in about 5-6 days.
Software Dependencies | Yes | For the diffusion denoising architecture, we use the waveform-domain U-Net as MSDM (Mariani et al., 2024), implemented in audio-diffusion-pytorch (v0.0.43). ... We use the open-source torchiva toolkit (Scheibler & Saijo, 2022)
Experiment Setup | Yes | For the default configuration as in row 2a (ArrayDPS-A) in Table 1, we set N = 400 and S_noise = 1 as in Algorithm 1, σ_0 = τ_max = 0.8, σ_N = τ_min = 1e-6, ρ = 10 as in Eq. 55, S_min = 0, S_max = 50, and S_churn = 30 as in Eq. 56, ξ = 2, N_ref = 200, N_fg = 100, and λ = 1.3 as in Algorithm 2. ... We train on speech samples with 65,536 samples (~8.2 s) with batch size 16 and learning rate 0.0001. The learning rate is multiplied by 0.8 every 60,000 training steps. ... We train the model for 840,000 training steps.
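The sampler hyperparameters quoted above (σ_0 = 0.8, σ_N = 1e-6, ρ = 10, S_min/S_max/S_churn) match the shape of the EDM sampler of Karras et al. (2022); the sketch below reproduces that standard noise schedule and churn factor with the paper's reported values. Treating Eq. 55 and Eq. 56 as the EDM forms is an assumption, since the review only lists the constants.

```python
import numpy as np

def karras_sigma_schedule(n=400, sigma_max=0.8, sigma_min=1e-6, rho=10.0):
    """EDM-style (Karras et al., 2022) noise schedule; assumed form of Eq. 55.

    Interpolates between sigma_max and sigma_min in rho-warped space,
    giving N+1 noise levels sigma_0 > ... > sigma_N.
    """
    t = np.arange(n + 1) / n
    return (sigma_max ** (1 / rho)
            + t * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho

def churn_gamma(sigma, n=400, s_churn=30.0, s_min=0.0, s_max=50.0):
    """Stochastic 'churn' factor of the EDM sampler; assumed form of Eq. 56.

    Noise is re-injected only for sigma inside [s_min, s_max].
    """
    if s_min <= sigma <= s_max:
        return min(s_churn / n, np.sqrt(2.0) - 1.0)
    return 0.0
```

With N = 400 and S_churn = 30, the churn factor is 30/400 = 0.075 at every active step, well below the sqrt(2) - 1 cap.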