ArrayDPS: Unsupervised Blind Speech Separation with a Diffusion Prior
Authors: Zhongweiyang Xu, Xulin Fan, Zhong-Qiu Wang, Xilin Jiang, Romit Roy Choudhury
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluation results show that ArrayDPS outperforms all baseline unsupervised methods while being comparable to supervised methods in terms of SDR. Audio demos and codes are provided at: https://arraydps.github.io/ArrayDPSDemo/ and https://github.com/ArrayDPS/ArrayDPS. ... Extensive evaluation shows that ArrayDPS can achieve similar performance against recent supervised methods evaluated on ad-hoc microphone arrays, and performs the best among all unsupervised blind speech separation algorithms. ... 4. Experiments and Evaluation |
| Researcher Affiliation | Academia | 1Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Champaign, USA 2Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, China 3Columbia University, NYC, USA. |
| Pseudocode | Yes | Algorithm 1 ArrayDPS Require: {N, {σ_i}_{i=0}^{N}, {γ_i}_{i=0}^{N-1}, S_noise} ... Algorithm 2 Posterior Score Approximation Require: {D_θ, {σ_i}_{i=0}^{N}, N_ref, N_fg, ξ_1(τ), ξ_2(τ), λ} |
| Open Source Code | Yes | Audio demos and codes are provided at: https://arraydps.github.io/ArrayDPSDemo/ and https://github.com/ArrayDPS/ArrayDPS. ... We have open sourced ArrayDPS in https://github.com/ArrayDPS/ArrayDPS. |
| Open Datasets | Yes | We train this unconditional speech diffusion model on a clean subset of the speech corpus LibriTTS (Zen et al., 2019). ... For evaluation, we use the SMS-WSJ (Drude et al., 2019b) dataset for fixed microphone array evaluation and the Spatialized WSJ0-2Mix dataset (Wang et al., 2018) for ad-hoc microphone array evaluation. |
| Dataset Splits | Yes | The dataset consists of 33,561 (~87.4 h), 982 (~2.5 h), and 1,332 (~3.4 h) train, validation, and test mixtures, respectively, all at 8 kHz sampling rate. ... In general, the Spatialized WSJ0-2Mix dataset contains 20,000 (~30 h), 5,000 (~10 h), and 3,000 (~5 h) utterances in training, validation, and testing, respectively. |
| Hardware Specification | Yes | These models are all trained on a single A100 GPU and converge in about 5-6 days. |
| Software Dependencies | Yes | For the diffusion denoising architecture, we use the waveform domain U-Net as MSDM (Mariani et al., 2024), implemented in audio-diffusion-pytorch/v0.0.4322. ... We use the open-source torchiva toolkit (Scheibler & Saijo, 2022) |
| Experiment Setup | Yes | For the default configuration as in row 2a (ArrayDPS-A) in Table 1, we set N = 400 and S_noise = 1 as in Algorithm 1, σ_0 = τ_max = 0.8, σ_N = τ_min = 1e-6, ρ = 10 as in Eq. 55, S_min = 0, S_max = 50, and S_churn = 30 as in Eq. 56, ξ = 2, N_ref = 200, N_fg = 100, and λ = 1.3 as in Algorithm 2. ... We train on speech samples of 65,536 samples (~8.2 s) with batch size 16 and learning rate 0.0001. The learning rate is multiplied by 0.8 every 60,000 training steps. ... We train the model for 840,000 training steps. |
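The experiment-setup row quotes a diffusion sampling schedule parameterized by σ_0 = τ_max = 0.8, σ_N = τ_min = 1e-6, ρ = 10, and N = 400. The exact form of the paper's Eq. 55 is not reproduced here; the sketch below assumes it follows the common EDM-style (Karras) ρ-warped interpolation between σ_max and σ_min, which those symbols conventionally denote. The function name and the assumption about Eq. 55 are illustrative, not taken from the paper.

```python
import numpy as np

def sigma_schedule(n_steps=400, sigma_max=0.8, sigma_min=1e-6, rho=10.0):
    """Hypothetical EDM-style noise schedule matching the quoted
    hyperparameters (N = 400, sigma_0 = 0.8, sigma_N = 1e-6, rho = 10).

    Returns N + 1 noise levels, decreasing from sigma_max to sigma_min.
    """
    i = np.arange(n_steps + 1)
    inv_rho = 1.0 / rho
    # rho-warped linear interpolation between sigma_max and sigma_min
    sigmas = (sigma_max**inv_rho
              + (i / n_steps) * (sigma_min**inv_rho - sigma_max**inv_rho)) ** rho
    return sigmas

sigmas = sigma_schedule()
# sigmas[0] equals tau_max (0.8), sigmas[-1] equals tau_min (1e-6),
# and the sequence decreases monotonically across the 401 levels.
```

A large ρ concentrates sampling steps near the small-σ end of the schedule, which is why σ values shrink slowly at first and then rapidly toward τ_min.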