OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup
Authors: Xize Cheng, Siqi Zheng, Zehan Wang, Minghui Fang, Ziang Zhang, Rongjie Huang, Shengpeng Ji, Jialong Zuo, Tao Jin, Zhou Zhao
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental evaluations on the MUSIC, VGGSOUND-CLEAN+, and MUSIC-CLEAN+ datasets demonstrate the effectiveness of OmniSep, achieving state-of-the-art performance in text-, image-, and audio-queried sound separation tasks. |
| Researcher Affiliation | Collaboration | 1Zhejiang University 2Alibaba Group |
| Pseudocode | No | The paper describes the methodology in prose and mathematical equations but does not include a distinct section or figure explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | The paper states that "the code and models will be released," but no code link or repository is provided. |
| Open Datasets | Yes | We conducted experiments on two datasets, VGGSOUND (Chen et al., 2020) and MUSIC (Zhao et al., 2018) |
| Dataset Splits | Yes | For comparison with previous studies, we conducted sound separation experiments following CLIPSEP. This entailed training our model on the MUSIC dataset and evaluating it on the same dataset. Additionally, we trained the model on the VGGSOUND dataset and evaluated it on the VGGSOUND-CLEAN+ and MUSIC-CLEAN+ datasets, which contain manually processed clean sound separation evaluation samples. |
| Hardware Specification | Yes | All experiments were conducted on a single A800 GPU. |
| Software Dependencies | No | The paper mentions software like 'Adam optimizer' and 'museval' for SDR computation but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | For all audio samples, we conducted experiments on samples of length 65535 (approximately 4 seconds) at a sampling rate of 16 kHz. For spectrum computation, we employed a short-time Fourier transform (STFT) with a filter length of 1024, a hop length of 256, and a window size of 1024. All images were resized to 224 × 224 pixels. The audio model in this paper is a widely used 7-layer U-Net network with k = 32, generating 32 intermediate masks. All models were trained with a batch size of 128, using the Adam optimizer with parameters β1 = 0.9, β2 = 0.999, and ϵ = 10⁻⁸, for 200,000 steps. Additionally, we employed warm-up and gradient clipping strategies, following Dong et al. (2022). |
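The reported STFT settings can be sketched in a few lines of PyTorch. This is an illustrative reconstruction from the numbers quoted above (filter/window length 1024, hop length 256, 16 kHz, clips of 65535 samples), not the authors' released code; the function and constant names are hypothetical. Note that 65535 samples with a hop of 256 yields exactly 256 centered STFT frames, which may explain the otherwise odd clip length.

```python
import torch

# Constants taken from the paper's reported setup; names are illustrative.
SAMPLE_RATE = 16_000   # 16 kHz
CLIP_LEN = 65_535      # ~4.1 s at 16 kHz
N_FFT = 1024           # STFT filter length
HOP_LENGTH = 256
WIN_LENGTH = 1024

def audio_to_spectrogram(wave: torch.Tensor) -> torch.Tensor:
    """Magnitude spectrogram under the reported STFT settings."""
    spec = torch.stft(
        wave,
        n_fft=N_FFT,
        hop_length=HOP_LENGTH,
        win_length=WIN_LENGTH,
        window=torch.hann_window(WIN_LENGTH),
        return_complex=True,
    )
    return spec.abs()

wave = torch.randn(CLIP_LEN)            # stand-in for a 4 s audio clip
spec = audio_to_spectrogram(wave)
print(spec.shape)                        # (513, 256): n_fft//2+1 bins x frames
```

The resulting 513 × 256 magnitude spectrogram is the kind of input a 7-layer U-Net mask predictor, as described in the setup, would consume.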