OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup
Authors: Xize Cheng, Siqi Zheng, Zehan Wang, Minghui Fang, Ziang Zhang, Rongjie Huang, Shengpeng Ji, Jialong Zuo, Tao Jin, Zhou Zhao
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental evaluations on the MUSIC, VGGSOUND-CLEAN+, and MUSIC-CLEAN+ datasets demonstrate the effectiveness of OmniSep, achieving state-of-the-art performance in text-, image-, and audio-queried sound separation tasks. |
| Researcher Affiliation | Collaboration | 1Zhejiang University 2Alibaba Group |
| Pseudocode | No | The paper describes the methodology in prose and mathematical equations but does not include a distinct section or figure explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | The paper states that "the code and models will be released," but no code link or repository is provided. |
| Open Datasets | Yes | We conducted experiments on two datasets, VGGSOUND (Chen et al., 2020) and MUSIC (Zhao et al., 2018) |
| Dataset Splits | Yes | For comparison with previous studies, we conducted sound separation experiments following CLIPSEP. This entailed training our model on the MUSIC dataset and evaluating it on the same dataset. Additionally, we trained the model on the VGGSOUND dataset and evaluated it on the VGGSOUND-CLEAN+ and MUSIC-CLEAN+ datasets, which contain manually processed clean sound separation evaluation samples. |
| Hardware Specification | Yes | All experiments were conducted on a single A800 GPU. |
| Software Dependencies | No | The paper mentions software like 'Adam optimizer' and 'museval' for SDR computation but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | For all audio samples, we conducted experiments on samples of length 65535 (approximately 4 seconds) at a sampling rate of 16 kHz. For spectrum computation, we employed a short-time Fourier transform (STFT) with a filter length of 1024, a hop length of 256, and a window size of 1024. All images were resized to 224 × 224 pixels. The audio model in this paper is a widely used 7-layer U-Net network with k = 32, generating 32 intermediate masks. All models were trained with a batch size of 128, using the Adam optimizer with parameters β1 = 0.9, β2 = 0.999, and ϵ = 10⁻⁸, for 200,000 steps. Additionally, we employed warm-up and gradient clipping strategies, following Dong et al. (2022). |
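The reported STFT settings can be sketched in a few lines of PyTorch. This is an illustrative reconstruction from the numbers quoted above (filter/window length 1024, hop length 256, 16 kHz, clips of 65535 samples), not the authors' released code; the function and constant names are hypothetical. Note that 65535 samples with a hop of 256 yields exactly 256 centered STFT frames, which may explain the otherwise odd clip length.

```python
import torch

# Constants taken from the paper's reported setup; names are illustrative.
SAMPLE_RATE = 16_000   # 16 kHz
CLIP_LEN = 65_535      # ~4.1 s at 16 kHz
N_FFT = 1024           # STFT filter length
HOP_LENGTH = 256
WIN_LENGTH = 1024

def audio_to_spectrogram(wave: torch.Tensor) -> torch.Tensor:
    """Magnitude spectrogram under the reported STFT settings."""
    spec = torch.stft(
        wave,
        n_fft=N_FFT,
        hop_length=HOP_LENGTH,
        win_length=WIN_LENGTH,
        window=torch.hann_window(WIN_LENGTH),
        return_complex=True,
    )
    return spec.abs()

wave = torch.randn(CLIP_LEN)            # stand-in for a 4 s audio clip
spec = audio_to_spectrogram(wave)
print(spec.shape)                        # (513, 256): n_fft//2+1 bins x frames
```

The resulting 513 × 256 magnitude spectrogram is the kind of input a 7-layer U-Net mask predictor, as described in the setup, would consume.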