RFWave: Multi-band Rectified Flow for Audio Waveform Reconstruction

Authors: Peng Liu, Dongyang Dai, Zhiyong Wu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical evaluations show that RFWave not only provides outstanding reconstruction quality but also offers vastly superior computational efficiency, enabling audio generation at speeds up to 160 times faster than real-time on a GPU. Both an online demonstration and the source code are accessible. We evaluate RFWave using both Mel-spectrograms and discrete EnCodec tokens as inputs. For Mel-spectrogram inputs, we first benchmark RFWave against existing diffusion vocoders to demonstrate its superiority. We then compare it with widely used GAN models to highlight its practical applicability and advantages. For discrete EnCodec tokens, we evaluate RFWave's efficiency in reconstructing high-quality audio from compressed representations across diverse domains. Finally, we conduct ablation studies and further analysis to examine the effects of the individual components of RFWave.
Researcher Affiliation | Collaboration | Peng Liu (TEX AI, Transsion); Dongyang Dai (Individual Researcher); Zhiyong Wu (Shenzhen International Graduate School, Tsinghua University)
Pseudocode | Yes | Sampling algorithms for the two distinct approaches, one in the time domain and the other in the frequency domain, are provided in Appendix Section A.9.1. The implementation details are provided in Appendix Section A.5, while the algorithm is presented in Appendix Section A.9.2. A.9.1 Sampling Algorithm: Algorithm 1, Simplified Sampling Algorithm (Xt in Time Domain); Detailed Algorithm 1, Sample Time Domain; Algorithm 2, Simplified Sampling Algorithm (Xt in Frequency Domain); Detailed Algorithm 2, Sample Frequency Domain. A.9.2 Selecting Time Points of Equal Straightness: Detailed Algorithm 3, Select Time Points.
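The sampling algorithms referenced above integrate a rectified-flow ODE from noise toward the waveform. A minimal Euler-integration sketch of that idea is below; the scalar stand-in velocity field is purely illustrative (the paper's model predicts velocities for multi-band STFT frames, conditioned on Mel-spectrograms or EnCodec tokens), so this is an assumption-laden sketch of the technique, not the authors' implementation.

```python
import random

def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data)
    with fixed-step Euler, the basic rectified-flow sampling loop."""
    x, dt = x0, 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)  # one Euler step along the flow
    return x

# Toy check: rectified flow interpolates linearly, x_t = (1-t)*x0 + t*x1,
# so the ideal velocity is the constant x1 - x0 and Euler integration
# recovers the target regardless of step count.
target = 0.75
x_start = random.gauss(0.0, 1.0)
result = euler_sample(lambda x, t: target - x_start, x_start, num_steps=4)
```

With a learned (non-straight) velocity field, fewer steps trade quality for speed, which is why the paper's appendix also selects time points of equal straightness.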
Open Source Code | Yes | Both an online demonstration and the source code are accessible. Demo: https://rfwave-demo.github.io/rfwave; Code: https://github.com/bfs18/rfwave
Open Datasets | Yes | For Mel-spectrogram inputs, we conduct two evaluations. When benchmarking against diffusion vocoders, we train separate models on LibriTTS (Zen et al., 2019) (speech), MTG-Jamendo (Bogdanov et al., 2019) (music), and Opencpop (Wang et al., 2022) (vocal) datasets and test each model on its respective dataset to ensure comprehensive comparison across various audio categories. When comparing RFWave to widely used GAN-based models, we train a model on LibriTTS and evaluate its in-domain performance on the LibriTTS test set. Additionally, we assess the out-of-domain generalization ability of this LibriTTS-trained model by testing it on the MUSDB18 (Rafii et al., 2017) test subset. For discrete EnCodec token inputs, we follow convention by training a universal model on a large-scale dataset. This dataset combines Common Voice 7.0 (Ardila et al., 2019) and clean data from DNS Challenge 4 (Dubey et al., 2022) for speech, MTG-Jamendo (Bogdanov et al., 2019) for music, and FSD50K (Fonseca et al., 2021) and AudioSet (Gemmeke et al., 2017) for environmental sounds.
Dataset Splits | Yes | When benchmarking against diffusion vocoders, we train separate models on LibriTTS (Zen et al., 2019) (speech), MTG-Jamendo (Bogdanov et al., 2019) (music), and Opencpop (Wang et al., 2022) (vocal) datasets and test each model on its respective dataset to ensure comprehensive comparison across various audio categories. When comparing RFWave to widely used GAN-based models, we train a model on LibriTTS and evaluate its in-domain performance on the LibriTTS test set. Additionally, we assess the out-of-domain generalization ability of this LibriTTS-trained model by testing it on the MUSDB18 (Rafii et al., 2017) test subset. For discrete EnCodec token inputs, we follow convention by training a universal model on a large-scale dataset... we constructed a unified evaluation dataset comprising 900 test audio samples from 15 external datasets, covering speech, vocals, and sound effects. Detailed information about this test set is provided in Table A.4.
Hardware Specification | Yes | RFWave achieves speeds up to 160 times faster than real-time on an NVIDIA GeForce RTX 4090 GPU. We perform inference speed benchmark tests using an NVIDIA GeForce RTX 4090 GPU. The bulk of the experiments were carried out on personal computers and GPU servers, which were sourced from cloud service providers. Primarily, we utilized NVIDIA RTX 4090 (24 GB) and NVIDIA A100 (80 GB) GPUs for these tasks.
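The "160 times faster than real-time" figure is a real-time factor: audio duration divided by wall-clock synthesis time. A trivial sketch of that computation (the timing numbers here are illustrative, not measurements from the paper):

```python
def real_time_factor(audio_seconds, wall_clock_seconds):
    """RTF > 1 means the vocoder generates audio faster than real time."""
    return audio_seconds / wall_clock_seconds

# Illustrative: synthesizing 10 s of audio in 0.0625 s gives RTF = 160,
# matching the order of speedup the paper reports on an RTX 4090.
rtf = real_time_factor(10.0, 0.0625)
```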
Software Dependencies | No | The implementation was done in PyTorch (Paszke et al., 2019), and no specific hardware optimizations were applied.
Experiment Setup | Yes | Audio samples are randomly cropped to lengths of 32512 and 65024 samples for 22.05/24 kHz and 44.1 kHz waveforms, respectively. This is equivalent to a crop window of 128 frames for both sampling rates. We use a batch size of 64. The model optimization is performed using the AdamW optimizer with a starting learning rate of 2e-4 and beta parameters of (0.9, 0.999). A cosine annealing schedule is applied to reduce the learning rate to a minimum of 2e-6.
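The schedule described above (2e-4 annealed down to 2e-6) matches the standard cosine annealing formula used by PyTorch's CosineAnnealingLR. A dependency-free sketch of that formula follows; the total step count is an assumption for illustration, as the paper excerpt does not state the schedule length.

```python
import math

def cosine_annealed_lr(step, total_steps, lr_max=2e-4, lr_min=2e-6):
    """Cosine annealing: decay lr_max to lr_min over total_steps along
    a half-cosine, as in torch.optim.lr_scheduler.CosineAnnealingLR."""
    cos = math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cos)

# Endpoints of the schedule: starts at the paper's 2e-4 and reaches the
# 2e-6 minimum at the final step (total_steps = 1_000_000 is hypothetical).
start_lr = cosine_annealed_lr(0, 1_000_000)
end_lr = cosine_annealed_lr(1_000_000, 1_000_000)
```

In practice this would be paired with `torch.optim.AdamW(params, lr=2e-4, betas=(0.9, 0.999))`, matching the optimizer settings quoted above.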