FlowDec: A flow-based full-band general audio codec with high perceptual quality

Authors: Simon Welker, Matthew Le, Ricky T. Q. Chen, Wei-Ning Hsu, Timo Gerkmann, Alexander Richard, Yi-Chiao Wu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that FlowDec is a competitive alternative to the recent GAN-dominated stream of neural codecs, achieving FAD scores better than those of the established GAN-based codec DAC and listening test scores that are on par, and producing qualitatively more natural reconstructions for speech and harmonic structures in music. ... We conduct ablation studies on our proposed components. ... high-fidelity perceptual quality competitive with a GAN-based state-of-the-art codec (Kumar et al., 2024), which we confirm with objective metrics and listening tests. ... Section 4 EXPERIMENTAL SETUP ... Section 5 RESULTS
Researcher Affiliation | Collaboration | Simon Welker, Matthew Le, Ricky T. Q. Chen, Wei-Ning Hsu, Timo Gerkmann, Alexander Richard, Yi-Chiao Wu; Signal Processing, University of Hamburg, 22527 Hamburg, Germany; FAIR / Codec Avatar Labs, Meta, 10001 New York / 15222 Pittsburgh, USA
Pseudocode | No | The paper includes mathematical formulations, equations, and descriptions of methods, but it does not present any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it feature structured steps formatted like code within a dedicated section.
Open Source Code | Yes | To further ensure reproducibility, we have open-sourced our code for FlowDec training and inference, along with pretrained model checkpoints of the FlowDec models listed in this paper, made available at https://github.com/facebookresearch/FlowDec. A demo page is available at https://sp-uhh.github.io/FlowDec/.
Open Datasets | Yes | For underlying codec training, we prepare a varied combination of datasets containing music, speech, and sounds, which are listed in Table 1. ... MSP-Podcast (Lotfian & Busso, 2019) ... Common Voice 13.0* (Ardila et al., 2020) ... LibriTTS (Zen et al., 2019) ... EARS (Richter et al., 2023) ... VCTK 84spk (Valentini-Botinhao, 2017) ... LibriVox (Kearns, 2014) ... Expresso (Nguyen et al., 2023) ... WavCaps-FreeSound* (Mei et al., 2024) ... MUSDB18-HQ (Rafii et al., 2019) ... AudioSet (Gemmeke et al., 2017)
Dataset Splits | Yes | As our test set, we use 3,000 random audio samples with 1,000 of each audio type: 500 files from the VCTK test set (Valentini-Botinhao, 2017) and 500 from the EARS test set (Richter et al., 2024b) for speech, 500 files from MUSDB18-HQ (Rafii et al., 2019) and 500 from MusicCaps (Agostinelli et al., 2023) for music, and 1,000 files from AudioSet (Gemmeke et al., 2017) for sound. ... We crop audios to a 10-second duration. ... For postfilter training, ... we randomly sample 100,000 clean files x per audio type and crop out segments with a maximum 30-second duration
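The fixed-duration cropping quoted above is a standard preprocessing step; a minimal sketch of a random 10-second crop helper (a hypothetical illustration, not the authors' code — the function name, sample rate, and mono-waveform layout are assumptions):

```python
import numpy as np

def random_crop(audio, sample_rate, duration_s=10.0, rng=None):
    """Crop a mono waveform to at most `duration_s` seconds at a random offset."""
    if rng is None:
        rng = np.random.default_rng()
    target_len = int(duration_s * sample_rate)
    if len(audio) <= target_len:
        return audio  # shorter files are kept whole
    start = rng.integers(0, len(audio) - target_len + 1)
    return audio[start:start + target_len]

# Example: crop a 30-second 48 kHz waveform down to 10 seconds
sr = 48_000
wav = np.zeros(30 * sr, dtype=np.float32)
crop = random_crop(wav, sr)
print(crop.shape)  # (480000,)
```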
Hardware Specification | Yes | We determine the real-time factor (RTF) of the two NDAC variants and the FlowDec postfilter at NFE ∈ {4, 6, 8} with the midpoint solver on an NVIDIA A100-SXM4-80GB GPU.
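The real-time factor quoted above is wall-clock processing time divided by audio duration (RTF < 1 means faster than real time). A minimal timing sketch — the `decode` callable is a stand-in for a codec's inference call, not the FlowDec API:

```python
import time

def real_time_factor(decode, audio_duration_s, n_warmup=2, n_runs=5):
    """RTF = mean wall-clock processing time / audio duration."""
    for _ in range(n_warmup):  # warm-up runs (e.g. lazy initialization, GPU kernels)
        decode()
    t0 = time.perf_counter()
    for _ in range(n_runs):
        decode()
    elapsed = (time.perf_counter() - t0) / n_runs
    return elapsed / audio_duration_s

# Example with a dummy decoder that takes ~10 ms to "process" 1 s of audio
rtf = real_time_factor(lambda: time.sleep(0.01), audio_duration_s=1.0)
print(f"RTF = {rtf:.3f}")  # roughly 0.01
```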
Software Dependencies | No | The paper mentions specific software components such as the nnAudio Python package and the Adam optimizer. However, it does not provide version numbers for any of these dependencies, which are required for reproducibility.
Experiment Setup | Yes | For our underlying codecs, ... We train for 800,000 iterations with 0.4-second snippets and a batch size of 72. ... For the CQT loss ... we use a loss weight of 1 for music samples and 0 for audio and speech samples. For the L1 waveform loss, we use a weight of 50. ... We train all postfilters ... We use Adam (Kingma, 2014) at a learning rate of 10^-4 for 800,000 iterations, a 2-second snippet duration, and a batch size of 64. We track an exponential moving average (EMA) of the weights with decay 0.999 for inference. For the global variants, we set σy = 0.66. For the frequency-dependent variants, we estimate 768-point frequency curves σy(f) and smooth them with a Gaussian kernel of bandwidth 3. We use the midpoint solver with 3 steps (NFE = 6) unless otherwise noted
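The midpoint solver in the setup above evaluates the velocity field twice per step, which is why 3 steps correspond to NFE = 6. A minimal sketch of the explicit midpoint method with NFE counting — the constant-coefficient field used in the example is illustrative only, not the trained flow model:

```python
import numpy as np

def midpoint_solve(v, x0, t0=0.0, t1=1.0, n_steps=3):
    """Integrate dx/dt = v(x, t) from t0 to t1 with the explicit midpoint method.

    Each step evaluates v twice, so the number of function evaluations (NFE)
    is 2 * n_steps, e.g. 3 steps -> NFE = 6.
    """
    x, t = np.asarray(x0, dtype=float), t0
    h = (t1 - t0) / n_steps
    nfe = 0
    for _ in range(n_steps):
        k1 = v(x, t); nfe += 1                            # slope at the step start
        k2 = v(x + 0.5 * h * k1, t + 0.5 * h); nfe += 1   # slope at the midpoint
        x = x + h * k2
        t += h
    return x, nfe

# Sanity check on the linear field v(x, t) = -x, whose exact solution is
# x(1) = x0 * exp(-1) ≈ 0.368; three midpoint steps give ≈ 0.377.
x1, nfe = midpoint_solve(lambda x, t: -x, x0=1.0)
print(x1, nfe)
```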