High Fidelity Neural Audio Compression

Authors: Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi

TMLR 2023

Reproducibility Variable Result LLM Response
Research Type Experimental We present an extensive subjective evaluation (MUSHRA tests) together with an ablation study for a range of bandwidths and audio domains, including speech, noisy-reverberant speech, and music. Our approach is superior to the baseline methods across all evaluated settings, considering both 24 kHz monophonic and 48 kHz stereophonic audio. Code and samples are available under github.com/facebookresearch/encodec.
Researcher Affiliation Industry Alexandre Défossez EMAIL Meta AI, FAIR Team, Paris, France Jade Copet EMAIL Meta AI, FAIR Team, Paris, France Gabriel Synnaeve EMAIL Meta AI, FAIR Team, Paris, France Yossi Adi EMAIL Meta AI, FAIR Team, Tel-Aviv, Israel
Pseudocode Yes Algorithm 1 Residual Vector Quantization (RVQ) algorithm
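The RVQ algorithm referenced above quantizes a latent vector in stages, with each codebook quantizing the residual left over by the previous stage. A minimal numpy sketch (function name and shapes are illustrative, not the paper's implementation):

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual Vector Quantization sketch.

    x: (batch, dim) latent vectors to quantize.
    codebooks: list of (num_entries, dim) arrays, one per RVQ stage.
    Returns the per-stage code indices and the quantized reconstruction.
    """
    residual = x.copy()
    quantized = np.zeros_like(x)
    codes = []
    for cb in codebooks:
        # Nearest codebook entry (squared Euclidean distance) per vector.
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        codes.append(idx)
        # Accumulate this stage's contribution; the next stage
        # quantizes whatever error remains.
        quantized += cb[idx]
        residual = x - quantized
    return codes, quantized
```

Each additional stage refines the approximation, which is what lets EnCodec trade bitrate for fidelity by varying the number of codebooks used.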
Open Source Code Yes Code and samples are available under github.com/facebookresearch/encodec.
Open Datasets Yes For speech, we use the clean speech segments from DNS Challenge 4 (Dubey et al., 2022) and the Common Voice dataset (Ardila et al., 2019). For general audio, we use AudioSet (Gemmeke et al., 2017) together with FSD50K (Fonseca et al., 2021). For music, we rely on the Jamendo dataset (Bogdanov et al., 2019) for training and evaluation, and we further evaluate our models on music using a proprietary music dataset.
Dataset Splits Yes For Common Voice, we randomly sample 99.5% of the dataset for train, 0.25% for valid and the rest for test splits. Similarly, we sample 98% of the clean segments from DNS Challenge 4 for train, 1% for valid and 1% for test. For Audio Set, we use the unbalanced train segments as training data and randomly selected half of the eval segments as validation set and the other half as test set. We follow the same procedure for FSD50K using the dev set for training and splitting the eval set between validation and test. Finally for the Jamendo dataset, we randomly take 96% of the artists and their corresponding tracks for train, 2% for valid and 2% for test, hence there is no artists overlap in the different sets.
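The Jamendo split described above is done at the artist level so no artist appears in more than one set. A small sketch of such a 96/2/2 artist-level split (function name and track representation are assumptions, not the authors' code):

```python
import random

def split_by_artist(tracks, seed=0):
    """Split tracks 96%/2%/2% by artist, so the three sets
    share no artists (mirrors the Jamendo split described above)."""
    artists = sorted({t["artist"] for t in tracks})
    rng = random.Random(seed)
    rng.shuffle(artists)
    n = len(artists)
    train_a = set(artists[: int(0.96 * n)])
    valid_a = set(artists[int(0.96 * n): int(0.98 * n)])
    split = {"train": [], "valid": [], "test": []}
    for t in tracks:
        if t["artist"] in train_a:
            split["train"].append(t)
        elif t["artist"] in valid_a:
            split["valid"].append(t)
        else:
            split["test"].append(t)
    return split
```

Splitting by artist rather than by track avoids leakage from stylistically near-identical recordings of the same artist across train and test.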
Hardware Specification Yes All the models are trained using 8 A100 GPUs. We profiled all models on a single thread of a MacBook Pro 2019 CPU at 6 kbps.
Software Dependencies No The paper does not explicitly mention specific software dependencies with version numbers. It mentions the use of Adam optimizer, LSTM, and Transformer models, but without version details.
Experiment Setup Yes We train all models for 300 epochs, with one epoch being 2,000 updates with the Adam optimizer with a batch size of 64 examples of 1 second each, a learning rate of 3 · 10⁻⁴, β1 = 0.5, and β2 = 0.9. All the models are trained using 8 A100 GPUs. We use the balancer introduced in Section 3.4 with weights λt = 0.1, λf = 1, λg = 3, λfeat = 3 for the 24 kHz models. For the 48 kHz model, we use instead λg = 4, λfeat = 4.
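The balancer weights λt, λf, λg, λfeat do not act as a plain weighted sum of losses; the paper's balancer rescales each loss's gradient so that its relative contribution matches its weight. A simplified single-step numpy sketch (the paper normalizes by an exponential moving average of gradient norms; here each norm is used directly, and the function name is an assumption):

```python
import numpy as np

def balance_gradients(grads, weights, eps=1e-8):
    """Combine per-loss gradients w.r.t. the model output.

    Each gradient is normalized to unit norm, then weighted by
    lambda_i / sum(lambda_j), so no single loss dominates by scale.
    Simplified sketch: the paper uses an EMA of the norms instead of
    the instantaneous norm used here.
    """
    total = sum(weights.values())
    out = np.zeros_like(next(iter(grads.values())))
    for name, g in grads.items():
        out += (weights[name] / total) * g / (np.linalg.norm(g) + eps)
    return out
```

With the 24 kHz weights, for example, weights = {"t": 0.1, "f": 1.0, "g": 3.0, "feat": 3.0}, so the adversarial and feature losses each drive roughly 42% of the combined gradient regardless of their raw magnitudes.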