High Fidelity Neural Audio Compression

Authors: Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi

TMLR 2023

Reproducibility Variable Result LLM Response
Research Type Experimental We present an extensive subjective evaluation (MUSHRA tests) together with an ablation study for a range of bandwidths and audio domains, including speech, noisy-reverberant speech, and music. Our approach is superior to the baseline methods across all evaluated settings, considering both 24 kHz monophonic and 48 kHz stereophonic audio. Code and samples are available under github.com/facebookresearch/encodec.
Researcher Affiliation Industry Alexandre Défossez EMAIL Meta AI, FAIR Team, Paris, France Jade Copet EMAIL Meta AI, FAIR Team, Paris, France Gabriel Synnaeve EMAIL Meta AI, FAIR Team, Paris, France Yossi Adi EMAIL Meta AI, FAIR Team, Tel-Aviv, Israel
Pseudocode Yes Algorithm 1 Residual Vector Quantization (RVQ) algorithm
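The RVQ algorithm referenced above quantizes a latent vector in stages, with each codebook quantizing the residual left over by the previous stage. A minimal numpy sketch (function name and shapes are illustrative, not the paper's implementation):

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual Vector Quantization sketch.

    x: (batch, dim) latent vectors to quantize.
    codebooks: list of (num_entries, dim) arrays, one per RVQ stage.
    Returns the per-stage code indices and the quantized reconstruction.
    """
    residual = x.copy()
    quantized = np.zeros_like(x)
    codes = []
    for cb in codebooks:
        # Nearest codebook entry (squared Euclidean distance) per vector.
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        codes.append(idx)
        # Accumulate this stage's contribution; the next stage
        # quantizes whatever error remains.
        quantized += cb[idx]
        residual = x - quantized
    return codes, quantized
```

Each additional stage refines the approximation, which is what lets EnCodec trade bitrate for fidelity by varying the number of codebooks used.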
Open Source Code Yes Code and samples are available under github.com/facebookresearch/encodec.
Open Datasets Yes For speech, we use the clean speech segments from DNS Challenge 4 (Dubey et al., 2022) and the Common Voice dataset (Ardila et al., 2019). For general audio, we use AudioSet (Gemmeke et al., 2017) together with FSD50K (Fonseca et al., 2021). For music, we rely on the Jamendo dataset (Bogdanov et al., 2019) for training and evaluation, and we further evaluate our models on music using a proprietary music dataset.
Dataset Splits Yes For Common Voice, we randomly sample 99.5% of the dataset for train, 0.25% for valid and the rest for test splits. Similarly, we sample 98% of the clean segments from DNS Challenge 4 for train, 1% for valid and 1% for test. For Audio Set, we use the unbalanced train segments as training data and randomly selected half of the eval segments as validation set and the other half as test set. We follow the same procedure for FSD50K using the dev set for training and splitting the eval set between validation and test. Finally for the Jamendo dataset, we randomly take 96% of the artists and their corresponding tracks for train, 2% for valid and 2% for test, hence there is no artists overlap in the different sets.
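The Jamendo split described above is done at the artist level so no artist appears in more than one set. A small sketch of such a 96/2/2 artist-level split (function name and track representation are assumptions, not the authors' code):

```python
import random

def split_by_artist(tracks, seed=0):
    """Split tracks 96%/2%/2% by artist, so the three sets
    share no artists (mirrors the Jamendo split described above)."""
    artists = sorted({t["artist"] for t in tracks})
    rng = random.Random(seed)
    rng.shuffle(artists)
    n = len(artists)
    train_a = set(artists[: int(0.96 * n)])
    valid_a = set(artists[int(0.96 * n): int(0.98 * n)])
    split = {"train": [], "valid": [], "test": []}
    for t in tracks:
        if t["artist"] in train_a:
            split["train"].append(t)
        elif t["artist"] in valid_a:
            split["valid"].append(t)
        else:
            split["test"].append(t)
    return split
```

Splitting by artist rather than by track avoids leakage from stylistically near-identical recordings of the same artist across train and test.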
Hardware Specification Yes All the models are trained using 8 A100 GPUs. We profiled all models on a single thread of a MacBook Pro 2019 CPU at 6 kbps.
Software Dependencies No The paper does not explicitly mention specific software dependencies with version numbers. It mentions the use of Adam optimizer, LSTM, and Transformer models, but without version details.
Experiment Setup Yes We train all models for 300 epochs, with one epoch being 2,000 updates with the Adam optimizer with a batch size of 64 examples of 1 second each, a learning rate of 3 · 10⁻⁴, β1 = 0.5, and β2 = 0.9. All the models are trained using 8 A100 GPUs. We use the balancer introduced in Section 3.4 with weights λt = 0.1, λf = 1, λg = 3, λfeat = 3 for the 24 kHz models. For the 48 kHz model, we use instead λg = 4, λfeat = 4.
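The balancer weights λt, λf, λg, λfeat do not act as a plain weighted sum of losses; the paper's balancer rescales each loss's gradient so that its relative contribution matches its weight. A simplified single-step numpy sketch (the paper normalizes by an exponential moving average of gradient norms; here each norm is used directly, and the function name is an assumption):

```python
import numpy as np

def balance_gradients(grads, weights, eps=1e-8):
    """Combine per-loss gradients w.r.t. the model output.

    Each gradient is normalized to unit norm, then weighted by
    lambda_i / sum(lambda_j), so no single loss dominates by scale.
    Simplified sketch: the paper uses an EMA of the norms instead of
    the instantaneous norm used here.
    """
    total = sum(weights.values())
    out = np.zeros_like(next(iter(grads.values())))
    for name, g in grads.items():
        out += (weights[name] / total) * g / (np.linalg.norm(g) + eps)
    return out
```

With the 24 kHz weights, for example, weights = {"t": 0.1, "f": 1.0, "g": 3.0, "feat": 3.0}, so the adversarial and feature losses each drive roughly 42% of the combined gradient regardless of their raw magnitudes.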