Scaling Transformers for Low-Bitrate High-Quality Speech Coding

Authors: Julian Parker, Anton Smirnov, Jordi Pons, CJ Carr, Zack Zukowski, Zach Evans, Xubo Liu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The trained models strongly outperform existing baselines in both objective and subjective tests. We demonstrate how these contributions enable the training of a waveform codec model that achieves high compression for speech, with ultra-low bitrates of 400 bps and 700 bps, while still preserving good audio quality. We evaluate two variations of our model, with different post-hoc configurations of the FSQ bottleneck. Results of the evaluation with the proposed objective metrics are given in Tab. 2. The results of the MUSHRA subjective test, shown in Fig. 2, indicate that TAAE obtains state-of-the-art results, outperforming recently published speech codecs by a significant margin. Importantly, the proposed model obtains results that are close to the ground truth. Comparing these evaluation results with the baseline model sizes shown in Tab. 12 indicates the potential of scaling transformer-based codec architectures to achieve new benchmarks in terms of speech quality and compression.
Researcher Affiliation | Industry | Julian D. Parker, Anton Smirnov, Jordi Pons, CJ Carr, Zack Zukowski, Zach Evans, Xubo Liu (Stability AI)
Pseudocode | No | The paper describes the architecture and methods in detail using natural language and diagrams (e.g., Fig. 1), but it does not contain any clearly labeled pseudocode or algorithm blocks. It explains the processes and mathematical formulations in paragraph form.
Open Source Code | Yes | Code and models will be released at: github.com/Stability-AI/stable-codec.
Open Datasets | Yes | For training speech codec models, we use the Librilight dataset (60k hours) and the English portion of the Multilingual LibriSpeech (MLS) dataset (45k hours). Both datasets contain 16 kHz original speech data, amounting to a total of approximately 105k hours of training data. For evaluation, we utilize the test-clean subset of LibriSpeech for speech data, selecting audio clips with durations ranging from 5 to 10 seconds to create a test set of 900 clean speech samples at 16 kHz.
Dataset Splits | Yes | For training speech codec models, we use the Librilight dataset (60k hours) and the English portion of the Multilingual LibriSpeech (MLS) dataset (45k hours). Both datasets contain 16 kHz original speech data, amounting to a total of approximately 105k hours of training data. For evaluation, we utilize the test-clean subset of LibriSpeech for speech data, selecting audio clips with durations ranging from 5 to 10 seconds to create a test set of 900 clean speech samples at 16 kHz. This is consistent with a training set (Librilight + MLS) and a test set (LibriSpeech test-clean subset) split. The test set size and characteristics are specified.
Hardware Specification | Yes | 16 H100 GPUs are utilized, with an effective batch size of 128.
Software Dependencies | No | The paper does not name specific software libraries or versions; it reports only training hyperparameters: the AdamW optimizer is used for both the autoencoder and discriminator, both with a learning rate of 0.0008. The autoencoder additionally uses weight decay with a coefficient of 0.01. Data is randomly chunked into segments of 5.12 seconds for training. 16 H100 GPUs are utilized, with an effective batch size of 128. Pretraining is conducted for 500k steps, with a decay coefficient of γ = 0.9999 applied to the reconstruction losses. The STFT loss utilizes 2048 bins, a hop size of 512 and a Hanning window. The finetuning stage is conducted for a further 150k steps using the WavLM-Large perceptual reconstruction loss in addition to the adversarial feature-matching loss. In both stages, all loss terms are weighted equally.
Experiment Setup | Yes | The codec model is configured with a patch size of 320 samples at the input. There are two encoder blocks. One directly follows the patching and contains 8 transformer blocks. This is followed by a further encoder block performing 2x downsampling, which contains 20 transformer blocks. The embedding dimension of the transformer blocks is 1024, whilst the reverse bottleneck of the feedforward layer is 4x larger. The head dimension of the self-attention block is 128. Layer norms are configured with ϵ = 1×10⁻², and the sliding attention window is of size 128. The decoder is configured to be symmetrical with the encoder. The resulting model has approximately 950M parameters. The bottleneck is 6-dimensional and trained with 17, 9 and 5 levels for every dimension, randomly chosen. The ensemble discriminator is configured as described in Appendix B.5, with each discriminator having a channel count of 256. We use FlashAttention (Dao et al., 2022) to ensure computational efficiency. The model is trained with FP16 mixed precision. The AdamW optimizer is used for both the autoencoder and discriminator, both with a learning rate of 0.0008. The autoencoder additionally uses weight decay with a coefficient of 0.01. Data is randomly chunked into segments of 5.12 seconds for training. 16 H100 GPUs are utilized, with an effective batch size of 128. Pretraining is conducted for 500k steps, with a decay coefficient of γ = 0.9999 applied to the reconstruction losses. The STFT loss utilizes 2048 bins, a hop size of 512 and a Hanning window. The finetuning stage is conducted for a further 150k steps using the WavLM-Large perceptual reconstruction loss in addition to the adversarial feature-matching loss. In both stages, all loss terms are weighted equally.
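The configuration quoted in the experiment setup (16 kHz audio, 320-sample patches, one further 2x downsample, 6-dimensional FSQ bottleneck trained with 17, 9 and 5 levels per dimension) implies a latent frame rate and a per-configuration bitrate, which can be sketched as a back-of-the-envelope check. This is a hedged illustration, not the authors' code; the exact post-hoc FSQ configurations behind the paper's reported 400 bps and 700 bps operating points are not specified in this excerpt, so the script only shows the general relation bitrate = frame rate × bits per frame.

```python
import math

# Quantities taken from the quoted experiment setup.
SAMPLE_RATE = 16_000   # Hz
PATCH_SIZE = 320       # samples per input patch
DOWNSAMPLE = 2         # extra downsampling in the second encoder block
FSQ_DIMS = 6           # bottleneck dimensionality

# Latent frames per second: 16000 / (320 * 2) = 25 Hz.
frame_rate = SAMPLE_RATE / (PATCH_SIZE * DOWNSAMPLE)

# An FSQ bottleneck with L levels in each of D dimensions carries
# D * log2(L) bits per latent frame.
for levels in (17, 9, 5):  # level counts used during training
    bits_per_frame = FSQ_DIMS * math.log2(levels)
    print(f"{levels:>2} levels: {bits_per_frame:5.2f} bits/frame "
          f"-> {frame_rate * bits_per_frame:6.1f} bps")
```

Note that these raw figures bracket, but do not exactly match, the quoted 400/700 bps, consistent with the paper's mention of post-hoc FSQ reconfiguration.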