UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation

Authors: Alexander H. Liu, Sang-gil Lee, Chao-Han Huck Yang, Yuan Gong, Yu-Chiang Frank Wang, James R. Glass, Rafael Valle, Bryan Catanzaro

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our proposed framework, UniWav, is the first to present a unified pre-training framework for speech that is efficient on both discriminative and generative tasks. ... In our experiments, we fine-tune the pre-trained model for speech recognition and in-context text-to-speech synthesis, achieving results comparable to state-of-the-art methods in each task. ... Finally, we provide observations and insights on unified pre-training through an ablation study and analysis. Supporting evidence appears in Section 3 (Experiments): Table 1 compares speech foundation models, reporting word error rate (WER) on the test-other subset of LibriSpeech (Panayotov et al., 2015) when fine-tuning with 960 or 100 hours of labeled data, and ASR-measured WER (ASR-WER) plus speaker similarity (Sim.) for in-context text-to-speech; Table 2 reports speech tokenization and resynthesis results.
Researcher Affiliation | Collaboration | Alexander H. Liu (MIT CSAIL), Sang-gil Lee (NVIDIA), Chao-Han Huck Yang (NVIDIA), Yuan Gong (MIT CSAIL), Yu-Chiang Frank Wang (NVIDIA), James R. Glass (MIT CSAIL), Rafael Valle (NVIDIA), Bryan Catanzaro (NVIDIA)
Pseudocode | No | The paper describes the methodology using textual explanations, mathematical equations (e.g., Eq. 1-10), and a diagram (Figure 1), but does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | Audio demo page: https://alexander-h-liu.github.io/uniwav-demo.github.io/. This URL points to a demo page, not a source code repository. The paper contains no statement or link confirming release of source code for the described methodology.
Open Datasets | Yes | UniWav is trained on Libri-Light (Kahn et al., 2019), an audiobook corpus with around 60k hours of untranscribed English speech sampled at 16 kHz. ... We fine-tune UniWav on LibriSpeech to synthesize speech, conditioned on text and audio prompts that provide speaker information.
Dataset Splits | Yes | We fine-tune our encoder using either 960 hours of transcribed speech from LibriSpeech (Panayotov et al., 2015) or 100 hours from its train-clean-100 subset. ... For evaluation, we adapt the protocol introduced by Wang et al. (2023) to perform speaker-conditioned text-to-speech on the test-clean subset with 3-second enrollment. ... All models considered are trained on LibriSpeech and tested on the dev and test subsets including both clean and other splits.
Hardware Specification | Yes | Pre-training is done on 16 H100 GPUs taking around 9 days. ... We fine-tune our model with 1e-5 learning rate for 150k steps with logit-normal time sampling (Esser et al., 2024) on 8 A100 GPUs. ... We fine-tune the decoder with the quantized encoder representation on 8 A100 GPUs for 150k steps with a 1e-4 learning rate.
Software Dependencies | No | UniWav is trained using the Adam optimizer (Kingma & Ba, 2014) with a cosine learning rate schedule peaking at 2e-4 for a total of 600k updates, including 10k warmup steps. ... For solving the initial value problems in Eq. (5), we use the midpoint method from torchdiffeq (Chen, 2018). While specific optimizers and numerical methods are mentioned, no version numbers are given for torchdiffeq or for general software such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | Model: Both the encoder and the decoder follow the Transformer architecture (Vaswani et al., 2017), with 16 attention heads, hidden size d = 1024, and feed-forward networks with 4096 dimensions. ... We stack 24 layers for the encoder and 12 layers for the decoder. ... For the encoder teacher model, γteacher is set to increase from 0.9997 to 1.0 in the first 400k updates, and γcode is set to 0.9. The top K = 10 layers of the teacher model are considered a learning target for the encoder, and the representation from each layer is clustered to a codebook of size V = 256. ... For masking, each input frame has an 8% chance of being replaced with a learnable mask embedding. We mask 10 consecutive frames if a frame is sampled to be masked. ... We set the decoder loss weight λ to 0.25. ... UniWav is trained using the Adam optimizer (Kingma & Ba, 2014) with a cosine learning rate schedule peaking at 2e-4 for a total of 600k updates, including 10k warmup steps. The batch size is 312.5 seconds per GPU, and samples are randomly cropped to cap at 20 seconds. The model is trained with bf16 and gradient clipping of 1.0.