SyllableLM: Learning Coarse Semantic Units for Speech Language Models

Authors: Alan Baade, Puyuan Peng, David Harwath

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our method produces controllable-rate semantic units at as low as 5Hz and 60bps and achieves SotA in syllabic segmentation and clustering. Using these coarse tokens, we successfully train SyllableLM, a Speech Language Model (SpeechLM) that matches or outperforms current SotA SpeechLMs on a range of spoken language modeling tasks. SyllableLM also achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup. We evaluate the effects of training SpeechLMs on these new units and obtain state-of-the-art results across a wide variety of metrics.
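The quoted 5Hz and 60bps figures imply a fixed per-unit bit budget. A quick sanity check of that arithmetic (assuming, as a simplification not stated in the quote, that the bitrate covers the discrete unit stream alone):

```python
import math

def bits_per_unit(bitrate_bps: float, unit_rate_hz: float) -> float:
    """Bits available per discrete unit at a given unit rate."""
    return bitrate_bps / unit_rate_hz

def max_codebook_size(bitrate_bps: float, unit_rate_hz: float) -> int:
    """Largest vocabulary addressable within the per-unit bit budget."""
    return 2 ** math.floor(bits_per_unit(bitrate_bps, unit_rate_hz))

# At 5 Hz and 60 bps, each unit carries 12 bits,
# enough to index a codebook of up to 4096 entries.
print(bits_per_unit(60, 5))      # 12.0
print(max_codebook_size(60, 5))  # 4096
```

So a 5Hz unit stream at 60bps is consistent with a vocabulary of at most a few thousand clusters.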
Researcher Affiliation | Academia | Alan Baade, Puyuan Peng, David Harwath, Department of Computer Science, The University of Texas at Austin, EMAIL
Pseudocode | No | The paper describes methods like LossPred and SylBoost using mathematical equations and textual explanations, but it does not include explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Our code and checkpoints are available at https://www.github.com/alanbaade/SyllableLM.
Open Datasets | Yes | We train our tokenizer using LibriSpeech (Panayotov et al., 2015), which contains 960 hours of audio books. We train our SpeechLMs using all of Libri-Light (Kahn et al., 2020), which provides roughly 55k hours of speech.
Dataset Splits | Yes | We train our tokenizer using LibriSpeech (Panayotov et al., 2015)... we randomly subsample LibriSpeech to a 100-hour train set and train for five epochs and two iterations for all experiments. We train our SpeechLMs using all of Libri-Light (Kahn et al., 2020)... We use the development and test sets of LibriSpeech (Panayotov et al., 2015) and follow the approach from Peng et al. (2023)... To evaluate resynthesized speech, we follow AudioLM and measure Word Error Rate (WER) and Character Error Rate (CER) on the set of 4-10 second segments from LibriSpeech test-clean. Like Lakhotia et al. (2021), we generate 10-second continuations from 1000 randomly sampled 3-second crops from LibriSpeech test-clean.
Hardware Specification | Yes | We implement all experiments using NVIDIA A40 46GB GPUs with an Intel Xeon Gold 6226R CPU @ 2.90GHz.
Software Dependencies | No | The paper mentions 'PyTorch (Paszke et al., 2019)' and 'CUDA (NVIDIA et al., 2020)' but does not specify the version numbers of these software dependencies used in their experiments.
Experiment Setup | Yes | All of the SpeechLMs we implement, as well as our Interleaved-Vocoder-LM, follow the OPT (Zhang et al., 2022) architecture and default to using 12 Transformer layers, an embedding dimension of 768, and learned positional embeddings... For all language model pretraining experiments, we randomly crop files to 25 seconds, use a batch size of 80000 tokens, and train for 200k steps... Additional hyperparameters and hardware details are in Appendix A.4. Appendix A.4 includes Table 9: Speech pre-training hyper-parameters (Layers, Embed Dim, MLP Dim, GPUs, Learning rate, Adam β1 / β2, Weight decay, Dropout, LayerDrop, Warmup updates, Batch size (tokens), Updates, Position Embeddings) and Table 10: SylBoost parameters (L (One-Indexed), Learning rate, Epochs, LibriSpeech Data, Batch Size, Iterations, K-Means clusters (before Agglom.)).
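The quoted setup (batch size in tokens, fixed step count, fixed-length crops) pins down the total pretraining budget by simple arithmetic. A minimal sketch of that calculation, assuming the batch size of 80000 counts discrete units and a 5Hz unit rate (the rate named in the abstract; whether every run uses exactly 5Hz is an assumption here):

```python
def training_token_budget(batch_tokens: int, steps: int) -> int:
    """Total tokens consumed over pretraining."""
    return batch_tokens * steps

def tokens_per_crop(crop_seconds: float, unit_rate_hz: float) -> int:
    """Discrete units produced by one audio crop at a given unit rate."""
    return int(crop_seconds * unit_rate_hz)

total = training_token_budget(80_000, 200_000)  # 16 billion tokens total
per_crop = tokens_per_crop(25, 5)               # 125 units per 25 s crop
crops_per_batch = 80_000 // per_crop            # roughly 640 crops per step
print(total, per_crop, crops_per_batch)
```

Under these assumptions, each optimizer step sees on the order of 640 full-length 25-second crops.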