SyllableLM: Learning Coarse Semantic Units for Speech Language Models
Authors: Alan Baade, Puyuan Peng, David Harwath
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method produces controllable-rate semantic units at as low as 5Hz and 60bps and achieves SotA in syllabic segmentation and clustering. Using these coarse tokens, we successfully train SyllableLM, a Speech Language Model (SpeechLM) that matches or outperforms current SotA SpeechLMs on a range of spoken language modeling tasks. SyllableLM also achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup. We evaluate the effects of training SpeechLMs on these new units and obtain state-of-the-art results across a wide variety of metrics. |
| Researcher Affiliation | Academia | Alan Baade, Puyuan Peng, David Harwath Department of Computer Science The University of Texas at Austin EMAIL |
| Pseudocode | No | The paper describes methods like LossPred and SylBoost using mathematical equations and textual explanations, but it does not include explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Our code and checkpoints are available at https://www.github.com/alanbaade/SyllableLM. |
| Open Datasets | Yes | We train our tokenizer using LibriSpeech (Panayotov et al., 2015), which contains 960 hours of audio books. We train our SpeechLMs using all of LibriLight (Kahn et al., 2020), which provides roughly 55k hours of speech. |
| Dataset Splits | Yes | We train our tokenizer using LibriSpeech (Panayotov et al., 2015)... we randomly subsample LibriSpeech to a 100 hour train set and train for five epochs and two iterations for all experiments. We train our SpeechLMs using all of LibriLight (Kahn et al., 2020)... We use the development and test sets of LibriSpeech (Panayotov et al., 2015) and follow the approach from Peng et al. (2023)... To evaluate resynthesized speech, we follow AudioLM and measure Word Error Rate (WER) and Character Error Rate (CER) on the set of 4-10 second segments from LibriSpeech test-clean. Like Lakhotia et al. (2021), we generate 10-second continuations from 1000 randomly sampled 3-second crops from LibriSpeech test-clean. |
| Hardware Specification | Yes | We implement all experiments using NVIDIA A40 46GB GPUs with an Intel Xeon Gold 6226R CPU @ 2.90GHz. |
| Software Dependencies | No | The paper mentions 'PyTorch (Paszke et al., 2019)' and 'CUDA (NVIDIA et al., 2020)' but does not specify the version numbers of these software dependencies used in their experiments. |
| Experiment Setup | Yes | All of the SpeechLMs we implement, as well as our Interleaved-Vocoder-LM, follow the OPT (Zhang et al., 2022) architecture and default to using 12 Transformer layers, an embedding dimension of 768, and learned positional embeddings... For all language model pretraining experiments, we randomly crop files to 25 seconds, use a batch size of 80000 tokens, and train for 200k steps... Additional hyperparameters and hardware details are in Appendix A.4. Appendix A.4 includes Table 9: Speech pre-training hyper-parameters (Layers, Embed Dim, MLP Dim, GPUs, Learning rate, Adam β1 / β2, Weight decay, Dropout, LayerDrop, Warmup updates, Batch size (tokens), Updates, Position Embeddings) and Table 10: SylBoost Parameters (L (One-Indexed), Learning rate, Epochs, LibriSpeech Data, Batch Size, Iterations, K-Means clusters (before Agglom.)). |
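The pretraining setup quoted above can be collected into a small configuration sketch. This is a minimal illustration, not the authors' code: the field names (`num_layers`, `batch_size_tokens`, etc.) are assumptions chosen for readability, and only the numeric values come from the paper's Experiment Setup description.

```python
# Hedged sketch of the SpeechLM pretraining setup reported in the paper.
# Field names are illustrative; values are taken from the quoted setup text.
speechlm_config = {
    "architecture": "OPT",                 # Zhang et al., 2022
    "num_layers": 12,                      # Transformer layers
    "embed_dim": 768,                      # embedding dimension
    "positional_embeddings": "learned",
    "max_crop_seconds": 25,                # random file crop length
    "batch_size_tokens": 80_000,           # tokens per update
    "train_steps": 200_000,                # total updates
}

# Back-of-the-envelope token budget implied by these numbers:
total_tokens = speechlm_config["batch_size_tokens"] * speechlm_config["train_steps"]
print(total_tokens)  # 16000000000 tokens seen over pretraining
```

At 80k tokens per update for 200k updates, pretraining processes on the order of 16 billion tokens, which gives a sense of the compute scale behind the reported 30x training-compute reduction.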