SyllableLM: Learning Coarse Semantic Units for Speech Language Models

Authors: Alan Baade, Puyuan Peng, David Harwath

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our method produces controllable-rate semantic units at as low as 5Hz and 60bps and achieves SotA in syllabic segmentation and clustering. Using these coarse tokens, we successfully train SyllableLM, a Speech Language Model (SpeechLM) that matches or outperforms current SotA SpeechLMs on a range of spoken language modeling tasks. SyllableLM also achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup. We evaluate the effects of training SpeechLMs on these new units and obtain state-of-the-art results across a wide variety of metrics.
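The quoted 5Hz and 60bps figures imply a fixed per-unit bit budget. A quick sanity check of that arithmetic (assuming, as a simplification not stated in the quote, that the bitrate covers the discrete unit stream alone):

```python
import math

def bits_per_unit(bitrate_bps: float, unit_rate_hz: float) -> float:
    """Bits available per discrete unit at a given unit rate."""
    return bitrate_bps / unit_rate_hz

def max_codebook_size(bitrate_bps: float, unit_rate_hz: float) -> int:
    """Largest vocabulary addressable within the per-unit bit budget."""
    return 2 ** math.floor(bits_per_unit(bitrate_bps, unit_rate_hz))

# At 5 Hz and 60 bps, each unit carries 12 bits,
# enough to index a codebook of up to 4096 entries.
print(bits_per_unit(60, 5))      # 12.0
print(max_codebook_size(60, 5))  # 4096
```

So a 5Hz unit stream at 60bps is consistent with a vocabulary of at most a few thousand clusters.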
Researcher Affiliation | Academia | Alan Baade, Puyuan Peng, David Harwath, Department of Computer Science, The University of Texas at Austin, EMAIL
Pseudocode | No | The paper describes methods like LossPred and SylBoost using mathematical equations and textual explanations, but it does not include explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Our code and checkpoints are available at https://www.github.com/alanbaade/SyllableLM.
Open Datasets | Yes | We train our tokenizer using LibriSpeech (Panayotov et al., 2015), which contains 960 hours of audio books. We train our SpeechLMs using all of Libri-Light (Kahn et al., 2020), which provides roughly 55k hours of speech.
Dataset Splits | Yes | We train our tokenizer using LibriSpeech (Panayotov et al., 2015)... we randomly subsample LibriSpeech to a 100-hour train set and train for five epochs and two iterations for all experiments. We train our SpeechLMs using all of Libri-Light (Kahn et al., 2020)... We use the development and test sets of LibriSpeech (Panayotov et al., 2015) and follow the approach from Peng et al. (2023)... To evaluate resynthesized speech, we follow AudioLM and measure Word Error Rate (WER) and Character Error Rate (CER) on the set of 4-10 second segments from LibriSpeech test-clean. Like Lakhotia et al. (2021), we generate 10-second continuations from 1000 randomly sampled 3-second crops from LibriSpeech test-clean.
Hardware Specification | Yes | We implement all experiments using NVIDIA A40 46GB GPUs with an Intel Xeon Gold 6226R CPU @ 2.90GHz.
Software Dependencies | No | The paper mentions 'PyTorch (Paszke et al., 2019)' and 'CUDA (NVIDIA et al., 2020)' but does not specify the version numbers of these software dependencies used in their experiments.
Experiment Setup | Yes | All of the SpeechLMs we implement, as well as our Interleaved-Vocoder-LM, follow the OPT (Zhang et al., 2022) architecture and default to using 12 Transformer layers, an embedding dimension of 768, and learned positional embeddings... For all language model pretraining experiments, we randomly crop files to 25 seconds, use a batch size of 80000 tokens, and train for 200k steps... Additional hyperparameters and hardware details are in Appendix A.4. Appendix A.4 includes Table 9: Speech pre-training hyper-parameters (Layers, Embed Dim, MLP Dim, GPUs, Learning rate, Adam β1 / β2, Weight decay, Dropout, LayerDrop, Warmup updates, Batch size (tokens), Updates, Position Embeddings) and Table 10: SylBoost parameters (L (One-Indexed), Learning rate, Epochs, LibriSpeech Data, Batch Size, Iterations, K-Means clusters (before Agglom.)).
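The quoted setup (batch size in tokens, fixed step count, fixed-length crops) pins down the total pretraining budget by simple arithmetic. A minimal sketch of that calculation, assuming the batch size of 80000 counts discrete units and a 5Hz unit rate (the rate named in the abstract; whether every run uses exactly 5Hz is an assumption here):

```python
def training_token_budget(batch_tokens: int, steps: int) -> int:
    """Total tokens consumed over pretraining."""
    return batch_tokens * steps

def tokens_per_crop(crop_seconds: float, unit_rate_hz: float) -> int:
    """Discrete units produced by one audio crop at a given unit rate."""
    return int(crop_seconds * unit_rate_hz)

total = training_token_budget(80_000, 200_000)  # 16 billion tokens total
per_crop = tokens_per_crop(25, 5)               # 125 units per 25 s crop
crops_per_batch = 80_000 // per_crop            # roughly 640 crops per step
print(total, per_crop, crops_per_batch)
```

Under these assumptions, each optimizer step sees on the order of 640 full-length 25-second crops.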