Notice: The reproducibility variables underlying each score are classified by an automated LLM-based pipeline and validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Sylber: Syllabic Embedding Representation of Speech from Raw Audio

Authors: Cheol Jun Cho, Nicholas Lee, Akshat Gupta, Dhruv Agarwal, Ethan Chen, Alan Black, Gopala Anumanchipalli

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To this end, we propose a novel SSL framework that induces clean and robust syllabic structures in speech representations. Specifically, we build on top of a previous self-supervised syllable learning model, SDHuBERT (Cho et al., 2024b), and iteratively refine the syllabic segments that naturally arise from the model. ... Sylber outperforms previous approaches in syllable detection and discovery with a more efficient segmentation algorithm with O(n) time complexity. ... We use this model to build a new dynamic speech tokenization scheme that has a significantly lower sampling rate of 4.27 Tok/s on average, a 6-7 times improvement over HuBERT tokens. ... We demonstrate that fully intelligible speech can be reconstructed from syllabic tokens, and that these units are suited for lexical and syntactic understanding. ... We demonstrate that categorical perception arises in Sylber, projecting audio to a more categorical embedding space than previous SSL models.
Researcher Affiliation | Academia | 1 University of California, Berkeley; 2 Carnegie Mellon University
Pseudocode | Yes | A.1.2 GREEDY SEGMENTATION ALGORITHM: Algorithm 1, Greedy Segmentation Algorithm
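The paper's Algorithm 1 is not reproduced on this page, so the following is only a hypothetical sketch of what a single-pass, O(n) greedy segmentation over frame embeddings could look like. The function name, the cosine-similarity merge rule, and the `threshold` parameter are all assumptions for illustration, not the authors' actual procedure.

```python
import numpy as np

def greedy_segment(frames, threshold=0.8):
    """Hypothetical greedy segmentation sketch: one left-to-right pass (O(n)).

    Merges consecutive frame embeddings into a segment while each frame's
    cosine similarity to the running segment mean stays above `threshold`.
    This is an illustration only, NOT the paper's Algorithm 1.
    """
    segments = []                      # list of (start, end) frame indices
    start = 0
    mean = frames[0].astype(float)     # running mean of the current segment
    count = 1
    for i in range(1, len(frames)):
        f = frames[i]
        sim = float(np.dot(mean, f) /
                    (np.linalg.norm(mean) * np.linalg.norm(f) + 1e-9))
        if sim >= threshold:
            count += 1
            mean += (f - mean) / count  # incremental running-mean update
        else:
            segments.append((start, i))
            start, mean, count = i, f.astype(float), 1
    segments.append((start, len(frames)))
    return segments
```

Because each frame is visited once and the running mean is updated incrementally, the pass is linear in the number of frames, matching the O(n) complexity the paper claims for its segmentation.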
Open Source Code | Yes | The code is available here: https://github.com/Berkeley-Speech-Group/sylber.
Open Datasets | Yes | Datasets: LibriSpeech (Panayotov et al., 2015) is used for training Sylber, and k-means clustering. For training the uLMs, we use either LibriSpeech or Libri-Light (Kahn et al., 2020), and separately report performance. LibriTTS-R (Koizumi et al., 2023) is used for training the CFM models. ... We use the Fisher corpus (Cieri et al., 2004), an English conversational dataset... We used a Spanish subset of Multilingual LibriSpeech (MLS) (Pratap et al., 2020) and AISHELL-3 (Shi et al., 2021) for Mandarin.
Dataset Splits | Yes | For training the uLMs, we use either LibriSpeech or Libri-Light (Kahn et al., 2020), and separately report performance. ... These are evaluated on the test-clean split of LibriTTS-R. ... When training on Libri-Light, we use 96% of the data for training and 2% each is held out for validation and test. ... We use the syllable labels of dev and test splits of LibriSpeech created by forced-aligned phonemes and syllabification of them.
Hardware Specification | Yes | For training SPARC, we use LibriTTS-R using a single A5000-24GB GPU. ... with a single A6000-48GB GPU. ... We used a single A6000-48GB GPU for training. ... We use a single A6000-48GB GPU for training uLMs on LibriSpeech and two of them for training on Libri-Light. ... Every experiment was run on a single A6000-48GB GPU with 2 AMD EPYC 7513 32-Core Processors.
Software Dependencies | No | The paper mentions several software tools, including k-means clustering, SentencePiece, Whisper (Radford et al., 2023), an RNN-T implementation in PyTorch, S3PRL, the Montreal Forced Aligner (MFA) (McAuliffe et al., 2017), the syllabification script by Gorman (2013), and Silabeador (Sanz-Lázaro). However, specific version numbers for these software components, which would be necessary for full reproducibility, are not provided in the main text or appendices.
Experiment Setup | Yes | Sylber has the same architecture as HuBERT with a CNN feature extractor followed by a Transformer encoder. Based on the observation that the ninth layer of SDHuBERT best encodes syllables (Cho et al., 2024b), we use a 9-layer transformer and initialize weights with SDHuBERT up to that layer. ... The model is trained for 115K steps in the first stage and further trained for 50K steps in the second stage. We use a batch size of 64 and each data point is randomly cropped to be 5 seconds... The learning rate is set as 1e-4 with initial 500 warmup updates for the first stage and 5e-5 for the second stage. ... For training, the learning rate is fixed as 1e-4, with a batch size of 64 and 200k updates.
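The quoted schedule (1e-4 with 500 warmup updates in the first stage, a flat 5e-5 in the second) can be written as a small helper. This is a hedged illustration of the quoted numbers only: the linear warmup shape and the constant post-warmup rate are assumptions, since the excerpt does not specify the warmup curve or any decay.

```python
def lr_at_step(step, stage=1):
    """Sketch of the quoted two-stage learning-rate schedule.

    Stage 1: base LR 1e-4 with 500 warmup updates (linear warmup assumed,
    constant afterwards -- the excerpt does not state a post-warmup decay).
    Stage 2: flat 5e-5, as quoted.
    """
    if stage == 1:
        return 1e-4 * min(1.0, step / 500)
    return 5e-5
```

Under these assumptions, the rate ramps from 0 to 1e-4 over the first 500 updates of stage one, then the second stage trains at the lower constant 5e-5.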