SECodec: Structural Entropy-based Compressive Speech Representation Codec for Speech Language Models

Authors: Linqin Wang, Yaping Liu, Zhengtao Yu, Shengxiang Gao, Cunli Mao, Yuxin Huang, Wenjun Wang, Ling Dong

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results demonstrate that SECodec performs comparably to EnCodec in speech reconstruction, and SESLM surpasses VALL-E in zero-shot text-to-speech tasks. ... (Section 4: Experiment; 4.1 Experimental Setup; 4.2 Metrics; 4.3 Main Results; 4.4 Analysis: Choice of Nodes and Edges for SE)
Researcher Affiliation Academia Linqin Wang1,2, Yaping Liu1,2, Zhengtao Yu1,2*, Shengxiang Gao1,2, Cunli Mao1,2, Yuxin Huang1,2, Wenjun Wang1,2, Ling Dong1,2 — 1 Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China; 2 Yunnan Key Laboratory of Artificial Intelligence, Kunming, China
Pseudocode Yes Algorithm 1: Codebook construction via hierarchical and disentangled 2D SE minimization.
Open Source Code Yes Code https://github.com/wlq2019/SECodec
Open Datasets Yes For SECodec training, we use the LibriSpeech (Panayotov et al. 2015) dataset. ... For zero-shot TTS, we train AR and NAR models on the English subset of the Multilingual LibriSpeech dataset (Pratap et al. 2020)... For evaluating SESLM, we perform zero-shot text-to-speech assessments using the VCTK (Veaux et al. 2016) dataset...
Dataset Splits Yes At each training iteration, a 3.2-second segment is randomly cropped from the speech samples. ... We select speech samples with durations ranging from 3 to 14 seconds for the training data. ... For speech reconstruction evaluation, we randomly sampled 300 speech samples from the LibriSpeech test set... For evaluating SESLM, we perform zero-shot text-to-speech assessments using the VCTK (Veaux et al. 2016) dataset... For each speaker, we randomly select a 3-second utterance as the prompt and use the text from a different utterance as the input.
Hardware Specification Yes We train the models on a single 3090Ti GPU with a total batch size of 16.
Software Dependencies No The paper does not provide specific software dependencies with version numbers.
Experiment Setup Yes During the training stage, we randomly clip a continuous 3.2-second segment from an utterance, which is treated as a training sample. Before being fed into the encoder, the segment undergoes root-mean-square (RMS) normalization. The reconstructed output is rescaled using inverse normalization to calculate losses. We train the models on a single 3090Ti GPU with a total batch size of 16. Under the adversarial training framework, we update the codec model 300,000 times. To prevent the discriminator from becoming too dominant, we only update it when its loss exceeds that of the codec model.
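The preprocessing and discriminator-gating steps described in the setup can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 16 kHz sample rate and the helper names (`random_crop`, `rms_normalize`, `should_update_discriminator`) are assumptions; only the 3.2 s crop, the RMS normalization with inverse rescaling, and the "update the discriminator only when its loss exceeds the codec's" rule come from the paper.

```python
import numpy as np

SAMPLE_RATE = 16000      # assumed; the excerpt does not state the sample rate
SEGMENT_SECONDS = 3.2    # training segments are 3.2 s random crops (per the paper)


def random_crop(wave: np.ndarray, seconds: float = SEGMENT_SECONDS,
                sr: int = SAMPLE_RATE) -> np.ndarray:
    """Randomly crop a contiguous segment from an utterance (zero-pad if short)."""
    seg_len = int(seconds * sr)
    if len(wave) <= seg_len:
        return np.pad(wave, (0, seg_len - len(wave)))
    start = np.random.randint(0, len(wave) - seg_len + 1)
    return wave[start:start + seg_len]


def rms_normalize(wave: np.ndarray, eps: float = 1e-8):
    """RMS-normalize the segment; return the scale so the reconstructed
    output can be inverse-rescaled before computing losses."""
    rms = np.sqrt(np.mean(wave ** 2)) + eps
    return wave / rms, rms


def should_update_discriminator(disc_loss: float, codec_loss: float) -> bool:
    """Gate the discriminator update: only step it when its loss exceeds
    the codec model's, preventing it from becoming too dominant."""
    return disc_loss > codec_loss
```

In a training loop, each iteration would crop and normalize a fresh segment, run the codec forward pass on the normalized audio, rescale the reconstruction by the returned `rms` before computing losses, and call `should_update_discriminator` to decide whether the discriminator takes an optimizer step.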