SECodec: Structural Entropy-based Compressive Speech Representation Codec for Speech Language Models
Authors: Linqin Wang, Yaping Liu, Zhengtao Yu, Shengxiang Gao, Cunli Mao, Yuxin Huang, Wenjun Wang, Ling Dong
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that SECodec performs comparably to EnCodec in speech reconstruction, and SESLM surpasses VALL-E in zero-shot text-to-speech tasks. ... 4 Experiment; 4.1 Experimental Setup; 4.2 Metrics; 4.3 Main Results; 4.4 Analysis: Choice of Nodes and Edges for SE |
| Researcher Affiliation | Academia | Linqin Wang 1,2, Yaping Liu 1,2, Zhengtao Yu 1,2 *, Shengxiang Gao 1,2, Cunli Mao 1,2, Yuxin Huang 1,2, Wenjun Wang 1,2, Ling Dong 1,2 — 1 Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China; 2 Yunnan Key Laboratory of Artificial Intelligence, Kunming, China |
| Pseudocode | Yes | Algorithm 1: Codebook construction via hierarchical and disentangled 2D SE minimization. |
| Open Source Code | Yes | Code https://github.com/wlq2019/SECodec |
| Open Datasets | Yes | For SECodec training, we use the LibriSpeech (Panayotov et al. 2015) dataset. ... For zero-shot TTS, we train AR and NAR models on the English subset of the Multilingual LibriSpeech dataset (Pratap et al. 2020)... For evaluating SESLM, we perform zero-shot text-to-speech assessments using the VCTK (Veaux et al. 2016) dataset... |
| Dataset Splits | Yes | At each training iteration, a 3.2 second segment is randomly cropped from the speech samples. ... We select speech samples with durations ranging from 3 to 14 seconds for the training data. ... For speech reconstruction evaluation, we randomly sampled 300 speech samples from the LibriSpeech test set... For evaluating SESLM, we perform zero-shot text-to-speech assessments using the VCTK (Veaux et al. 2016) dataset... For each speaker, we randomly select a 3-second utterance as the prompt and use the text from a different utterance as the input. |
| Hardware Specification | Yes | We train the models on a single 3090Ti GPU with a total batch size of 16. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | During the training stage, we randomly clip a continuous 3.2-second segment from an utterance, which is treated as one training sample. Before being fed into the encoder, the segment undergoes root-mean-square (RMS) normalization; the reconstructed output is rescaled by inverse normalization before losses are computed. We train the models on a single 3090Ti GPU with a total batch size of 16. Under the adversarial training framework, we update the codec model 300,000 times. To prevent the discriminator from becoming too dominant, we only update it when its loss exceeds that of the codec model. |
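The preprocessing described in the Experiment Setup row (random 3.2-second crop, RMS normalization before encoding, inverse rescaling before the loss) can be sketched as below. This is a minimal illustration, not the authors' code; the 16 kHz sample rate is an assumption, since the excerpt does not state it.

```python
import numpy as np

SAMPLE_RATE = 16000      # assumed sample rate (not stated in this excerpt)
SEGMENT_SECONDS = 3.2    # crop length reported in the paper

def random_crop(wave, sample_rate=SAMPLE_RATE, seconds=SEGMENT_SECONDS, rng=None):
    """Randomly clip a continuous segment from an utterance."""
    rng = rng or np.random.default_rng()
    n = int(sample_rate * seconds)
    start = int(rng.integers(0, len(wave) - n + 1))
    return wave[start:start + n]

def rms_normalize(segment, eps=1e-8):
    """RMS-normalize a waveform; return the scaled segment and its RMS gain."""
    rms = float(np.sqrt(np.mean(segment ** 2))) + eps
    return segment / rms, rms

# Usage: crop -> normalize -> (encode/decode) -> inverse-rescale before the loss.
wave = np.random.randn(5 * SAMPLE_RATE).astype(np.float32)
segment = random_crop(wave)
normalized, rms = rms_normalize(segment)
reconstructed = normalized            # placeholder for the codec's output
rescaled = reconstructed * rms        # inverse normalization for loss computation
```

Under this sketch, the inverse rescaling exactly recovers the original amplitude scale of the cropped segment, so reconstruction losses are measured in the signal's native range rather than the normalized one.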