SECodec: Structural Entropy-based Compressive Speech Representation Codec for Speech Language Models

Authors: Linqin Wang, Yaping Liu, Zhengtao Yu, Shengxiang Gao, Cunli Mao, Yuxin Huang, Wenjun Wang, Ling Dong

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results demonstrate that SECodec performs comparably to EnCodec in speech reconstruction, and SESLM surpasses VALL-E in zero-shot text-to-speech tasks. ... (Section 4: Experiment; 4.1 Experimental Setup; 4.2 Metrics; 4.3 Main Results; 4.4 Analysis: Choice of Nodes and Edges for SE)
Researcher Affiliation Academia Linqin Wang1,2, Yaping Liu1,2, Zhengtao Yu1,2*, Shengxiang Gao1,2, Cunli Mao1,2, Yuxin Huang1,2, Wenjun Wang1,2, Ling Dong1,2 — 1 Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China; 2 Yunnan Key Laboratory of Artificial Intelligence, Kunming, China
Pseudocode Yes Algorithm 1: Codebook construction via hierarchical and disentangled 2D SE minimization.
Open Source Code Yes Code https://github.com/wlq2019/SECodec
Open Datasets Yes For SECodec training, we use the LibriSpeech (Panayotov et al. 2015) dataset. ... For zero-shot TTS, we train AR and NAR models on the English subset of the Multilingual LibriSpeech dataset (Pratap et al. 2020)... For evaluating SESLM, we perform zero-shot text-to-speech assessments using the VCTK (Veaux et al. 2016) dataset...
Dataset Splits Yes At each training iteration, a 3.2-second segment is randomly cropped from the speech samples. ... We select speech samples with durations ranging from 3 to 14 seconds for the training data. ... For speech reconstruction evaluation, we randomly sampled 300 speech samples from the LibriSpeech test set... For evaluating SESLM, we perform zero-shot text-to-speech assessments using the VCTK (Veaux et al. 2016) dataset... For each speaker, we randomly select a 3-second utterance as the prompt and use the text from a different utterance as the input.
Hardware Specification Yes We train the models on a single 3090Ti GPU with a total batch size of 16.
Software Dependencies No The paper does not provide specific software dependencies with version numbers.
Experiment Setup Yes During the training stage, we randomly clip a continuous 3.2-second segment from an utterance, which is treated as a training sample. Before being fed into the encoder, the segment undergoes root-mean-square (RMS) normalization. The reconstructed output is rescaled using inverse normalization to calculate losses. We train the models on a single 3090Ti GPU with a total batch size of 16. Under the adversarial training framework, we update the codec model 300,000 times. To prevent the discriminator from becoming too dominant, we only update it when its loss exceeds that of the codec model.
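The preprocessing and discriminator-gating steps described in the setup can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 16 kHz sample rate and the helper names (`random_crop`, `rms_normalize`, `should_update_discriminator`) are assumptions; only the 3.2 s crop, the RMS normalization with inverse rescaling, and the "update the discriminator only when its loss exceeds the codec's" rule come from the paper.

```python
import numpy as np

SAMPLE_RATE = 16000      # assumed; the excerpt does not state the sample rate
SEGMENT_SECONDS = 3.2    # training segments are 3.2 s random crops (per the paper)


def random_crop(wave: np.ndarray, seconds: float = SEGMENT_SECONDS,
                sr: int = SAMPLE_RATE) -> np.ndarray:
    """Randomly crop a contiguous segment from an utterance (zero-pad if short)."""
    seg_len = int(seconds * sr)
    if len(wave) <= seg_len:
        return np.pad(wave, (0, seg_len - len(wave)))
    start = np.random.randint(0, len(wave) - seg_len + 1)
    return wave[start:start + seg_len]


def rms_normalize(wave: np.ndarray, eps: float = 1e-8):
    """RMS-normalize the segment; return the scale so the reconstructed
    output can be inverse-rescaled before computing losses."""
    rms = np.sqrt(np.mean(wave ** 2)) + eps
    return wave / rms, rms


def should_update_discriminator(disc_loss: float, codec_loss: float) -> bool:
    """Gate the discriminator update: only step it when its loss exceeds
    the codec model's, preventing it from becoming too dominant."""
    return disc_loss > codec_loss
```

In a training loop, each iteration would crop and normalize a fresh segment, run the codec forward pass on the normalized audio, rescale the reconstruction by the returned `rms` before computing losses, and call `should_update_discriminator` to decide whether the discriminator takes an optimizer step.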