CR-CTC: Consistency regularization on CTC for improved speech recognition
Authors: Zengwei Yao, Wei Kang, Xiaoyu Yang, Fangjun Kuang, Liyong Guo, Han Zhu, Zengrui Jin, Zhaoqing Li, Long Lin, Daniel Povey
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on LibriSpeech, Aishell-1, and GigaSpeech datasets demonstrate the effectiveness of our CR-CTC. It significantly improves the CTC performance, achieving state-of-the-art results comparable to those attained by transducer or systems combining CTC and attention-based encoder-decoder (CTC/AED). |
| Researcher Affiliation | Industry | Zengwei Yao, Wei Kang, Xiaoyu Yang, Fangjun Kuang, Liyong Guo, Han Zhu, Zengrui Jin, Zhaoqing Li, Long Lin, Daniel Povey Xiaomi Corp., Beijing, China EMAIL |
| Pseudocode | No | The paper describes the methodology using mathematical formulations and descriptive text, such as in Section 3.2 'OUR APPROACH: CONSISTENCY-REGULARIZED CTC' and Section A.1 'SMOOTH-REGULARIZED CTC', but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release our code at https://github.com/k2-fsa/icefall. |
| Open Datasets | Yes | Datasets. To evaluate the effectiveness of our proposed CR-CTC, we conduct experiments on three publicly available ASR datasets: 1) LibriSpeech (Panayotov et al., 2015), which contains 1000 hours of English speech; 2) Aishell-1 (Bu et al., 2017), which consists of 170 hours of Mandarin speech; 3) GigaSpeech (Chen et al., 2021a), comprising 10000 hours of English speech. |
| Dataset Splits | Yes | Datasets. To evaluate the effectiveness of our proposed CR-CTC, we conduct experiments on three publicly available ASR datasets: 1) LibriSpeech (Panayotov et al., 2015), which contains 1000 hours of English speech; 2) Aishell-1 (Bu et al., 2017), which consists of 170 hours of Mandarin speech; 3) GigaSpeech (Chen et al., 2021a), comprising 10000 hours of English speech. (The paper then presents results tables using standard splits such as 'test-clean', 'test-other', 'dev', and 'test' for these datasets.) |
| Hardware Specification | Yes | Training configurations, including the number of GPUs and training epochs, for the LibriSpeech, Aishell-1, and GigaSpeech datasets are presented in Table 8, Table 9, and Table 10, respectively. These tables include the specification: '80G NVIDIA Tesla A100'. |
| Software Dependencies | No | Our experiments are performed using the icefall framework, with the Lhotse toolkit (Żelasko et al., 2021) for data preparation. While frameworks are mentioned, specific version numbers for software libraries or dependencies are not provided. |
| Experiment Setup | Yes | For regular ASR recipes in icefall, default parameter settings of SpecAugment (Park et al., 2019) include a time warping factor of 80, 2 frequency masking regions with a maximum width of 27, and 10 time masking regions with a maximum width of 100, along with a maximum masking fraction of 15% specifically for time masking. In our CR-CTC systems, we utilize a larger amount of time masking by increasing both the number of time masking regions and the maximum masking fraction by a factor of 2.5. Speed perturbation (Ko et al., 2015) with factors 0.9, 1.0 and 1.1 is applied... By default, we set α in Equation 3 to 0.2. As CR-CTC requires two forward passes during training, we train CR-CTC models with half the batch size and half the number of epochs compared to CTC models, ensuring a fair comparison in terms of training cost. ... For CTC and CR-CTC systems, we use prefix search decoding (Graves et al., 2006) with a beam size of 4. |
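The training objective quoted above (CTC losses on two differently augmented views of the same utterance, plus a consistency term weighted by α = 0.2) can be sketched as follows. This is a minimal NumPy illustration, not the paper's reference implementation: the helper names are hypothetical, and the choice of a frame-level symmetric KL divergence as the consistency term (and the omission of stop-gradient handling) is an assumption made for clarity.

```python
import numpy as np

def consistency_loss(log_p1, log_p2):
    """Symmetric KL divergence between two CTC posterior sequences,
    averaged over frames. log_p1, log_p2: (T, V) arrays of per-frame
    log-probabilities from the two augmented views."""
    p1, p2 = np.exp(log_p1), np.exp(log_p2)
    kl_12 = np.sum(p1 * (log_p1 - log_p2), axis=-1)  # KL(p1 || p2) per frame
    kl_21 = np.sum(p2 * (log_p2 - log_p1), axis=-1)  # KL(p2 || p1) per frame
    return 0.5 * np.mean(kl_12 + kl_21)

def cr_ctc_loss(ctc_loss_1, ctc_loss_2, log_p1, log_p2, alpha=0.2):
    """Combined objective: average of the two views' CTC losses
    plus alpha times the consistency-regularization term."""
    return 0.5 * (ctc_loss_1 + ctc_loss_2) + alpha * consistency_loss(log_p1, log_p2)
```

Because the two views share the encoder, each training step costs two forward passes, which is why the quoted setup halves the batch size and epoch count relative to plain CTC.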