CR-CTC: Consistency regularization on CTC for improved speech recognition
Authors: Zengwei Yao, Wei Kang, Xiaoyu Yang, Fangjun Kuang, Liyong Guo, Han Zhu, Zengrui Jin, Zhaoqing Li, Long Lin, Daniel Povey
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on LibriSpeech, Aishell-1, and GigaSpeech datasets demonstrate the effectiveness of our CR-CTC. It significantly improves the CTC performance, achieving state-of-the-art results comparable to those attained by transducer or systems combining CTC and attention-based encoder-decoder (CTC/AED). |
| Researcher Affiliation | Industry | Zengwei Yao, Wei Kang, Xiaoyu Yang, Fangjun Kuang, Liyong Guo, Han Zhu, Zengrui Jin, Zhaoqing Li, Long Lin, Daniel Povey Xiaomi Corp., Beijing, China EMAIL |
| Pseudocode | No | The paper describes the methodology using mathematical formulations and descriptive text, such as in Section 3.2 'OUR APPROACH: CONSISTENCY-REGULARIZED CTC' and Section A.1 'SMOOTH-REGULARIZED CTC', but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release our code at https://github.com/k2-fsa/icefall. |
| Open Datasets | Yes | Datasets. To evaluate the effectiveness of our proposed CR-CTC, we conduct experiments on three publicly available ASR datasets: 1) LibriSpeech (Panayotov et al., 2015), which contains 1000 hours of English speech; 2) Aishell-1 (Bu et al., 2017), which consists of 170 hours of Mandarin speech; 3) GigaSpeech (Chen et al., 2021a), comprising 10000 hours of English speech. |
| Dataset Splits | Yes | Datasets. To evaluate the effectiveness of our proposed CR-CTC, we conduct experiments on three publicly available ASR datasets: 1) LibriSpeech (Panayotov et al., 2015), which contains 1000 hours of English speech; 2) Aishell-1 (Bu et al., 2017), which consists of 170 hours of Mandarin speech; 3) GigaSpeech (Chen et al., 2021a), comprising 10000 hours of English speech. (The paper then presents results tables using standard splits such as 'test-clean', 'test-other', 'dev', and 'test' for these datasets.) |
| Hardware Specification | Yes | Training configurations, including the number of GPUs and training epochs, for the LibriSpeech, Aishell-1, and GigaSpeech datasets are presented in Table 8, Table 9, and Table 10, respectively. These tables include the specification: '80G NVIDIA Tesla A100'. |
| Software Dependencies | No | Our experiments are performed using the icefall framework, with the Lhotse toolkit (Żelasko et al., 2021) for data preparation. While frameworks are mentioned, specific version numbers for software libraries or dependencies are not provided. |
| Experiment Setup | Yes | For regular ASR recipes in icefall, default parameter settings of SpecAugment (Park et al., 2019) include a time warping factor of 80, 2 frequency masking regions with a maximum width of 27, and 10 time masking regions with a maximum width of 100, along with a maximum masking fraction of 15% specifically for time masking. In our CR-CTC systems, we utilize a larger amount of time masking by increasing both the number of time masking regions and the maximum masking fraction by a factor of 2.5. Speed perturbation (Ko et al., 2015) with factors 0.9, 1.0 and 1.1 is applied... By default, we set α in Equation 3 to 0.2. As CR-CTC requires two forward passes during training, we train CR-CTC models with half the batch size and half the number of epochs compared to CTC models, ensuring a fair comparison in terms of training cost. ... For CTC and CR-CTC systems, we use prefix search decoding (Graves et al., 2006) with a beam size of 4. |
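The training objective quoted above (CTC losses on two differently augmented views of the same utterance, plus a consistency term weighted by α = 0.2) can be sketched as follows. This is a minimal NumPy illustration, not the paper's reference implementation: the helper names are hypothetical, and the choice of a frame-level symmetric KL divergence as the consistency term (and the omission of stop-gradient handling) is an assumption made for clarity.

```python
import numpy as np

def consistency_loss(log_p1, log_p2):
    """Symmetric KL divergence between two CTC posterior sequences,
    averaged over frames. log_p1, log_p2: (T, V) arrays of per-frame
    log-probabilities from the two augmented views."""
    p1, p2 = np.exp(log_p1), np.exp(log_p2)
    kl_12 = np.sum(p1 * (log_p1 - log_p2), axis=-1)  # KL(p1 || p2) per frame
    kl_21 = np.sum(p2 * (log_p2 - log_p1), axis=-1)  # KL(p2 || p1) per frame
    return 0.5 * np.mean(kl_12 + kl_21)

def cr_ctc_loss(ctc_loss_1, ctc_loss_2, log_p1, log_p2, alpha=0.2):
    """Combined objective: average of the two views' CTC losses
    plus alpha times the consistency-regularization term."""
    return 0.5 * (ctc_loss_1 + ctc_loss_2) + alpha * consistency_loss(log_p1, log_p2)
```

Because the two views share the encoder, each training step costs two forward passes, which is why the quoted setup halves the batch size and epoch count relative to plain CTC.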