Heuristic-free Knowledge Distillation for Streaming ASR via Multi-modal Training

Authors: Ji Won Yoon

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments are conducted on the Libri Speech (Panayotov et al. 2015) benchmark, utilizing two ASR architectures: connectionist temporal classification (CTC) (Graves et al. 2006) and hybrid Transducer-CTC (Noroozi et al. 2024), since both CTC and Transducer (Graves 2012) models have gained favor in real-world streaming settings recently (Wang et al. 2023; Tian et al. 2023a,b; Yao et al. 2021; Zhang et al. 2022; Yu et al. 2021a). Compared to existing KD methods that rely on a powerful offline teacher and the shifting parameter τ, the proposed KD significantly improves the student s performance, achieving the best results in all configurations. It is important to note that Heuristicfree KD does not require any extra teacher model and time shifting by τ. In a detailed case study, we also confirm that the proposed teacher can generate more accurate knowledge for KD while preserving the alignment of the student.
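The contrast above, between conventional KD that shifts the offline teacher's posteriors by a tuned τ and the proposed heuristic-free variant that needs no shift, can be illustrated with a minimal frame-level distillation sketch. This is not the paper's implementation; the function name and the KL-divergence form are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def frame_kd_loss(student_logits, teacher_logits, tau=0):
    # Frame-level KL distillation between teacher and student posteriors.
    # tau > 0 shifts the teacher forward in time, the heuristic that
    # conventional offline-teacher KD tunes; tau = 0 corresponds to the
    # heuristic-free setting evaluated in the paper.
    if tau > 0:
        student_logits = student_logits[:, :-tau]  # drop trailing frames
        teacher_logits = teacher_logits[:, tau:]   # align shifted frames
    log_p_student = F.log_softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```

With `tau=0` the loss compares teacher and student frame by frame, which is only viable when the teacher's alignment matches the student's, the property the proposed multi-modal teacher is designed to provide.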
Researcher Affiliation: Academia. Ji Won Yoon, Department of AI, Chung-Ang University, Seoul, South Korea. EMAIL
Pseudocode: No. The paper describes methods using text and mathematical equations, such as F_enc(x) → z and F_dec(z) → y, but does not contain any explicit pseudocode blocks or algorithms.
Open Source Code: No. The paper states: "Our experiments were mainly conducted with the NeMo (Kuchaiev et al. 2019) toolkit". This refers to a third-party toolkit used by the authors, not their own source code for the methodology described in the paper. There is no explicit statement or link indicating that the authors' implementation is open-source or publicly available.
Open Datasets: Yes. We evaluated the performance of models using the LibriSpeech (Panayotov et al. 2015) benchmark, the most widely used ASR dataset, which is freely available under the CC BY 4.0 license. During training, we employed train-clean-100, train-clean-360, and train-other-500. For evaluations, we utilized dev-clean, dev-other, test-clean, and test-other. The experimental results on the Common Voice 7.0 Spanish dataset (Ardila et al. 2020) can be found in the extended Appendix.
Dataset Splits: Yes. During training, we employed train-clean-100, train-clean-360, and train-other-500. For evaluations, we utilized dev-clean, dev-other, test-clean, and test-other.
Hardware Specification: No. The paper mentions "GPU resources" generally but does not specify any particular GPU models, CPU models, or other detailed hardware specifications used for running the experiments.
Software Dependencies: No. "Our experiments were mainly conducted with the NeMo (Kuchaiev et al. 2019) toolkit." The paper mentions the toolkit but does not provide specific version numbers for NeMo or any other software libraries or dependencies.
Experiment Setup: Yes. When training the proposed framework, we considered one tunable parameter λ in Eq. (4). As shown in Figure 5, we evaluated the WER performance while varying λ over {0.500, 0.250, 0.667, 0.125}. From the results, it is verified that the best WER performance on LibriSpeech was obtained when λ = 0.250. Two streaming settings were considered: look-aheads of 1040 ms and 480 ms, with corresponding right context sizes of 13 and 6, respectively.
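The setup above can be sketched as follows. The convex-combination form of Eq. (4) and the 80 ms frame stride are assumptions inferred from the reported numbers (1040 ms → 13 and 480 ms → 6 both imply one right-context frame per 80 ms), not statements quoted from the paper.

```python
def combined_loss(kd_loss, asr_loss, lam):
    # Assumed form of Eq. (4): a convex combination of the distillation
    # loss and the standard ASR loss, weighted by the single tunable lambda.
    return lam * kd_loss + (1.0 - lam) * asr_loss

def right_context_frames(lookahead_ms, frame_stride_ms=80):
    # Maps a streaming look-ahead (ms) to a right-context size; the 80 ms
    # stride is inferred from the paper's 1040 ms -> 13 and 480 ms -> 6.
    return lookahead_ms // frame_stride_ms

# The lambda sweep evaluated in Figure 5 (best WER reported at 0.250):
for lam in (0.500, 0.250, 0.667, 0.125):
    total = combined_loss(kd_loss=1.0, asr_loss=1.0, lam=lam)
```

Under this reading, sweeping λ trades off how strongly the teacher's knowledge shapes training relative to the student's own ASR objective, with λ = 0.250 weighting the ASR loss three times as heavily as the KD loss.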