Heuristic-free Knowledge Distillation for Streaming ASR via Multi-modal Training

Authors: Ji Won Yoon

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments are conducted on the Libri Speech (Panayotov et al. 2015) benchmark, utilizing two ASR architectures: connectionist temporal classification (CTC) (Graves et al. 2006) and hybrid Transducer-CTC (Noroozi et al. 2024), since both CTC and Transducer (Graves 2012) models have gained favor in real-world streaming settings recently (Wang et al. 2023; Tian et al. 2023a,b; Yao et al. 2021; Zhang et al. 2022; Yu et al. 2021a). Compared to existing KD methods that rely on a powerful offline teacher and the shifting parameter τ, the proposed KD significantly improves the student s performance, achieving the best results in all configurations. It is important to note that Heuristicfree KD does not require any extra teacher model and time shifting by τ. In a detailed case study, we also confirm that the proposed teacher can generate more accurate knowledge for KD while preserving the alignment of the student.
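The contrast above, between conventional KD that shifts the offline teacher's posteriors by a tuned τ and the proposed heuristic-free variant that needs no shift, can be illustrated with a minimal frame-level distillation sketch. This is not the paper's implementation; the function name and the KL-divergence form are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def frame_kd_loss(student_logits, teacher_logits, tau=0):
    # Frame-level KL distillation between teacher and student posteriors.
    # tau > 0 shifts the teacher forward in time, the heuristic that
    # conventional offline-teacher KD tunes; tau = 0 corresponds to the
    # heuristic-free setting evaluated in the paper.
    if tau > 0:
        student_logits = student_logits[:, :-tau]  # drop trailing frames
        teacher_logits = teacher_logits[:, tau:]   # align shifted frames
    log_p_student = F.log_softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```

With `tau=0` the loss compares teacher and student frame by frame, which is only viable when the teacher's alignment matches the student's, the property the proposed multi-modal teacher is designed to provide.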
Researcher Affiliation: Academia. Ji Won Yoon, Department of AI, Chung-Ang University, Seoul, South Korea. EMAIL
Pseudocode: No. The paper describes methods using text and mathematical equations, such as F_enc(x) → z and F_dec(z) → y, but does not contain any explicit pseudocode blocks or algorithms.
Open Source Code: No. The paper states: "Our experiments were mainly conducted with the NeMo (Kuchaiev et al. 2019) toolkit". This refers to a third-party toolkit used by the authors, not their own source code for the methodology described in the paper. There is no explicit statement or link indicating that the authors' implementation is open-source or publicly available.
Open Datasets: Yes. We evaluated the performance of models using the LibriSpeech (Panayotov et al. 2015) benchmark, the most widely used ASR dataset, which is freely available under the CC BY 4.0 license. During training, we employed train-clean-100, train-clean-360, and train-other-500. For evaluations, we utilized dev-clean, dev-other, test-clean, and test-other. The experimental results on the Common Voice 7.0 Spanish dataset (Ardila et al. 2020) can be found in the extended Appendix.
Dataset Splits: Yes. During training, we employed train-clean-100, train-clean-360, and train-other-500. For evaluations, we utilized dev-clean, dev-other, test-clean, and test-other.
Hardware Specification: No. The paper mentions "GPU resources" generally but does not specify any particular GPU models, CPU models, or other detailed hardware specifications used for running the experiments.
Software Dependencies: No. "Our experiments were mainly conducted with the NeMo (Kuchaiev et al. 2019) toolkit." The paper mentions the toolkit but does not provide specific version numbers for NeMo or any other software libraries or dependencies.
Experiment Setup: Yes. When training the proposed framework, we considered one tunable parameter λ in Eq. (4). As shown in Figure 5, we evaluated the WER performance while varying λ over {0.500, 0.250, 0.667, 0.125}. From the results, it is verified that the best WER performance on LibriSpeech was obtained when λ = 0.250. Two streaming settings were considered: look-aheads of 1040 ms and 480 ms, with corresponding right context sizes of 13 and 6, respectively.
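The setup above can be sketched as follows. The convex-combination form of Eq. (4) and the 80 ms frame stride are assumptions inferred from the reported numbers (1040 ms → 13 and 480 ms → 6 both imply one right-context frame per 80 ms), not statements quoted from the paper.

```python
def combined_loss(kd_loss, asr_loss, lam):
    # Assumed form of Eq. (4): a convex combination of the distillation
    # loss and the standard ASR loss, weighted by the single tunable lambda.
    return lam * kd_loss + (1.0 - lam) * asr_loss

def right_context_frames(lookahead_ms, frame_stride_ms=80):
    # Maps a streaming look-ahead (ms) to a right-context size; the 80 ms
    # stride is inferred from the paper's 1040 ms -> 13 and 480 ms -> 6.
    return lookahead_ms // frame_stride_ms

# The lambda sweep evaluated in Figure 5 (best WER reported at 0.250):
for lam in (0.500, 0.250, 0.667, 0.125):
    total = combined_loss(kd_loss=1.0, asr_loss=1.0, lam=lam)
```

Under this reading, sweeping λ trades off how strongly the teacher's knowledge shapes training relative to the student's own ASR objective, with λ = 0.250 weighting the ASR loss three times as heavily as the KD loss.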