Three-in-One: Fast and Accurate Transducer for Hybrid-Autoregressive ASR

Authors: Hainan Xu, Travis Bartley, Vladimir Bataev, Boris Ginsburg

ICLR 2025

Reproducibility Variable — Result — LLM Response
Research Type — Experimental — Experiments on multiple datasets across different languages demonstrate that HAINAN achieves efficiency parity with CTC in non-autoregressive mode and with TDT in autoregressive mode. In terms of accuracy, autoregressive HAINAN achieves parity with TDT and RNN-T, while non-autoregressive HAINAN significantly outperforms CTC.
Researcher Affiliation — Industry — Hainan Xu, Travis M. Bartley, Vladimir Bataev, Boris Ginsburg; NVIDIA Corp., USA
Pseudocode — Yes — Algorithm 1: Autoregressive Inference; Algorithm 2: Non-AR Inference; Algorithm 3: Semi-Autoregressive Inference of HAINAN Models; Algorithm 4: Viterbi Decoding of HAINAN Models
Open Source Code — No — The paper states only an intention, with no repository link provided: "We will open-source our implementation and release trained model checkpoints for public use."
Open Datasets — Yes — We train our English models on the combination of LibriSpeech (Panayotov et al., 2015), Mozilla Common Voice (Ardila et al., 2019), VoxPopuli (Wang et al., 2021), Fisher (Cieri et al., 2004), People's Speech (Galvez et al., 2021), Wall Street Journal (Paul & Baker, 1992), National Speech Corpus (Koh et al., 2019), VCTK (Yamagishi et al., 2019), Multilingual LibriSpeech (Pratap et al., 2020), and Europarl (Koehn, 2005) datasets, plus Suno AI datasets.
Dataset Splits — No — The paper uses several standard datasets for training and testing but does not explicitly provide the training/validation/test splits or the methodology used for splitting. It mentions that "validation performance degrades", implying validation sets are used, but gives no details on their creation or size.
Hardware Specification — Yes — Decoding time (seconds) is measured only on LibriSpeech test-other using batch=1 and beam=1, running on 2 A6000 GPUs.
Software Dependencies — Yes — All experiments are conducted using the NeMo (Kuchaiev et al., 2019) toolkit, version 1.23.0.
Experiment Setup — Yes — All models use Fast Conformer encoders whose first three layers each perform 2X subsampling, for 8X subsampling in total. Both TDT and HAINAN models use durations {0, 1, 2, ..., 7, 8}. A BPE tokenizer (Sennrich et al., 2016; Kudo & Richardson, 2018) with vocabulary size 1024 is used for text representation. For all experiments, models are trained until validation performance degrades (no more than 150k training steps), and the 5 best checkpoints are averaged to produce the final model for evaluation. The encoder uses the Fast Conformer-XXL architecture with 42 conformer blocks, each using 8-head self-attention with model hidden dimension 1024, totaling around 1.1B parameters. The convolutions in the conformer blocks use kernel size 9. For the standard RNN-T, TDT, and HAINAN models, the predictors are 2-layer LSTMs with hidden dimension 640. The joint network is a 2-layer feed-forward network with a ReLU in between and hidden dimension 1024.
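The joint-network dimensions reported above can be sketched in plain NumPy. This is a minimal illustration, not the authors' implementation: the variable names (enc_dim, pred_dim, vocab) are assumptions, as is the simplification of emitting vocabulary, blank, and the 9 duration values {0, ..., 8} as one logit vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions quoted in the paper; vocab size matches the BPE tokenizer.
enc_dim, pred_dim, hidden, vocab, n_durations = 1024, 640, 1024, 1024, 9

# Layer 1: combined encoder + predictor features -> hidden (ReLU in between).
W1 = rng.standard_normal((enc_dim + pred_dim, hidden)) * 0.01
b1 = np.zeros(hidden)
# Layer 2: hidden -> logits over vocab + blank + durations (a simplifying
# assumption for this sketch; the real head layout may differ).
W2 = rng.standard_normal((hidden, vocab + 1 + n_durations)) * 0.01
b2 = np.zeros(vocab + 1 + n_durations)

def joint(enc_t, pred_u):
    """Fuse one encoder frame and one predictor state into output logits."""
    x = np.concatenate([enc_t, pred_u])
    h = np.maximum(x @ W1 + b1, 0.0)  # ReLU between the two layers
    return h @ W2 + b2

logits = joint(rng.standard_normal(enc_dim), rng.standard_normal(pred_dim))
print(logits.shape)  # (1034,)
```

In autoregressive mode the predictor state depends on previously emitted tokens; in non-autoregressive mode HAINAN can score frames without that dependency, which is what enables the CTC-like decoding speed the paper reports.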