Three-in-One: Fast and Accurate Transducer for Hybrid-Autoregressive ASR
Authors: Hainan Xu, Travis Bartley, Vladimir Bataev, Boris Ginsburg
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on multiple datasets across different languages demonstrate that HAINAN achieves efficiency parity with CTC in non-autoregressive mode and with TDT in autoregressive mode. In terms of accuracy, autoregressive HAINAN achieves parity with TDT and RNN-T, while non-autoregressive HAINAN significantly outperforms CTC. |
| Researcher Affiliation | Industry | Hainan Xu, Travis M. Bartley, Vladimir Bataev, Boris Ginsburg, NVIDIA Corp., USA |
| Pseudocode | Yes | Algorithm 1 Autoregressive Inference; Algorithm 2 Non-AR Inference; Algorithm 3 Semi-Autoregressive Inference of HAINAN Models; Algorithm 4 Viterbi Decoding of HAINAN Models |
| Open Source Code | No | We will open-source our implementation and release trained model checkpoints for public use. |
| Open Datasets | Yes | We train our English models on the combination of Librispeech (Panayotov et al., 2015), Mozilla Common Voice (Ardila et al., 2019), Vox Populi (Wang et al., 2021), Fisher (Cieri et al., 2004), People's Speech (Galvez et al., 2021), Wall Street Journal (Paul & Baker, 1992), National Speech Corpus (Koh et al., 2019), VCTK (Yamagishi et al., 2019), Multilingual Librispeech (Pratap et al., 2020), Europarl (Koehn, 2005) datasets, plus Suno AI datasets. |
| Dataset Splits | No | The paper uses several standard datasets for training and testing but does not explicitly provide the training/validation/test splits or the methodology used to create them. It mentions that 'validation performance degrades', implying that validation sets are used, but gives no details on their creation or size. |
| Hardware Specification | Yes | Decoding time (seconds) is measured on only librispeech-test-other using batch=1 and beam=1, running on 2 A6000 GPUs. |
| Software Dependencies | Yes | All experiments are conducted using the NeMo (Kuchaiev et al., 2019) toolkit, version 1.23.0. |
| Experiment Setup | Yes | All models use Fast Conformer encoders with the first three layers each performing 2X subsampling, therefore 8X subsampling in total. Both TDT and HAINAN models use durations {0, 1, 2, ..., 7, 8}. A BPE tokenizer (Sennrich et al., 2016; Kudo & Richardson, 2018) of size 1024 is used for text representation. For all experiments, we let models train for sufficient steps until validation performance degrades (no more than 150k training steps), and run model averaging on the 5 best checkpoints to generate the final model for evaluation. The encoder uses the Fast Conformer-XXL architecture with 42 layers of conformer blocks, each of which uses 8 heads of self-attention with model hidden dimension = 1024, totaling around 1.1B parameters. The convolutions in the conformers use kernel size = 9. For standard RNN-T, TDT and HAINAN models, their predictors consist of 2-layer LSTMs with hidden dimension = 640. The joint network is a 2-layer feed-forward network with ReLU in between and with hidden dimension 1024. |
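The setup row above states that the first three encoder layers each perform 2X subsampling, giving 8X total. A minimal sketch of that frame-length arithmetic (the function name and the ceiling-division convention are illustrative assumptions, not from the paper):

```python
def subsampled_length(num_frames: int, factor_per_layer: int = 2, num_layers: int = 3) -> int:
    """Frame count after successive subsampling layers.

    Each layer divides the sequence length by `factor_per_layer`
    (ceiling division, so no frames are silently dropped).
    With the defaults (three 2X layers), the total reduction is 8X.
    """
    length = num_frames
    for _ in range(num_layers):
        length = -(-length // factor_per_layer)  # ceiling division
    return length

# e.g. a 10-second utterance at 100 frames/sec:
# 1000 frames -> 500 -> 250 -> 125 encoder frames after 8X subsampling
```

This reduced frame rate is what lets the non-autoregressive and autoregressive decoding modes share the same compact encoder output.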