Three-in-One: Fast and Accurate Transducer for Hybrid-Autoregressive ASR

Authors: Hainan Xu, Travis Bartley, Vladimir Bataev, Boris Ginsburg

ICLR 2025

Reproducibility Variable — Result — LLM Response
Research Type — Experimental — Experiments on multiple datasets across different languages demonstrate that HAINAN achieves efficiency parity with CTC in non-autoregressive mode and with TDT in autoregressive mode. In terms of accuracy, autoregressive HAINAN achieves parity with TDT and RNN-T, while non-autoregressive HAINAN significantly outperforms CTC.
Researcher Affiliation — Industry — Hainan Xu, Travis M. Bartley, Vladimir Bataev, Boris Ginsburg; NVIDIA Corp., USA
Pseudocode — Yes — Algorithm 1: Autoregressive Inference; Algorithm 2: Non-AR Inference; Algorithm 3: Semi-Autoregressive Inference of HAINAN Models; Algorithm 4: Viterbi Decoding of HAINAN Models
Open Source Code — No — The paper states only an intention, with no repository link provided: "We will open-source our implementation and release trained model checkpoints for public use."
Open Datasets — Yes — We train our English models on the combination of LibriSpeech (Panayotov et al., 2015), Mozilla Common Voice (Ardila et al., 2019), VoxPopuli (Wang et al., 2021), Fisher (Cieri et al., 2004), People's Speech (Galvez et al., 2021), Wall Street Journal (Paul & Baker, 1992), National Speech Corpus (Koh et al., 2019), VCTK (Yamagishi et al., 2019), Multilingual LibriSpeech (Pratap et al., 2020), and Europarl (Koehn, 2005) datasets, plus Suno AI datasets.
Dataset Splits — No — The paper uses several standard datasets for training and testing but does not explicitly provide the training/validation/test splits or the methodology used for splitting. It mentions that "validation performance degrades", implying validation sets are used, but gives no details on their creation or size.
Hardware Specification — Yes — Decoding time (seconds) is measured only on LibriSpeech test-other using batch=1 and beam=1, running on 2 A6000 GPUs.
Software Dependencies — Yes — All experiments are conducted using the NeMo (Kuchaiev et al., 2019) toolkit, version 1.23.0.
Experiment Setup — Yes — All models use Fast Conformer encoders whose first three layers each perform 2X subsampling, for 8X subsampling in total. Both TDT and HAINAN models use durations {0, 1, 2, ..., 7, 8}. A BPE tokenizer (Sennrich et al., 2016; Kudo & Richardson, 2018) with vocabulary size 1024 is used for text representation. For all experiments, models are trained until validation performance degrades (no more than 150k training steps), and the 5 best checkpoints are averaged to produce the final model for evaluation. The encoder uses the Fast Conformer-XXL architecture with 42 conformer blocks, each using 8-head self-attention with model hidden dimension 1024, totaling around 1.1B parameters. The convolutions in the conformer blocks use kernel size 9. For the standard RNN-T, TDT, and HAINAN models, the predictors are 2-layer LSTMs with hidden dimension 640. The joint network is a 2-layer feed-forward network with a ReLU in between and hidden dimension 1024.
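The joint-network dimensions reported above can be sketched in plain NumPy. This is a minimal illustration, not the authors' implementation: the variable names (enc_dim, pred_dim, vocab) are assumptions, as is the simplification of emitting vocabulary, blank, and the 9 duration values {0, ..., 8} as one logit vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions quoted in the paper; vocab size matches the BPE tokenizer.
enc_dim, pred_dim, hidden, vocab, n_durations = 1024, 640, 1024, 1024, 9

# Layer 1: combined encoder + predictor features -> hidden (ReLU in between).
W1 = rng.standard_normal((enc_dim + pred_dim, hidden)) * 0.01
b1 = np.zeros(hidden)
# Layer 2: hidden -> logits over vocab + blank + durations (a simplifying
# assumption for this sketch; the real head layout may differ).
W2 = rng.standard_normal((hidden, vocab + 1 + n_durations)) * 0.01
b2 = np.zeros(vocab + 1 + n_durations)

def joint(enc_t, pred_u):
    """Fuse one encoder frame and one predictor state into output logits."""
    x = np.concatenate([enc_t, pred_u])
    h = np.maximum(x @ W1 + b1, 0.0)  # ReLU between the two layers
    return h @ W2 + b2

logits = joint(rng.standard_normal(enc_dim), rng.standard_normal(pred_dim))
print(logits.shape)  # (1034,)
```

In autoregressive mode the predictor state depends on previously emitted tokens; in non-autoregressive mode HAINAN can score frames without that dependency, which is what enables the CTC-like decoding speed the paper reports.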