Kernel Approximation Methods for Speech Recognition

Authors: Avner May, Alireza Bagheri Garakani, Zhiyun Lu, Dong Guo, Kuan Liu, Aurélien Bellet, Linxi Fan, Michael Collins, Daniel Hsu, Brian Kingsbury, Michael Picheny, Fei Sha

JMLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We study the performance of kernel methods on the acoustic modeling task for automatic speech recognition, and compare their performance to deep neural networks (DNNs). ... Leveraging these three methods, the kernel methods attain token error rates between 0.5% better and 0.1% worse than fully-connected DNNs across four speech recognition data sets, including the TIMIT and Broadcast News benchmark tasks. ... In Section 6, we report extensive experiments comparing DNNs and kernel methods, including results using the methods discussed above.
Researcher Affiliation | Collaboration | (1) Department of Computer Science, Stanford University, Stanford, CA 94305, USA; (2) Department of Computer Science, University of Southern California, Los Angeles, CA 90089, USA; (3) INRIA, 40 Avenue Halley, 59650 Villeneuve d'Ascq, France; (4) Department of Computer Science, Columbia University, New York, NY 10027, USA; (5) Google Inc, USA; (6) IBM Research AI, Yorktown Heights, NY 10598, USA
Pseudocode | Yes | Algorithm 1: Random feature selection
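The referenced pseudocode is not reproduced in this review. As a rough illustration of what "random feature selection" can look like, here is a minimal sketch of one plausible reading: draw candidate random Fourier features for a Gaussian kernel, fit a linear model, and keep the candidates whose learned weight vectors have the largest norms. All function and variable names are illustrative stand-ins, not the authors' code, and the paper's Algorithm 1 may differ in detail (e.g., it uses SGD on subsamples rather than a closed-form fit).

```python
import numpy as np

def select_random_features(X, y_onehot, n_select, n_candidates, sigma, rng):
    """Score candidate random Fourier features by the norm of the weights a
    linear model assigns them, and keep the top-scoring ones. Sketch only."""
    d = X.shape[1]
    # Draw candidate Gaussian random projections (bandwidth sigma) and phases.
    W = rng.normal(scale=1.0 / sigma, size=(d, n_candidates))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_candidates)
    Z = np.cos(X @ W + b)  # candidate random features, shape (n, n_candidates)
    # Fit a ridge-regularized linear model in closed form (stand-in for SGD).
    theta = np.linalg.solve(Z.T @ Z + 1e-3 * np.eye(n_candidates),
                            Z.T @ y_onehot)
    scores = np.linalg.norm(theta, axis=1)  # one score per candidate feature
    keep = np.argsort(scores)[-n_select:]   # indices of the selected features
    return W[:, keep], b[keep]
```

In the paper this selection step is iterated, with the kept features carried forward and new candidates drawn each round.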
Open Source Code | No | The paper does not provide concrete access to source code. It mentions "All our training code is written in MATLAB" but does not provide a link or an explicit statement about making the code publicly available.
Open Datasets | Yes | We use the IARPA Babel Program Cantonese (IARPA-babel101-v0.4c) and Bengali (IARPA-babel103b-v0.4b) limited language packs, a 50-hour subset of Broadcast News (BN-50) (Kingsbury, 2009), and TIMIT (Garofolo et al., 1993).
Dataset Splits | Yes | Each data set is partitioned in four: a training set, a heldout set, a development set, and a test set. We use the heldout set to tune the hyperparameters of our training procedure (e.g., the learning rate). The Bengali and Cantonese Babel data sets both include training and development sets of approximately 20 hours, and an approximately 5 hour test set. We designate about 10% of the training data as a heldout set. ... For the Broadcast News data set, we use 45 hours of audio for training, and 5 hours as a heldout set. For the development set, we use the EARS Dev-04f data set (as described by Kingsbury, 2009), which consists of approximately three hours of broadcast news from various news shows. We use the DARPA EARS RT-03 English Broadcast News Evaluation Set (Fiscus et al., 2003) as our test set, consisting of 72 five-minute conversations. ... The TIMIT data set contains recordings of 630 speakers, of various English dialects, each reciting ten sentences, for a total of 5.4 hours of speech. The training set (from which the heldout set is then taken) consists of data from 462 speakers each reciting 8 sentences (SI and SX sentences). The development set consists of speech from 50 speakers. For evaluation, we use the core test set, which consists of 192 utterances total from 24 speakers (SA sentences are excluded). Table 2: Data set details. We report the size of each data set partition in terms of the number of hours of speech, and in terms of the number of acoustic frames (in parentheses).
Hardware Specification | No | "All our training code is written in MATLAB, leveraging its GPU features, and executed on Amazon EC2 machines." The paper does not specify the exact GPU model, CPU type, or particular EC2 instance type used.
Software Dependencies | No | "All our training code is written in MATLAB." The paper does not provide specific version numbers for MATLAB or any other software libraries or dependencies used.
Experiment Setup | Yes | These kernel models typically have three hyperparameters: the kernel bandwidth (σ for the Gaussian kernels, λ for the Laplacian kernel; see Table 1), the number of random projections, and the initial learning rate. We try various numbers of random features, ranging from 5,000 to 400,000. ... The sparse Gaussian kernel additionally has the hyperparameter k, which specifies the sparsity of each random projection vector ω_i. For all experiments, we use k = 5. For all DNNs, we tune hyperparameters related to both the architecture and the optimization. This includes the number of layers, the number of hidden units in each layer, and the learning rate. ... We find that four hidden layers is generally the best setting for our DNNs... Additionally, all our DNNs use the tanh activation function. We vary the number of hidden units per layer (1000, 2000, or 4000). ... We use minibatches of size 250 or 256 samples during training... This method divides the learning rate in half at the end of an SGD epoch if the heldout cross-entropy doesn't improve by at least 1%; additionally, if the heldout cross-entropy gets worse, it reverts the model back to its state at the beginning of the epoch. ... We terminate training once we have divided the learning rate 10 times. ... We use bottlenecks of size 1000, 250, 250, and 100 for BN-50, Bengali, Cantonese, and TIMIT, respectively. ... We initialize our DNN parameters uniformly at random within [−sqrt(6/(d_in + d_out)), sqrt(6/(d_in + d_out))]. For our kernel models, we initialize the random projection matrix as discussed in Section 3, and we initialize the parameter matrix Θ as the zero matrix. ... For each iteration of random feature selection, we draw a random subsample of the training data of size R = 10^6 (except when D < 10^5, in which case we use R = 2 × 10^6), but ultimately we use all N training examples once the random features are selected. ...
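The learning-rate schedule quoted above (halve on less than 1% heldout improvement, halve and revert on a regression, stop after 10 halvings) is specified precisely enough to sketch. In this illustrative Python version, `train_one_epoch` and `heldout_xent` are hypothetical stand-ins for the authors' MATLAB training and evaluation routines:

```python
import copy

def train_with_halving(model, train_one_epoch, heldout_xent,
                       lr=1.0, max_halvings=10):
    """Halve the learning rate when heldout cross-entropy fails to improve
    by at least 1%; revert the model if it got worse. Sketch only."""
    best_xent = float("inf")
    halvings = 0
    while halvings < max_halvings:
        snapshot = copy.deepcopy(model)   # state at the start of the epoch
        model = train_one_epoch(model, lr)
        xent = heldout_xent(model)
        if xent > best_xent:              # got worse: revert and halve
            model = snapshot
            lr /= 2.0
            halvings += 1
        elif xent > 0.99 * best_xent:     # improved by less than 1%: halve
            best_xent = xent
            lr /= 2.0
            halvings += 1
        else:                             # improved by at least 1%: continue
            best_xent = xent
    return model
```

Note that a regression both halves the rate and restores the snapshot, so training terminates once the rate has been halved `max_halvings` times regardless of which branch triggered each halving.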
We use T = 50 iterations of feature selection, and in iteration t we select s_t = t · (D/T) = 0.02Dt random features.
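As a quick sanity check of that schedule (illustrative arithmetic only, with an example feature budget D): with T = 50, the cumulative selection count s_t = t · (D/T) grows linearly from 2% of D in the first iteration to the full budget D in the last.

```python
T = 50         # feature-selection iterations, from the paper
D = 100_000    # total random-feature budget; example value, not from the paper
s = [t * D // T for t in range(1, T + 1)]   # s_t = t * (D / T) = 0.02 * D * t
assert s[0] == int(0.02 * D)   # first iteration: 2% of the budget
assert s[-1] == D              # final iteration: all D features selected
```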