Joint PLDA for Simultaneous Modeling of Two Factors
Authors: Luciana Ferrer, Mitchell McLaren
JMLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show results on a multilingual speaker-verification task, where the language spoken is considered a nuisance condition. The proposed joint PLDA approach leads to significant performance gains in this task for two different data sets, in particular when the training data contains mostly or only monolingual speakers. ... 5. Experimental Setup ... 6. Method Comparison |
| Researcher Affiliation | Collaboration | Luciana Ferrer EMAIL Instituto de Investigación en Ciencias de la Computación (ICC) CONICET-Universidad de Buenos Aires Pabellón I, Ciudad Universitaria, 1428, Ciudad Autónoma de Buenos Aires, Argentina Mitchell McLaren EMAIL Speech Technology and Research Lab (STAR Lab) SRI International 333 Ravenswood Ave, Menlo Park, 94025, United States |
| Pseudocode | Yes | Algorithm 1 Smart EM initialization approach for JPLDA. |
| Open Source Code | No | In our code we use the formulation derived by Cumani et al. (2014), Equation (34). Note, though, that the last term in that equation should not be there (this mistake was confirmed by one coauthor of the paper). |
| Open Datasets | Yes | We show results and detailed analysis on two multilingual speaker recognition data sets, one composed of Mixer data (Cieri et al., 2007) from the speaker recognition evaluations organized by NIST and another that uses LASRS data (Beck et al., 2004). We evaluate two training scenarios, one using all available training data from the PRISM data set (Ferrer et al., 2011), which contains a small percentage of speakers speaking two different languages, and one where we subset the training data to contain only one language per speaker. ... The FULL training set is composed of: Switchboard Cellular Part 1 (Graff et al., 2001) and Cellular Part 2 (Graff et al., 2004), consisting of English cellphone conversations; Switchboard 2 Phase 2 (Graff et al., 1999) and Phase 3 (Graff et al., 2002) samples, consisting of English telephone conversations; and Mixer data (Cieri et al., 2007) from the 2004 to 2008 speaker recognition evaluations organized by the National Institute of Standards and Technology (NIST). |
| Dataset Splits | Yes | The final set of trials contains 11,522 target trials and 858,119 impostor trials. The LASRS trials for this work are created by enrolling with data from the first recorded session and testing on the second recorded session in each of the two spoken languages. We use only the conversational data from each session. This results in the same number of same-language and cross-language trials for a total of 848 target trials and 100,336 impostor trials for each of seven different microphones: a camcorder microphone (Cm); a desktop microphone (Dm); a studio microphone (Sm); an omnidirectional microphone (Om); a local telephone microphone (Tm); a remote telephone microphone (Tk); and a telephone earpiece (Ts). For this study, we only consider same-microphone trials for simplicity of analysis. ... The trials are created by selecting the same number of target and impostor same-language and cross-language trials such that the final set of trials is a balanced union of both types of trials. Further, the same-language trials are created as a balanced union of English versus non-English trials. |
| Hardware Specification | No | The paper mentions various recording devices (microphones like camcorder, desktop, studio, omnidirectional, telephone), but it does not specify the computational hardware (e.g., CPU, GPU models, memory) used for running the experiments or training the models. |
| Software Dependencies | No | The paper describes various algorithms and models (e.g., EM algorithm, GMM, DNN, PLDA), but it does not specify any particular software libraries, frameworks, or operating system versions used for their implementation or experiments. |
| Experiment Setup | Yes | For SPLDA, FPLDA and JPLDA, the LDA dimension is set to 400; no dimensionality reduction is done in these cases but the data is still transformed by the LDA matrix, centered and length normalized. For TPLDA, on the other hand, we use an LDA dimension of 200, because we found that this value gives significantly better performance than keeping the original dimension of 400. The speaker and language ranks for all experiments in this section are fixed to 200 and 16, respectively. ... Unless otherwise stated, all JPLDA results are obtained using P(HSC|HSS) = P(HSC|HDS) = 0.5. For TPLDA we use a diagonal matrix for the covariance of the noise model which proved to be slightly better than a full covariance. ... The first 20 mel-frequency cepstral coefficients (MFCCs) are extracted from the audio signal using a 25ms window every 10ms. ... This results in a feature vector of 60 dimensions ... The UBM is then estimated with an EM algorithm using the speech frames from a random subset of 10,000 samples ... The resulting 620-dimensional feature vector forms the input to a DNN that consists of two hidden layers of sizes 500 and 100. The output layer of the DNN consists of two nodes trained to predict the posteriors for the speech and non-speech classes. |
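The Experiment Setup row describes a standard i-vector preprocessing chain before PLDA scoring: project with an LDA matrix (dimension 400 for SPLDA/FPLDA/JPLDA, 200 for TPLDA), center, and length-normalize. A minimal sketch of that chain, assuming numpy and using a random stand-in for the LDA transform (the variable names and data here are illustrative, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: raw i-vector size and the 400-dim LDA
# projection used for SPLDA/FPLDA/JPLDA in the paper's setup.
n_train, raw_dim, lda_dim = 1000, 600, 400
train_ivectors = rng.standard_normal((n_train, raw_dim))

# Stand-in for an LDA transform estimated on labeled training data.
lda_matrix = rng.standard_normal((raw_dim, lda_dim))

def preprocess(ivectors, lda, mean):
    """Apply LDA projection, centering, and length normalization."""
    x = ivectors @ lda          # transform by the LDA matrix
    x = x - mean                # center using the training-set mean
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / norms            # unit-length vectors for PLDA scoring

# The mean is estimated on the (projected) training data and reused at test time.
train_mean = (train_ivectors @ lda_matrix).mean(axis=0)
processed = preprocess(train_ivectors, lda_matrix, train_mean)

print(processed.shape)                                       # (1000, 400)
print(np.allclose(np.linalg.norm(processed, axis=1), 1.0))   # True
```

Note that, per the quoted setup, LDA here is used only as a transform (no dimensionality reduction for the 400-dim case), so the same chain applies to enrollment and test vectors with the training mean held fixed.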