Joint PLDA for Simultaneous Modeling of Two Factors
Authors: Luciana Ferrer, Mitchell McLaren
JMLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show results on a multilingual speaker-verification task, where the language spoken is considered a nuisance condition. The proposed joint PLDA approach leads to significant performance gains in this task for two different data sets, in particular when the training data contains mostly or only monolingual speakers. ... 5. Experimental Setup ... 6. Method Comparison |
| Researcher Affiliation | Collaboration | Luciana Ferrer EMAIL Instituto de Investigación en Ciencias de la Computación (ICC) CONICET-Universidad de Buenos Aires Pabellón I, Ciudad Universitaria, 1428, Ciudad Autónoma de Buenos Aires, Argentina Mitchell McLaren EMAIL Speech Technology and Research Lab (STAR Lab) SRI International 333 Ravenswood Ave, Menlo Park, 94025, United States |
| Pseudocode | Yes | Algorithm 1 Smart EM initialization approach for JPLDA. |
| Open Source Code | No | In our code we use the formulation derived by Cumani et al. (2014), Equation (34). Note, though, that the last term in that equation should not be there (this mistake was confirmed by one coauthor of the paper). |
| Open Datasets | Yes | We show results and detailed analysis on two multilingual speaker recognition data sets, one composed of Mixer data (Cieri et al., 2007) from the speaker recognition evaluations organized by NIST and another that uses LASRS data (Beck et al., 2004). We evaluate two training scenarios, one using all available training data from the PRISM data set (Ferrer et al., 2011), which contains a small percentage of speakers speaking two different languages, and one where we subset the training data to contain only one language per speaker. ... The FULL training set is composed of: Switchboard Cellular Part 1 (Graff et al., 2001) and Cellular Part 2 (Graff et al., 2004), consisting of English cellphone conversations; Switchboard 2 Phase 2 (Graff et al., 1999) and Phase 3 (Graff et al., 2002) samples, consisting of English telephone conversations; and Mixer data (Cieri et al., 2007) from the 2004 to 2008 speaker recognition evaluations organized by the National Institute of Standards and Technology (NIST). |
| Dataset Splits | Yes | The final set of trials contains 11,522 target trials and 858,119 impostor trials. The LASRS trials for this work are created by enrolling with data from the first recorded session and testing on the second recorded session in each of the two spoken languages. We use only the conversational data from each session. This results in the same number of same-language and cross-language trials for a total of 848 target trials and 100,336 impostor trials for each of seven different microphones: a camcorder microphone (Cm); a desktop microphone (Dm); a studio microphone (Sm); an omnidirectional microphone (Om); a local telephone microphone (Tm); a remote telephone microphone (Tk); and a telephone earpiece (Ts). For this study, we only consider same-microphone trials for simplicity of analysis. ... The trials are created by selecting the same number of target and impostor same-language and cross-language trials such that the final set of trials is a balanced union of both types of trials. Further, the same-language trials are created as a balanced union of English versus non-English trials. |
| Hardware Specification | No | The paper mentions various recording devices (microphones like camcorder, desktop, studio, omnidirectional, telephone), but it does not specify the computational hardware (e.g., CPU, GPU models, memory) used for running the experiments or training the models. |
| Software Dependencies | No | The paper describes various algorithms and models (e.g., EM algorithm, GMM, DNN, PLDA), but it does not specify any particular software libraries, frameworks, or operating system versions used for their implementation or experiments. |
| Experiment Setup | Yes | For SPLDA, FPLDA and JPLDA, the LDA dimension is set to 400; no dimensionality reduction is done in these cases but the data is still transformed by the LDA matrix, centered and length normalized. For TPLDA, on the other hand, we use an LDA dimension of 200, because we found that this value gives significantly better performance than keeping the original dimension of 400. The speaker and language ranks for all experiments in this section are fixed to 200 and 16, respectively. ... Unless otherwise stated, all JPLDA results are obtained using P(HSC|HSS) = P(HSC|HDS) = 0.5. For TPLDA we use a diagonal matrix for the covariance of the noise model which proved to be slightly better than a full covariance. ... The first 20 mel-frequency cepstral coefficients (MFCCs) are extracted from the audio signal using a 25ms window every 10ms. ... This results in a feature vector of 60 dimensions ... The UBM is then estimated with an EM algorithm using the speech frames from a random subset of 10,000 samples ... The resulting 620-dimensional feature vector forms the input to a DNN that consists of two hidden layers of sizes 500 and 100. The output layer of the DNN consists of two nodes trained to predict the posteriors for the speech and non-speech classes. |
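The Experiment Setup row describes a standard i-vector preprocessing chain before PLDA scoring: project with an LDA matrix (dimension 400 for SPLDA/FPLDA/JPLDA, 200 for TPLDA), center, and length-normalize. A minimal sketch of that chain, assuming numpy and using a random stand-in for the LDA transform (the variable names and data here are illustrative, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: raw i-vector size and the 400-dim LDA
# projection used for SPLDA/FPLDA/JPLDA in the paper's setup.
n_train, raw_dim, lda_dim = 1000, 600, 400
train_ivectors = rng.standard_normal((n_train, raw_dim))

# Stand-in for an LDA transform estimated on labeled training data.
lda_matrix = rng.standard_normal((raw_dim, lda_dim))

def preprocess(ivectors, lda, mean):
    """Apply LDA projection, centering, and length normalization."""
    x = ivectors @ lda          # transform by the LDA matrix
    x = x - mean                # center using the training-set mean
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / norms            # unit-length vectors for PLDA scoring

# The mean is estimated on the (projected) training data and reused at test time.
train_mean = (train_ivectors @ lda_matrix).mean(axis=0)
processed = preprocess(train_ivectors, lda_matrix, train_mean)

print(processed.shape)                                       # (1000, 400)
print(np.allclose(np.linalg.norm(processed, axis=1), 1.0))   # True
```

Note that, per the quoted setup, LDA here is used only as a transform (no dimensionality reduction for the 400-dim case), so the same chain applies to enrollment and test vectors with the training mean held fixed.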