Scaling Speech Technology to 1,000+ Languages
Authors: Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that our multilingual speech recognition model more than halves the word error rate of Whisper on 54 languages of the FLEURS benchmark while being trained on a small fraction of the labeled data. We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, as well as a language identification model for 4,017 languages. |
| Researcher Affiliation | Industry | Core Team. Corresponding Authors: EMAIL. JPMorgan Chase (work done while at Meta AI). Apple (work done while at Meta AI). OpenAI (work done while at Meta AI). |
| Pseudocode | Yes | Appendix A illustrates the algorithm and an implementation is available as part of TorchAudio (Yang et al., 2021). Algorithm 1: Pseudo code of our CTC Forced Alignment algorithm on GPU |
| Open Source Code | Yes | The MMS models and tooling for data pre-processing are available at https://github.com/pytorch/fairseq/tree/master/examples/mms. A full list of the languages supported is available at https://github.com/facebookresearch/fairseq/tree/main/examples/mms. This model is available at https://github.com/facebookresearch/fairseq/tree/main/examples/mms. An implementation is available as part of TorchAudio (Yang et al., 2021): https://github.com/pytorch/audio |
| Open Datasets | Yes | Our work leverages two new datasets to expand the language coverage of speech technology. In this section, we first detail how we create a labeled dataset which includes speech audio paired with corresponding text in 1,107 languages (MMS-lab; 44.7K hours; §3.1). Second, we discuss the creation of an unlabeled dataset for which we only have audio recordings and no corresponding text. This dataset spans 3,809 languages (MMS-unlab; 7.7K total hours; §3.2). The MMS-lab dataset is based on recordings of people reading the New Testament in different languages. ... Specifically, we obtain data from Faith Comes By Hearing, goto.bible and bible.com. The data source for this dataset is Global Recordings Network (https://globalrecordings.net/), which provides recordings of Bible stories, evangelistic messages, scripture readings, and songs in more than 6,255 languages and dialects. MMS-lab-U: 1,362 languages comprising 55K hours (§3.1). Multilingual LibriSpeech (MLS): 8 European languages of read books totaling 50K hours (Pratap et al., 2020c). Common Voice (CV): 89 languages totaling 8.8K hours of read text; we use v9.0 of the corpus (Ardila et al., 2020). VoxLingua-107 (VL): 107 languages totaling 5.3K hours of YouTube content (Valk and Alumäe, 2020). BABEL (BBL): 17 African and Asian languages totaling about 1K hours of conversational telephone data (Gales et al., 2014). VoxPopuli (VP): 371K hours of unlabeled speech data in 23 languages derived from European Parliament event recordings (Wang et al., 2021). |
| Dataset Splits | Yes | Concretely, we use the book Mark (MRK) as development set, the book John (JHN) as test set and the remaining books for training. For the 147 recordings where not all 260 chapters are available, we deviate from this and make a best effort split by books depending on which books are available in the respective recording. In this case, we aim to have at least 10% of all available data in the development set and the test set each, or at most two hours of data in each set, whichever is less. The final dataset contains 44.7K hours of paired speech data where we use 36.8K hours for training (82.3%), 3.5K hours for development (7.8%) and 4.4K hours for testing (9.9%). Finally, we split the samples of each language randomly into 80% training data, 10% development data, and 10% test data. |
| Hardware Specification | Yes | All models were pre-trained for a total of one million updates on A100 GPUs with 80GB of memory. The MMS (0.3B) model was trained with an effective batch size of 2.3 hours of data across 48 GPUs and the MMS (1B) model was trained with an effective batch size of 3.5 hours on 64 GPUs. We train models for a total of 50K updates with a batch size of 0.8 hours of data using 16 A100 GPUs with 80GB of memory. We train each model for 100K steps using eight V100 GPUs with a batch size of 64 per GPU. |
| Software Dependencies | No | We largely follow prior work in training cross-lingual wav2vec 2.0 models (Conneau et al., 2020a; Babu et al., 2022) and use the wav2vec 2.0 implementation available in fairseq (Ott et al., 2019) to train models with roughly 300M and 1B parameters (Table 2). To make efficient use of GPU memory, we use a fully sharded backend (Rajbhandari et al., 2021) as well as activation checkpointing (Chen et al., 2016) implemented in FairScale (Baines et al., 2021). Our models are optimized with Adam (Kingma and Ba, 2015). We train 5-gram language models on Common Crawl data using KenLM (Heafield, 2011) for each language in FLEURS. We use the CTC beam-search decoder from the Flashlight (Kahn et al., 2022) library for decoding our models. |
| Experiment Setup | Yes | Our models are optimized with Adam (Kingma and Ba, 2015) and the learning rate is warmed up for the first 32K steps followed by polynomial decay to zero for the remainder of training. Training audio sequences are cropped to a maximum of 320K samples, or 20 seconds, and all models were pre-trained for a total of one million updates on A100 GPUs with 80GB of memory. The MMS (0.3B) model was trained with an effective batch size of 2.3 hours of data across 48 GPUs and the MMS (1B) model was trained with an effective batch size of 3.5 hours on 64 GPUs. We use Adam (Kingma and Ba, 2015) with exponential decay rates β₁ = 0.9, β₂ = 0.98 to train model weights using a tri-stage schedule where the learning rate is warmed up for the first 10% of updates, held constant for the next 40% of updates, and then decayed in the final 50% of updates. We experimented with different learning rates (1×10⁻⁴, 7×10⁻⁴, 3×10⁻⁴, 1×10⁻⁵, 7×10⁻⁶, 3×10⁻⁶, 1×10⁻⁶) and numbers of updates (50K, 100K, 200K, 300K). Unless otherwise mentioned, we fine-tune models for a total of 50K updates with a batch size of 0.8 hours of data using 16 A100 GPUs with 80GB of memory. We train models with Adam with exponential decay rates β₁ = 0.9 and β₂ = 0.98 and a tri-stage learning rate schedule where the learning rate is warmed up for the first 10% of updates, held constant for the next 40% and then linearly decayed for the remainder of training (Kingma and Ba, 2015). During development, we experiment with different hyper-parameters and perform final model selection based on development set accuracy. We experiment with different learning rates (1×10⁻⁵, 3×10⁻⁵, 3×10⁻⁶, 5×10⁻⁶, 7×10⁻⁶), training updates (10K, 20K, 30K, 40K, 50K) and batch sizes (1.5min, 3min, 6min). We train models on 16 GPUs. We train each model for 100K steps using eight V100 GPUs with a batch size of 64 per GPU. |
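The fine-tuning setup quoted above uses a tri-stage learning-rate schedule: warm up for the first 10% of updates, hold constant for the next 40%, then decay over the final 50%. A minimal sketch of such a schedule, assuming linear warmup from zero and linear decay to zero (the excerpts say only "decayed" / "linearly decayed"; fairseq's `tri_stage` scheduler instead decays exponentially to a configurable floor):

```python
def tri_stage_lr(step: int, total_steps: int, peak_lr: float) -> float:
    """Tri-stage schedule: 10% warmup, 40% hold, 50% decay.

    The stage proportions come from the quoted setup; linear warmup
    from zero and linear decay to zero are assumptions for this sketch.
    """
    warmup_end = int(0.10 * total_steps)
    hold_end = warmup_end + int(0.40 * total_steps)
    if step < warmup_end:
        # Stage 1: linear warmup toward the peak learning rate.
        return peak_lr * step / max(1, warmup_end)
    if step < hold_end:
        # Stage 2: hold at the peak learning rate.
        return peak_lr
    # Stage 3: linear decay to zero over the remaining updates.
    decay_steps = max(1, total_steps - hold_end)
    return peak_lr * max(0.0, 1.0 - (step - hold_end) / decay_steps)
```

For example, with 50K total updates and a peak rate of 1×10⁻⁵ (one of the grid values quoted above), the rate would ramp up over the first 5K updates, stay flat until update 25K, and reach zero at update 50K.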