Scaling Speech Technology to 1,000+ Languages
Authors: Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that our multilingual speech recognition model more than halves the word error rate of Whisper on 54 languages of the FLEURS benchmark while being trained on a small fraction of the labeled data. We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, as well as a language identification model for 4,017 languages. |
| Researcher Affiliation | Industry | Core Team. Corresponding Authors: EMAIL. JPMorgan Chase (work done while at Meta AI). Apple (work done while at Meta AI). OpenAI (work done while at Meta AI). |
| Pseudocode | Yes | Appendix A illustrates the algorithm and an implementation is available as part of TorchAudio (Yang et al., 2021). Algorithm 1: Pseudo code of our CTC Forced Alignment algorithm on GPU |
| Open Source Code | Yes | The MMS models and tooling for data pre-processing are available at https://github.com/pytorch/fairseq/tree/master/examples/mms. A full list of the languages supported is available at https://github.com/facebookresearch/fairseq/tree/main/examples/mms. This model is available at https://github.com/facebookresearch/fairseq/tree/main/examples/mms. An implementation is available as part of TorchAudio (Yang et al., 2021): https://github.com/pytorch/audio |
| Open Datasets | Yes | Our work leverages two new datasets to expand the language coverage of speech technology. In this section, we first detail how we create a labeled dataset which includes speech audio paired with corresponding text in 1,107 languages (MMS-lab; 44.7K hours; §3.1). Second, we discuss the creation of an unlabeled dataset for which we only have audio recordings and no corresponding text. This dataset spans 3,809 languages (MMS-unlab; 7.7K total hours; §3.2). The MMS-lab dataset is based on recordings of people reading the New Testament in different languages. ... Specifically, we obtain data from Faith Comes By Hearing, goto.bible and bible.com. The data source for this dataset is Global Recordings Network (https://globalrecordings.net/), which provides recordings of Bible stories, evangelistic messages, scripture readings, and songs in more than 6,255 languages and dialects. MMS-lab-U: 1,362 languages comprising 55K hours (§3.1). Multilingual LibriSpeech (MLS): 8 European languages of read books totaling 50K hours (Pratap et al., 2020c). Common Voice (CV): 89 languages totaling 8.8K hours of read text; we use v9.0 of the corpus (Ardila et al., 2020). VoxLingua-107 (VL): 107 languages totaling 5.3K hours of YouTube content (Valk and Alumäe, 2020). BABEL (BBL): 17 African and Asian languages totaling about 1K hours of conversational telephone data (Gales et al., 2014). VoxPopuli (VP): 371K hours of unlabeled speech data in 23 languages derived from European Parliament event recordings (Wang et al., 2021). |
| Dataset Splits | Yes | Concretely, we use the book Mark (MRK) as development set, the book John (JHN) as test set and the remaining books for training. For the 147 recordings where not all 260 chapters are available, we deviate from this and make a best effort split by books depending on which books are available in the respective recording. In this case, we aim to have at least 10% of all available data in the development set and the test set each, or at most two hours of data in each set, whichever is less. The final dataset contains 44.7K hours of paired speech data where we use 36.8K hours for training (82.3%), 3.5K hours for development (7.8%) and 4.4K hours for testing (9.9%). Finally, we split the samples of each language randomly into 80% training data, 10% development data, and 10% test data. |
| Hardware Specification | Yes | All models were pre-trained for a total of one million updates on A100 GPUs with 80GB of memory. The MMS (0.3B) model was trained with an effective batch size of 2.3 hours of data across 48 GPUs and the MMS (1B) model was trained with an effective batch size of 3.5 hours on 64 GPUs. We train models for a total of 50K updates with a batch size of 0.8 hours of data using 16 A100 GPUs with 80GB of memory. We train each model for 100K steps using eight V100 GPUs with a batch size of 64 per GPU. |
| Software Dependencies | No | We largely follow prior work in training cross-lingual wav2vec 2.0 models (Conneau et al., 2020a; Babu et al., 2022) and use the wav2vec 2.0 implementation available in fairseq (Ott et al., 2019) to train models with roughly 300M and 1B parameters (Table 2). To make efficient use of GPU memory, we use a fully sharded backend (Rajbhandari et al., 2021) as well as activation checkpointing (Chen et al., 2016) implemented in FairScale (Baines et al., 2021). Our models are optimized with Adam (Kingma and Ba, 2015). We train 5-gram language models on Common Crawl data using KenLM (Heafield, 2011) for each language in FLEURS. We use the CTC beam-search decoder from the Flashlight (Kahn et al., 2022) library for decoding our models. |
| Experiment Setup | Yes | Our models are optimized with Adam (Kingma and Ba, 2015) and the learning rate is warmed up for the first 32K steps followed by polynomial decay to zero for the remainder of training. Training audio sequences are cropped to a maximum of 320K samples, or 20 seconds, and all models were pre-trained for a total of one million updates on A100 GPUs with 80GB of memory. The MMS (0.3B) model was trained with an effective batch size of 2.3 hours of data across 48 GPUs and the MMS (1B) model was trained with an effective batch size of 3.5 hours on 64 GPUs. We use Adam (Kingma and Ba, 2015) with exponential decay rates β₁ = 0.9, β₂ = 0.98 to train model weights using a tri-stage schedule where the learning rate is warmed up for the first 10% of updates, held constant for the next 40% of updates, and then decayed in the final 50% of updates. We experimented with different learning rates (1×10⁻⁴, 7×10⁻⁴, 3×10⁻⁴, 1×10⁻⁵, 7×10⁻⁶, 3×10⁻⁶, 1×10⁻⁶) and numbers of updates (50K, 100K, 200K, 300K). Unless otherwise mentioned, we fine-tune models for a total of 50K updates with a batch size of 0.8 hours of data using 16 A100 GPUs with 80GB of memory. We train models with Adam with exponential decay rates β₁ = 0.9 and β₂ = 0.98 and a tri-stage learning rate schedule where the learning rate is warmed up for the first 10% of updates, held constant for the next 40% and then linearly decayed for the remainder of training (Kingma and Ba, 2015). During development, we experiment with different hyper-parameters and perform final model selection based on development set accuracy. We experiment with different learning rates (1×10⁻⁵, 3×10⁻⁵, 3×10⁻⁶, 5×10⁻⁶, 7×10⁻⁶), training updates (10K, 20K, 30K, 40K, 50K) and batch sizes (1.5min, 3min, 6min). We train models on 16 GPUs. We train each model for 100K steps using eight V100 GPUs with a batch size of 64 per GPU. |
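The fine-tuning setup quoted above uses a tri-stage learning-rate schedule: warm up for the first 10% of updates, hold constant for the next 40%, then decay over the final 50%. A minimal sketch of such a schedule, assuming linear warmup from zero and linear decay to zero (the excerpts say only "decayed" / "linearly decayed"; fairseq's `tri_stage` scheduler instead decays exponentially to a configurable floor):

```python
def tri_stage_lr(step: int, total_steps: int, peak_lr: float) -> float:
    """Tri-stage schedule: 10% warmup, 40% hold, 50% decay.

    The stage proportions come from the quoted setup; linear warmup
    from zero and linear decay to zero are assumptions for this sketch.
    """
    warmup_end = int(0.10 * total_steps)
    hold_end = warmup_end + int(0.40 * total_steps)
    if step < warmup_end:
        # Stage 1: linear warmup toward the peak learning rate.
        return peak_lr * step / max(1, warmup_end)
    if step < hold_end:
        # Stage 2: hold at the peak learning rate.
        return peak_lr
    # Stage 3: linear decay to zero over the remaining updates.
    decay_steps = max(1, total_steps - hold_end)
    return peak_lr * max(0.0, 1.0 - (step - hold_end) / decay_steps)
```

For example, with 50K total updates and a peak rate of 1×10⁻⁵ (one of the grid values quoted above), the rate would ramp up over the first 5K updates, stay flat until update 25K, and reach zero at update 50K.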