OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models

Authors: William Chen, Jinchuan Tian, Yifan Peng, Brian Yan, Chao-Han Huck Yang, Shinji Watanabe

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We experiment with scaling in terms of both model and data size, and analyze the change in downstream ASR/ST performance. Through these investigations, we derive a neural scaling law to predict the change in model performance for each task and language. We also evaluate test-time capabilities of large-scale ASR/ST models, studying how new abilities emerge at scale and showing how speech model scaling can benefit new languages via in-context learning. Our contributions are summarized as follows:
- We open-source OWLS, a collection of 13 Whisper-style ASR/ST models trained on up to 360K hours of publicly available data and 150 languages. We will also release all model training code, training logs, and intermediate checkpoints.
- We train and release an OWLS model with 18B total parameters, which makes it the largest of all publicly known ASR/ST models and nearly double the size of prior work (Zheng et al., 2022).
- We systematically evaluate the effects of model and data scaling on ASR and ST, developing the first set of neural scaling laws for these tasks. We not only measure the usefulness of model scaling, but also identify failure cases that it is not able to overcome.
- We evaluate the test-time capabilities of frozen large-scale speech foundation models via in-context learning, and discover several new emergent abilities present in large models that are absent in smaller ones.
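The neural scaling law mentioned above can be illustrated with a small curve-fitting sketch. The data points and the exponent below are synthetic, chosen only to show the standard log-log least-squares fit of an error-vs-size power law; they are not the paper's measurements:

```python
import numpy as np

def fit_power_law(sizes, errors):
    """Least-squares fit of E = a * N**b in log-log space."""
    b, log_a = np.polyfit(np.log(sizes), np.log(errors), 1)
    return np.exp(log_a), b

# Synthetic illustration (NOT the paper's numbers): error rate that
# follows an exact power law in model size N, with exponent -0.15.
sizes = np.array([0.25e9, 0.5e9, 1e9, 2e9, 4e9, 9e9, 18e9])
errors = 30.0 * (sizes / 1e9) ** -0.15

a, b = fit_power_law(sizes, errors)
# The fit recovers the exponent, and a * N**b predicts the error at any N.
```

Once fitted on real WER/BLEU measurements per task and language, such a law extrapolates expected performance at unseen model or data sizes.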
Researcher Affiliation Collaboration William Chen 1 Jinchuan Tian 1 Yifan Peng 1 Brian Yan 1 Chao-Han Huck Yang 2 Shinji Watanabe 1 1Carnegie Mellon University 2NVIDIA. Correspondence to: William Chen <EMAIL>, Chao-Han Huck Yang <EMAIL>, Shinji Watanabe <EMAIL>.
Pseudocode No The paper describes the methodology and model architecture in detail, including mathematical equations for scaling laws and training details, but does not include any distinct pseudocode or algorithm blocks.
Open Source Code Yes We open-source OWLS, a collection of 13 Whisper-style ASR/ST models trained on up to 360K hours of publicly available data and 150 languages. We will also release all model training code, training logs, and intermediate checkpoints.
Open Datasets Yes We largely rely on the OWSM v3.2 (Tian et al., 2024) dataset for our experiments. It consists of 180K hours of ASR/ST data gathered across 25 public corpora, covering 150 unique languages. For our experiments on scaling up the training data size beyond 180K hours, we also include an additional 180K hours from a cleaned subset of YODAS (Li et al., 2023) from Peng et al. (2025), for a total of 360K hours. Note that this YODAS data is only used to train two models (OWLS 1B 360K and OWLS 18B v2). More details about the dataset can be found in Section A in the Appendix.
Dataset Splits Yes Multilingual ASR: To evaluate the multilingual performance of the OWLS models, we use the 102-language FLEURS test set (Conneau et al., 2022). H.1. Quechua Evaluation Quechua is a low-resource language indigenous to Peru and does not appear in any of the training data that we use. To perform the Quechua ICL evaluation, we use the IWSLT 2024 (Ahmad et al., 2024) version of the Siminchik corpus (Cardenas et al., 2018). We filter out all utterances longer than 7 seconds and split the corpus such that a speaker can only appear in the training or test set. We then further subsample the training set to 150 utterances to reduce compute costs.
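The speaker-disjoint split described for the Quechua evaluation can be sketched as follows. The dictionary field names (`speaker`, `duration`) and the split fraction are assumptions for illustration, not the paper's exact procedure:

```python
import random

def speaker_disjoint_split(utterances, test_frac=0.2, seed=0):
    """Split utterances so that no speaker appears in both train and test.

    Each utterance is assumed to be a dict with a "speaker" key
    (field name is an assumption; adapt to the corpus metadata).
    """
    speakers = sorted({u["speaker"] for u in utterances})
    rng = random.Random(seed)
    rng.shuffle(speakers)
    n_test = max(1, int(len(speakers) * test_frac))
    test_spk = set(speakers[:n_test])
    train = [u for u in utterances if u["speaker"] not in test_spk]
    test = [u for u in utterances if u["speaker"] in test_spk]
    return train, test
```

The paper's additional filtering (dropping utterances longer than 7 seconds, then subsampling the train side to 150 utterances) would be applied before and after this split, respectively.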
Hardware Specification Yes We use a mix of A100, H100, and GH200 GPUs for supervised training (Table 7).
Table 7. Training details for each model size.
Params.  Data Hrs.  GPU Type  GPUs/Node  Nodes  Days  Training GPU Hours
0.25B    180K       H100      8          2      3      1,164
0.50B    180K       H100      8          2      4      1,512
1B       11K        H100      8          2      6      2,232
1B       22K        H100      8          3      5      2,790
1B       45K        H100      8          3      5      2,790
1B       90K        H100      8          3      5      2,790
1B       180K       H100      8          3      5      2,790
1B       360K       H200      4          8      7      5,120
2B       180K       H100      8          2      7      2,520
4B       180K       H100      8          3      9      5,148
9B       180K       H100      8          3      15     8,472
18B      180K       H100      8          6      17     19,440
Software Dependencies Yes We use the Adam optimizer (Kingma & Ba, 2015) with a piecewise scheduler (Peng et al., 2024) that linearly warms up the learning rate from 0 to 5.0e-5 in the first 30K steps, 5.0e-5 to 2.0e-4 in the next 30K steps, and finally exponentially decays for the remaining training steps. For the hybrid CTC/attention (Watanabe et al., 2017b) training, we use a CTC weight of 0.3. We use bfloat16, Flash Attention 2 (Dao, 2024), and DeepSpeed ZeRO Stage-2 (Rasley et al., 2020; Rajbhandari et al., 2020) to improve training efficiency.
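The piecewise schedule quoted above can be sketched directly from its description. The two warmup phases and their endpoints come from the text; the exponential decay constant is not given in the excerpt, so the half-life below is an assumption for illustration:

```python
def piecewise_lr(step: int,
                 mid: float = 5.0e-5,
                 peak: float = 2.0e-4,
                 warmup1: int = 30_000,
                 warmup2: int = 30_000,
                 half_life: int = 100_000) -> float:
    """Piecewise LR: 0 -> mid over warmup1 steps, mid -> peak over the
    next warmup2 steps, then exponential decay from peak."""
    if step < warmup1:
        return mid * step / warmup1
    if step < warmup1 + warmup2:
        frac = (step - warmup1) / warmup2
        return mid + (peak - mid) * frac
    # Decay constant is NOT stated in the excerpt; half_life is assumed.
    return peak * 0.5 ** ((step - warmup1 - warmup2) / half_life)
```

With the paper's 675K total steps, most of training is spent in the decay phase.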
Experiment Setup Yes All models use a total effective batch size of 256 utterances and are trained for 675K steps. We use the Adam optimizer (Kingma & Ba, 2015) with a piecewise scheduler (Peng et al., 2024) that linearly warms up the learning rate from 0 to 5.0e-5 in the first 30K steps, 5.0e-5 to 2.0e-4 in the next 30K steps, and finally exponentially decays for the remaining training steps. For the hybrid CTC/attention (Watanabe et al., 2017b) training, we use a CTC weight of 0.3. We use bfloat16, Flash Attention 2 (Dao, 2024), and DeepSpeed ZeRO Stage-2 (Rasley et al., 2020; Rajbhandari et al., 2020) to improve training efficiency.
Table 8. Architecture hyper-parameter details for each model size.
Params.  Enc./Dec. Layers  Hidden Size  FFN Size  Attn. Heads
0.25B    8                 768          3072      16
0.50B    16                1024         4096      16
1B       32                1024         4096      16
2B       16                2048         8192      64
4B       36                2048         8192      64
9B       39                2816         11264     64
18B      64                3072         12288     64
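As a rough cross-check of Table 8 against the stated parameter counts, one can estimate an encoder-decoder Transformer's size from these hyper-parameters. The per-layer accounting below (self-attention plus FFN per encoder layer, an extra cross-attention per decoder layer) and the ~50K vocabulary are assumptions; frontend convolutions, layer norms, and biases are ignored:

```python
def approx_params(layers: int, d: int, ffn: int, vocab: int = 50_000) -> int:
    """Rough parameter count for a Transformer with `layers` encoder
    AND `layers` decoder layers, hidden size d, and FFN size ffn."""
    attn = 4 * d * d              # Q, K, V, O projections, each d x d
    ffn_p = 2 * d * ffn           # up- and down-projection
    enc_layer = attn + ffn_p
    dec_layer = 2 * attn + ffn_p  # decoder adds a cross-attention block
    return layers * (enc_layer + dec_layer) + vocab * d  # + embeddings
```

Under these assumptions the largest configuration (64 layers, hidden 3072, FFN 12288) lands near the stated 18B total, and the 1B row (32 layers, hidden 1024) likewise comes out close to 1B, suggesting the table lists layers per stack rather than a combined total.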