OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models

Authors: William Chen, Jinchuan Tian, Yifan Peng, Brian Yan, Chao-Han Huck Yang, Shinji Watanabe

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We experiment with scaling in terms of both model and data size, and analyze the change in downstream ASR/ST performance. Through these investigations, we derive a neural scaling law to predict the change in model performance for each task and language. We also evaluate test-time capabilities of large-scale ASR/ST models, studying how new abilities emerge at scale and showing how speech model scaling can benefit new languages via in-context learning. Our contributions are summarized as follows:
- We open-source OWLS, a collection of 13 Whisper-style ASR/ST models trained on up to 360K hours of publicly available data and 150 languages. We will also release all model training code, training logs, and intermediate checkpoints.
- We train and release an OWLS model with 18B total parameters, which makes it the largest of all publicly known ASR/ST models and nearly double the size of prior work (Zheng et al., 2022).
- We systematically evaluate the effects of model and data scaling on ASR and ST, developing the first set of neural scaling laws for these tasks. We not only measure the usefulness of model scaling, but also identify failure cases that it is not able to overcome.
- We evaluate the test-time capabilities of frozen large-scale speech foundation models via in-context learning, and discover several new emergent abilities present in large models that are absent in smaller ones.
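The neural scaling law mentioned above can be illustrated with a small curve-fitting sketch. The data points and the exponent below are synthetic, chosen only to show the standard log-log least-squares fit of an error-vs-size power law; they are not the paper's measurements:

```python
import numpy as np

def fit_power_law(sizes, errors):
    """Least-squares fit of E = a * N**b in log-log space."""
    b, log_a = np.polyfit(np.log(sizes), np.log(errors), 1)
    return np.exp(log_a), b

# Synthetic illustration (NOT the paper's numbers): error rate that
# follows an exact power law in model size N, with exponent -0.15.
sizes = np.array([0.25e9, 0.5e9, 1e9, 2e9, 4e9, 9e9, 18e9])
errors = 30.0 * (sizes / 1e9) ** -0.15

a, b = fit_power_law(sizes, errors)
# The fit recovers the exponent, and a * N**b predicts the error at any N.
```

Once fitted on real WER/BLEU measurements per task and language, such a law extrapolates expected performance at unseen model or data sizes.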
Researcher Affiliation Collaboration William Chen 1 Jinchuan Tian 1 Yifan Peng 1 Brian Yan 1 Chao-Han Huck Yang 2 Shinji Watanabe 1 1Carnegie Mellon University 2NVIDIA. Correspondence to: William Chen <EMAIL>, Chao-Han Huck Yang <EMAIL>, Shinji Watanabe <EMAIL>.
Pseudocode No The paper describes the methodology and model architecture in detail, including mathematical equations for scaling laws and training details, but does not include any distinct pseudocode or algorithm blocks.
Open Source Code Yes We open-source OWLS, a collection of 13 Whisper-style ASR/ST models trained on up to 360K hours of publicly available data and 150 languages. We will also release all model training code, training logs, and intermediate checkpoints.
Open Datasets Yes We largely rely on the OWSM v3.2 (Tian et al., 2024) dataset for our experiments. It consists of 180K hours of ASR/ST data gathered across 25 public corpora, covering 150 unique languages. For our experiments on scaling up the training data size beyond 180K hours, we also include an additional 180K hours from a cleaned subset of YODAS (Li et al., 2023) from Peng et al. (2025), for a total of 360K hours. Note that this YODAS data is only used to train two models (OWLS 1B 360K and OWLS 18B v2). More details about the dataset can be found in Section A in the Appendix.
Dataset Splits Yes Multilingual ASR: To evaluate the multilingual performance of the OWLS models, we use the 102-language FLEURS test set (Conneau et al., 2022). H.1. Quechua Evaluation Quechua is a low-resource language indigenous to Peru and does not appear in any of the training data that we use. To perform the Quechua ICL evaluation, we use the IWSLT 2024 (Ahmad et al., 2024) version of the Siminchik corpus (Cardenas et al., 2018). We filter out all utterances longer than 7 seconds and split the corpus such that a speaker can only appear in the training or test set. We then further subsample the training set to 150 utterances to reduce compute costs.
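The speaker-disjoint split described for the Quechua evaluation can be sketched as follows. The dictionary field names (`speaker`, `duration`) and the split fraction are assumptions for illustration, not the paper's exact procedure:

```python
import random

def speaker_disjoint_split(utterances, test_frac=0.2, seed=0):
    """Split utterances so that no speaker appears in both train and test.

    Each utterance is assumed to be a dict with a "speaker" key
    (field name is an assumption; adapt to the corpus metadata).
    """
    speakers = sorted({u["speaker"] for u in utterances})
    rng = random.Random(seed)
    rng.shuffle(speakers)
    n_test = max(1, int(len(speakers) * test_frac))
    test_spk = set(speakers[:n_test])
    train = [u for u in utterances if u["speaker"] not in test_spk]
    test = [u for u in utterances if u["speaker"] in test_spk]
    return train, test
```

The paper's additional filtering (dropping utterances longer than 7 seconds, then subsampling the train side to 150 utterances) would be applied before and after this split, respectively.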
Hardware Specification Yes We use a mix of A100, H100, and GH200 GPUs for supervised training (Table 7).
Table 7. Training details for each model size.
Params.  Data Hrs.  GPU Type  GPUs/Node  Nodes  Days  Training GPU Hours
0.25B    180K       H100      8          2      3      1,164
0.50B    180K       H100      8          2      4      1,512
1B       11K        H100      8          2      6      2,232
1B       22K        H100      8          3      5      2,790
1B       45K        H100      8          3      5      2,790
1B       90K        H100      8          3      5      2,790
1B       180K       H100      8          3      5      2,790
1B       360K       H200      4          8      7      5,120
2B       180K       H100      8          2      7      2,520
4B       180K       H100      8          3      9      5,148
9B       180K       H100      8          3      15     8,472
18B      180K       H100      8          6      17     19,440
Software Dependencies Yes We use the Adam optimizer (Kingma & Ba, 2015) with a piecewise scheduler (Peng et al., 2024) that linearly warms up the learning rate from 0 to 5.0e-5 in the first 30K steps, 5.0e-5 to 2.0e-4 in the next 30K steps, and finally exponentially decays for the remaining training steps. For the hybrid CTC/attention (Watanabe et al., 2017b) training, we use a CTC weight of 0.3. We use bfloat16, Flash Attention 2 (Dao, 2024), and DeepSpeed ZeRO Stage-2 (Rasley et al., 2020; Rajbhandari et al., 2020) to improve training efficiency.
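The piecewise schedule quoted above can be sketched directly from its description. The two warmup phases and their endpoints come from the text; the exponential decay constant is not given in the excerpt, so the half-life below is an assumption for illustration:

```python
def piecewise_lr(step: int,
                 mid: float = 5.0e-5,
                 peak: float = 2.0e-4,
                 warmup1: int = 30_000,
                 warmup2: int = 30_000,
                 half_life: int = 100_000) -> float:
    """Piecewise LR: 0 -> mid over warmup1 steps, mid -> peak over the
    next warmup2 steps, then exponential decay from peak."""
    if step < warmup1:
        return mid * step / warmup1
    if step < warmup1 + warmup2:
        frac = (step - warmup1) / warmup2
        return mid + (peak - mid) * frac
    # Decay constant is NOT stated in the excerpt; half_life is assumed.
    return peak * 0.5 ** ((step - warmup1 - warmup2) / half_life)
```

With the paper's 675K total steps, most of training is spent in the decay phase.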
Experiment Setup Yes All models use a total effective batch size of 256 utterances and are trained for 675K steps. We use the Adam optimizer (Kingma & Ba, 2015) with a piecewise scheduler (Peng et al., 2024) that linearly warms up the learning rate from 0 to 5.0e-5 in the first 30K steps, 5.0e-5 to 2.0e-4 in the next 30K steps, and finally exponentially decays for the remaining training steps. For the hybrid CTC/attention (Watanabe et al., 2017b) training, we use a CTC weight of 0.3. We use bfloat16, Flash Attention 2 (Dao, 2024), and DeepSpeed ZeRO Stage-2 (Rasley et al., 2020; Rajbhandari et al., 2020) to improve training efficiency.
Table 8. Architecture hyper-parameter details for each model size.
Params.  Enc./Dec. Layers  Hidden Size  FFN Size  Attn. Heads
0.25B    8                 768          3072      16
0.50B    16                1024         4096      16
1B       32                1024         4096      16
2B       16                2048         8192      64
4B       36                2048         8192      64
9B       39                2816         11264     64
18B      64                3072         12288     64
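As a rough cross-check of Table 8 against the stated parameter counts, one can estimate an encoder-decoder Transformer's size from these hyper-parameters. The per-layer accounting below (self-attention plus FFN per encoder layer, an extra cross-attention per decoder layer) and the ~50K vocabulary are assumptions; frontend convolutions, layer norms, and biases are ignored:

```python
def approx_params(layers: int, d: int, ffn: int, vocab: int = 50_000) -> int:
    """Rough parameter count for a Transformer with `layers` encoder
    AND `layers` decoder layers, hidden size d, and FFN size ffn."""
    attn = 4 * d * d              # Q, K, V, O projections, each d x d
    ffn_p = 2 * d * ffn           # up- and down-projection
    enc_layer = attn + ffn_p
    dec_layer = 2 * attn + ffn_p  # decoder adds a cross-attention block
    return layers * (enc_layer + dec_layer) + vocab * d  # + embeddings
```

Under these assumptions the largest configuration (64 layers, hidden 3072, FFN 12288) lands near the stated 18B total, and the 1B row (32 layers, hidden 1024) likewise comes out close to 1B, suggesting the table lists layers per stack rather than a combined total.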