Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Speech Robust Bench: A Robustness Benchmark For Speech Recognition

Authors: Muhammad Shah, David Solans Noguero, Mikko Heikkilä, Bhiksha Raj, Nicolas Kourtellis

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We use SRB to evaluate the robustness of several state-of-the-art ASR models and observe that model size and certain modeling choices, such as the use of discrete representations or self-training, appear to be conducive to robustness. We extend this analysis to measure the robustness of ASR models on data from various demographic subgroups, namely English and Spanish speakers, and males and females. Our results revealed noticeable disparities in the models' robustness across subgroups.
Researcher Affiliation Collaboration 1 Carnegie Mellon University, 2 Telefonica Research, 3 University of Helsinki
Pseudocode Yes Algorithm 1 Utterance Agnostic Attack Algorithm
Open Source Code Yes To facilitate out-of-the-box robustness evaluations for the community, we have publicly released a large dataset containing perturbed versions of LibriSpeech (Panayotov et al., 2015) test-clean, Spanish, French, and German test sets of Multilingual LibriSpeech (Pratap et al., 2020), as well as accented speech from Common Voice, and segmented near- and far-field audios from CHiME-6 (Reddy et al., 2020) and AMI (Kraaij et al., 2005). We release our code with clear documentation to enable reproducibility and extensibility. Code: https://github.com/ahmedshah1494/speech_robust_bench
Open Datasets Yes To facilitate out-of-the-box robustness evaluations for the community, we have publicly released a large dataset containing perturbed versions of LibriSpeech (Panayotov et al., 2015) test-clean, Spanish, French, and German test sets of Multilingual LibriSpeech (Pratap et al., 2020), as well as accented speech from Common Voice, and segmented near- and far-field audios from CHiME-6 (Reddy et al., 2020) and AMI (Kraaij et al., 2005). Data: https://huggingface.co/datasets/mshah1/speech_robust_bench_public
Dataset Splits Yes For LibriSpeech, we use 500 utterances from test-dev as X_dev and test-clean as X_test. For TEDLIUM, we use the full dev and test sets as X_dev and X_test. For Multilingual LibriSpeech, we use 500 utterances from the dev set in the relevant language as X_dev and the full test set of the same language as X_test.
Hardware Specification Yes The experiments were performed on the Bridges-2 cluster at the Pittsburgh Supercomputing Center. This cluster contains 200 Nvidia V100 GPUs (32 GB and 16 GB variants), which were used for these experiments.
Software Dependencies No The paper mentions software tools like 'torchaudio', 'SoX', and 'robust_speech package (Olivier & Raj, 2022)', but does not provide specific version numbers for these components to ensure reproducibility of the ancillary software environment.
Experiment Setup Yes Concretely, we add real environmental noise from ESC-50 (Piczak, 2015), MS-SNSD (Reddy et al., 2019), MUSAN (Snyder et al., 2015) and WHAM! (Wichern et al., 2019) at Signal-to-Noise Ratios (SNR) of 10, 20, 30 and 40 dB. To simulate spatial acoustics, we add echo via SoX and simulate Room Impulse Response (RIR) via convolution with real and simulated RIRs from Ko et al. (2017). ... To find δ, we follow Madry et al. (2018) and use projected gradient descent to solve max_{δ : SNR(δ, x) ≥ ε} L(M(x + δ), y*), where L is a differentiable loss function, like CTC-Loss, between the model's output M(x + δ) and the true transcript y*, with ε ∈ [10, 40]. ... Table 2: The parameters defining the various severity levels of the perturbations used in the proposed benchmark.
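The two numerical operations quoted above can be sketched in a few lines of numpy: mixing noise into clean speech at a target SNR, and the projection step that keeps an adversarial perturbation δ inside the SNR(δ, x) ≥ ε constraint during PGD. This is a minimal illustration, not the paper's released code; the function names and the repeat-to-length handling of short noise clips are our own assumptions.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db_target):
    """Add `noise` to `clean` scaled so the mixture has the requested SNR (in dB)."""
    noise = np.resize(noise, clean.shape)  # repeat/truncate noise to match length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Solve p_clean / (scale^2 * p_noise) = 10^(snr/10) for scale.
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db_target / 10)))
    return clean + scale * noise

def snr_db(clean, perturbed):
    """SNR of a perturbed signal relative to the clean reference, in dB."""
    delta = perturbed - clean
    return 10 * np.log10(np.mean(clean ** 2) / np.mean(delta ** 2))

def project_snr(clean, delta, snr_floor_db):
    """Shrink an adversarial perturbation so that SNR(delta, clean) >= snr_floor_db.

    This is the projection step of PGD under the SNR constraint: the perturbation
    power may not exceed p_clean / 10^(eps/10).
    """
    max_power = np.mean(clean ** 2) / 10 ** (snr_floor_db / 10)
    power = np.mean(delta ** 2)
    if power > max_power:
        delta = delta * np.sqrt(max_power / power)
    return delta
```

In a full PGD attack, `project_snr` would be applied after each gradient-ascent step on the loss L(M(x + δ), y*), so the perturbation stays within the ε SNR ball while the loss is maximized.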