Open-Source Conversational AI with SpeechBrain 1.0

Authors: Mirco Ravanelli, Titouan Parcollet, Adel Moumen, Sylvain de Langen, Cem Subakan, Peter Plantinga, Yingzhi Wang, Pooneh Mousavi, Luca Della Libera, Artem Ploujnikov, Francesco Paissan, Davide Borra, Salah Zaiem, Zeyu Zhao, Shucong Zhang, Georgios Karakasidis, Sung-Lin Yeh, Pierre Champion, Aku Rouhe, Rudolf Braun, Florian Mai, Juan Zuluaga-Gomez, Seyed Mahed Mousavi, Andreas Nautsch, Ha Nguyen, Xuechen Liu, Sangeet Sagar, Jarod Duret, Salima Mdhaffar, Gaëlle Laperrière, Mickael Rouvier, Renato De Mori, Yannick Estève

JMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper presents SpeechBrain 1.0, a significant milestone in the evolution of the toolkit, which now has over 200 recipes for speech, audio, and language processing tasks, and more than 100 models available on Hugging Face. SpeechBrain 1.0 introduces new technologies to support diverse learning modalities, Large Language Model (LLM) integration, and advanced decoding strategies, along with novel models, tasks, and modalities. It also includes a new benchmark repository, offering researchers a unified platform for evaluating models across diverse tasks.
Researcher Affiliation | Collaboration | 1 Concordia University, 2 Mila-Quebec AI Institute, 3 Avignon University, 4 Samsung AI Center Cambridge, 5 Université de Montréal, 6 University of Cambridge, 7 Laval University, 8 Zaion, 9 Fondazione Bruno Kessler, 10 University of Bologna, 11 Telecom Paris, 12 University of Edinburgh, 13 Inria, 14 Aalto University, 15 University of Trento, 16 Saarland University, 17 National Institute of Informatics Tokyo, 18 Silo AI, 19 KU Leuven, 20 Idiap, 21 EPFL, 22 McGill University
Pseudocode | No | The paper describes methods and procedures in narrative text, but does not include any clearly labeled pseudocode blocks or algorithms formatted as such.
Open Source Code | Yes | SpeechBrain[1] is an open-source Conversational AI toolkit based on PyTorch, focused particularly on speech processing tasks such as speech recognition, speech enhancement, speaker recognition, text-to-speech, and much more. It promotes transparency and replicability by releasing both the pre-trained models and the complete recipes of code and algorithms required for training them. (...) [1] https://speechbrain.github.io/
Open Datasets | Yes | Moreover, 95% of our recipes utilize freely available data and include comprehensive training logs, checkpoints, and other essential information. (...) For the EEG modality, we rely on two key dependencies: MOABB (Aristimunha et al., 2024) and MNE (Gramfort et al., 2014).
Dataset Splits | No | First, users need to specify the data for training, validation, and testing using CSV or JSON files. These formats are supported because they allow flexible and intuitive declaration of input files and annotations.
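The quoted passage describes declaring train/validation/test data through CSV or JSON manifests. As a hedged illustration, the sketch below writes and reads back a minimal JSON manifest of the general shape such files take (utterance ID mapped to audio path, duration, and transcript); the field names here are illustrative assumptions, not a fixed schema from the paper.

```python
import json
import os
import tempfile

# Hypothetical minimal JSON data manifest: one entry per utterance.
# Field names ("wav", "duration", "words") are illustrative only.
manifest = {
    "utt_0001": {"wav": "data/utt_0001.wav", "duration": 3.2, "words": "hello world"},
    "utt_0002": {"wav": "data/utt_0002.wav", "duration": 1.9, "words": "open source"},
}

# A separate file would be declared for each split (train/valid/test).
path = os.path.join(tempfile.mkdtemp(), "train.json")
with open(path, "w") as f:
    json.dump(manifest, f, indent=2)

# Reading the manifest back, as a training script would before building its dataset.
with open(path) as f:
    loaded = json.load(f)
print(len(loaded))  # number of utterances declared for this split
```

The same information could equally be laid out as CSV rows with one column per field; JSON simply makes nested or optional annotations easier to express.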
Hardware Specification | Yes | This enabled efficient training of a 1-billion-parameter SSL model for French on 14,000 hours of speech using over 100 A100 GPUs, showcasing the scalability of SpeechBrain (Parcollet et al., 2024).
Software Dependencies | Yes | SpeechBrain is an open-source Conversational AI toolkit based on PyTorch (...) Torchaudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch.
Experiment Setup | No | Next, users must design a model and define its hyperparameters using a modified YAML format known as HyperPyYAML. This format facilitates complex yet elegant parameter configurations, defining objects and their associated arguments. (...) The improvement primarily originated from a more robust data augmentation strategy and a more careful selection of the training hyperparameters.
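The quote above says HyperPyYAML lets a YAML file define "objects and their associated arguments" — e.g. a tag like `!new:some.module.SomeClass` instantiates a Python class from a dotted path, with its constructor arguments listed beneath it. The stdlib-only sketch below illustrates that idea; it is a toy re-implementation for intuition, not the actual HyperPyYAML code.

```python
import importlib

# Toy illustration of the idea behind a HyperPyYAML "!new:" tag:
# a line such as `counter: !new:collections.Counter` names a class by its
# dotted path, which is resolved and instantiated with the given arguments.
def new_tag(dotted_path, *args, **kwargs):
    """Resolve a dotted class path and instantiate it (illustrative only)."""
    module_name, _, class_name = dotted_path.rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(*args, **kwargs)

# Instantiating a stdlib class the same way a config line would.
counter = new_tag("collections.Counter", "speechbrain")
print(counter["e"])  # 'e' occurs twice in "speechbrain"
```

Because the configuration names classes rather than hard-coding them, swapping a model or optimizer becomes a one-line edit to the YAML file instead of a code change.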