LAMA-UT: Language Agnostic Multilingual ASR Through Orthography Unification and Language-Specific Transliteration
Authors: Sangmin Lee, Woojin Chung, Hong-Goo Kang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments, we demonstrate the effectiveness of our proposed method leveraging universal transcriptions for massively multilingual ASR. Our pipeline achieves a relative error reduction rate of 45% when compared to Whisper and performs comparably to MMS, despite being trained on only 0.1% of Whisper's training data. |
| Researcher Affiliation | Academia | Sangmin Lee¹, Woojin Chung¹, Hong-Goo Kang¹* — ¹Dept. of Electrical & Electronic Engineering, Yonsei University, South Korea. EMAIL, EMAIL |
| Pseudocode | No | The paper describes the proposed method in prose and through diagrams (Figure 1), but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions 'https://github.com/sanghyang00/LAMA-UT-Appendices' in a footnote, stating 'the end-to-end performance comparison between IPA-based and Romanization-based LAMA-UT, and the results are shown in the appendices'. This link hosts appendices and supplementary results, not a release of source code for the methodology described in the paper. |
| Open Datasets | Yes | FLEURS (Conneau et al. 2022) is a multilingual speech corpus encompassing 102 languages. It provides a relatively small amount of data per language (approximately 12 hours) while ensuring an unbiased distribution of data across the languages. Given our focus on demonstrating effective multilingual ASR with minimal data, we utilize the FLEURS and its official splits for experiments. Common Voice. Common Voice (Ardila et al. 2020) is a multilingual speech dataset crowdsourced from speakers of various languages. For unseen languages, we leverage the official test split of 25 languages from Common Voice 17.0, which offers sufficient samples for evaluation. |
| Dataset Splits | Yes | Given our focus on demonstrating effective multilingual ASR with minimal data, we utilize the FLEURS and its official splits for experiments. Common Voice. Common Voice (Ardila et al. 2020) is a multilingual speech dataset crowdsourced from speakers of various languages. For unseen languages, we leverage the official test split of 25 languages from Common Voice 17.0, which offers sufficient samples for evaluation. |
| Hardware Specification | Yes | Finally, the entire training pipeline was conducted on two RTX-3090 GPUs with 24GB of VRAM each, and we leveraged gradient accumulation techniques to address memory issues. |
| Software Dependencies | No | We utilized the Python library Uroman (Hermjakob, May, and Knight 2018) to obtain Romanized transcription and Phonemizer (Bernard and Titeux 2021) for IPA transcription. For Japanese, we employed Pykakasi (TAKAHASHI 1992) due to the limitation of Uroman, which treats Japanese kanji as Chinese characters. ... We selected wav2vec2.0-XLSR (Babu et al. 2021) with 1 billion parameters... We utilized LLaMA3-8B, 4-bit quantized LLaMA3-70B (Touvron et al. 2023), and GPT-4o-mini (OpenAI 2024) as the universal converter. We leveraged a beam search decoder from flashlight (Kahn et al. 2022)... The paper lists specific libraries and models, but does not provide version numbers for software dependencies such as Uroman, Phonemizer, or Pykakasi. While LLaMA3-8B, LLaMA3-70B, and GPT-4o-mini are specific model versions, they are not general software dependencies with precise version numbers for reproducibility. |
| Experiment Setup | Yes | We performed fine-tuning on all layers except the feature extractor for 3,000 steps with a CTC loss and a batch size of 128. We bypassed the two-stage fine-tuning pipeline from prior studies (Xu, Baevski, and Auli 2021; Pratap et al. 2024) because our distinct methodology, which used a smaller dataset, caused the divided fine-tuning approach to result in premature convergence and instability. For hyperparameters, we employed the default Adam W optimizer (Kingma and Ba 2017; Loshchilov and Hutter 2019) with a tri-stage learning rate scheduler. The warm-up, hold, and decay phases were configured to 10%, 60%, and 30% of the total training steps, respectively. We then performed a series of experiments to determine the optimal learning rate schedule within the range of 5e-6 to 5e-4. ... We leveraged a beam search decoder from flashlight (Kahn et al. 2022) with a beam size of 100. ... We set the temperature value to 0.0 for all LLMs to obtain deterministic results. |
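The tri-stage learning-rate schedule quoted in the Experiment Setup row (10% warm-up, 60% hold, 30% decay over 3,000 steps) can be sketched in pure Python. The phase split and total step count come from the paper; the peak learning rate, the zero boundary values, and the linear shape of the warm-up and decay phases are illustrative assumptions — the paper only reports searching learning rates in the range 5e-6 to 5e-4.

```python
def tri_stage_lr(step, total_steps, peak_lr=5e-5, init_lr=0.0, final_lr=0.0,
                 warmup_frac=0.10, hold_frac=0.60):
    """Tri-stage schedule: linear warm-up, constant hold, linear decay.

    The 10%/60%/30% phase split follows the paper; peak_lr, the
    boundary rates, and the linear ramps are assumptions for
    illustration (the paper reports only a 5e-6 to 5e-4 search range).
    """
    warmup_steps = int(total_steps * warmup_frac)
    hold_steps = int(total_steps * hold_frac)
    decay_steps = total_steps - warmup_steps - hold_steps  # remaining ~30%

    if step < warmup_steps:
        # Stage 1: linear ramp from init_lr up to peak_lr.
        return init_lr + (peak_lr - init_lr) * step / max(1, warmup_steps)
    if step < warmup_steps + hold_steps:
        # Stage 2: hold at the peak learning rate.
        return peak_lr
    # Stage 3: linear decay from peak_lr toward final_lr.
    t = (step - warmup_steps - hold_steps) / max(1, decay_steps)
    return peak_lr + (final_lr - peak_lr) * min(1.0, t)


# Example over the paper's 3,000 fine-tuning steps.
total = 3000
print(tri_stage_lr(0, total))      # start of warm-up: init_lr
print(tri_stage_lr(300, total))    # end of warm-up: peak_lr
print(tri_stage_lr(1500, total))   # mid-hold: peak_lr
print(tri_stage_lr(2999, total))   # near end of decay: approaches final_lr
```

Schedulers of this shape are typically wrapped in a framework's `LambdaLR`-style callback; the standalone function above just makes the three phase boundaries explicit.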