LLaMA-Omni: Seamless Speech Interaction with Large Language Models
Authors: Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, Yang Feng
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 4, titled "EXPERIMENTS", details the experimental setups, evaluation metrics, baseline systems, and results in both offline and streaming scenarios, including a human evaluation. This clearly indicates empirical studies with data analysis. |
| Researcher Affiliation | Academia | All authors are affiliated with "Key Laboratory of Intelligent Information Processing Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS)" and/or "University of Chinese Academy of Sciences", both of which are academic institutions. The email domains are "@ict.ac.cn" which further confirms academic affiliation. |
| Pseudocode | Yes | Algorithm 1: Inference Process is clearly presented on page 4, detailing the inference steps in a structured algorithm block. |
| Open Source Code | Yes | Footnote 1 on page 1 states: "Code: https://github.com/ictnlp/LLaMA-Omni | Model: https://huggingface.co/ICTNLP/Llama-3.1-8B-Omni | Audio Samples: https://ictnlp.github.io/llama-omni-demo/" |
| Open Datasets | Yes | For the training data, we collect around 50K instructions from the Alpaca dataset (Taori et al., 2023)... Additionally, we gather around 150K instructions from the UltraChat dataset (Ding et al., 2023)... |
| Dataset Splits | Yes | For the training data, we use the InstructS2S-200K dataset mentioned in Section 3, which includes 200K speech instruction data. For the evaluation data, we select two subsets from AlpacaEval (Li et al., 2023): helpful base and vicuna, as their questions are more suitable for speech interaction scenarios. We remove questions related to math and code, resulting in a total of 199 instructions. To obtain the speech version, we use the CosyVoice-300M-SFT model to synthesize the instructions into speech. We refer to this test set as InstructS2S-Eval in the following sections. |
| Hardware Specification | Yes | The entire training process takes approximately 65 hours on 4 NVIDIA L40 GPUs. We measure the latency on 1 NVIDIA L40 GPU. |
| Software Dependencies | No | The paper mentions several models and tools such as "Whisper-large-v3", "Llama-3.1-8B-Instruct", "HuBERT", "HiFi-GAN vocoder", "GPT-4o", "CosyVoice-300M-SFT", and "VITS", often with citations to the original papers. However, it does not provide specific software version numbers for ancillary software components (e.g., Python, PyTorch, CUDA versions) needed for replication. |
| Experiment Setup | Yes | In the first stage, we train the speech adapter and the LLM with a batch size of 32 for 3 epochs. We use a cosine learning rate scheduler with the first 3% of steps for warmup, and the peak learning rate is set to 2e-5. In the second stage, we train the speech decoder, using the same batch size, number of steps, and learning rate scheduler as the first stage, but with the peak learning rate set to 2e-4. The speech adapter performs a 5× downsampling on the speech representations. The speech decoder consists of 2 Transformer layers with a hidden dimension of 4096, 32 attention heads, and a feed-forward network dimension of 11008... The upsample factor λ is set to 25. For the minimum unit chunk size Ω input to the vocoder, we set Ω = +∞ in the offline scenario, meaning we wait for the entire unit sequence to be generated before inputting it to the vocoder for speech synthesis. In the streaming scenario, we adjust the value of Ω within the range of [10, 20, 40, 60, 80, 100] to control the response latency of the model. |
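The quoted setup describes a cosine learning rate scheduler with the first 3% of steps used for linear warmup, at peak learning rates of 2e-5 (stage 1) and 2e-4 (stage 2). A minimal sketch of such a schedule is below; the function name, the linear warmup shape, and the decay-to-zero floor are assumptions not stated in the paper, which does not specify its exact scheduler implementation.

```python
import math

def cosine_lr_with_warmup(step, total_steps, peak_lr, warmup_frac=0.03):
    """Cosine LR schedule with linear warmup over the first
    `warmup_frac` of steps (3% in the quoted setup)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from ~0 up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr toward 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Stage 1 (speech adapter + LLM): peak_lr = 2e-5
# Stage 2 (speech decoder):       peak_lr = 2e-4
```

In a PyTorch training loop this would typically be wrapped in a `LambdaLR` scheduler; the standalone function form above is just for illustrating the shape of the schedule.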