LLaMA-Omni: Seamless Speech Interaction with Large Language Models
Authors: Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, Yang Feng
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 4, titled "EXPERIMENTS", details the experimental setups, evaluation metrics, baseline systems, and results in both offline and streaming scenarios, including a human evaluation. This clearly indicates empirical studies with data analysis. |
| Researcher Affiliation | Academia | All authors are affiliated with "Key Laboratory of Intelligent Information Processing Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS)" and/or "University of Chinese Academy of Sciences", both of which are academic institutions. The email domains are "@ict.ac.cn" which further confirms academic affiliation. |
| Pseudocode | Yes | Algorithm 1: Inference Process is clearly presented on page 4, detailing the inference steps in a structured algorithm block. |
| Open Source Code | Yes | Footnote 1 on page 1 states: "Code: https://github.com/ictnlp/LLaMA-Omni | Model: https://huggingface.co/ICTNLP/Llama-3.1-8B-Omni | Audio Samples: https://ictnlp.github.io/llama-omni-demo/" |
| Open Datasets | Yes | For the training data, we collect around 50K instructions from the Alpaca dataset (Taori et al., 2023)... Additionally, we gather around 150K instructions from the UltraChat dataset (Ding et al., 2023)... |
| Dataset Splits | Yes | For the training data, we use the InstructS2S-200K dataset mentioned in Section 3, which includes 200K speech instruction data. For the evaluation data, we select two subsets from AlpacaEval (Li et al., 2023): helpful base and vicuna, as their questions are more suitable for speech interaction scenarios. We remove questions related to math and code, resulting in a total of 199 instructions. To obtain the speech version, we use the CosyVoice-300M-SFT model to synthesize the instructions into speech. We refer to this test set as InstructS2S-Eval in the following sections. |
| Hardware Specification | Yes | The entire training process takes approximately 65 hours on 4 NVIDIA L40 GPUs. We measure the latency on 1 NVIDIA L40 GPU. |
| Software Dependencies | No | The paper mentions several models and tools such as "Whisper-large-v3", "Llama-3.1-8B-Instruct", "HuBERT", "HiFi-GAN vocoder", "GPT-4o", "CosyVoice-300M-SFT", and "VITS", often with citations to the original papers. However, it does not provide specific software version numbers for ancillary software components (e.g., Python, PyTorch, CUDA versions) needed for replication. |
| Experiment Setup | Yes | In the first stage, we train the speech adapter and the LLM with a batch size of 32 for 3 epochs. We use a cosine learning rate scheduler with the first 3% of steps for warmup, and the peak learning rate is set to 2e-5. In the second stage, we train the speech decoder, using the same batch size, number of steps, and learning rate scheduler as the first stage, but with the peak learning rate set to 2e-4. The speech adapter performs a 5× downsampling on the speech representations. The speech decoder consists of 2 Transformer layers with a hidden dimension of 4096, 32 attention heads, and a feed-forward network dimension of 11008... The upsample factor λ is set to 25. For the minimum unit chunk size Ω input to the vocoder, we set Ω = +∞ in the offline scenario, meaning we wait for the entire unit sequence to be generated before inputting it to the vocoder for speech synthesis. In the streaming scenario, we adjust the value of Ω within the range of [10, 20, 40, 60, 80, 100] to control the response latency of the model. |
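The quoted setup describes a cosine learning rate scheduler with the first 3% of steps used for linear warmup, at peak learning rates of 2e-5 (stage 1) and 2e-4 (stage 2). A minimal sketch of such a schedule is below; the function name, the linear warmup shape, and the decay-to-zero floor are assumptions not stated in the paper, which does not specify its exact scheduler implementation.

```python
import math

def cosine_lr_with_warmup(step, total_steps, peak_lr, warmup_frac=0.03):
    """Cosine LR schedule with linear warmup over the first
    `warmup_frac` of steps (3% in the quoted setup)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from ~0 up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr toward 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Stage 1 (speech adapter + LLM): peak_lr = 2e-5
# Stage 2 (speech decoder):       peak_lr = 2e-4
```

In a PyTorch training loop this would typically be wrapped in a `LambdaLR` scheduler; the standalone function form above is just for illustrating the shape of the schedule.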