Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
Authors: Xiong Wang, Yangze Li, Chaoyou Fu, Yike Zhang, Yunhang Shen, Lei Xie, Ke Li, Xing Sun, Long Ma
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 3 (Experiments): 3.1 Setups (3.1.1 Datasets, 3.1.2 Model Configuration, 3.1.3 Training); 3.2 Results on speech input; 3.3 Results on speech output; 3.4 Results on spoken question answering; 3.5 Analysis on end-to-end latency |
| Researcher Affiliation | Collaboration | 1Tencent, China 2ASLP@NPU, China 3Nanjing University, China. Correspondence to: Long Ma <EMAIL>. |
| Pseudocode | No | The paper describes methods and architectures (e.g., in Figure 1, 2, 3, 4) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions several tools and models (e.g., Qwen2-7B-Instruct, Ti Codec, edge-tts, paraformer-zh) and provides links to their repositories or HuggingFace pages. However, it does not provide an explicit statement or link for the source code of Freeze-Omni itself, nor does it state that the code for their methodology is released. |
| Open Datasets | Yes | In this paper, we only randomly selected 60,000 multi-round Q&A data from moss-003-sft-data and used the backbone LLM to generate new answers to replace its original one. We used a zero-shot TTS system to synthesize its text into speech. For the modeling of speech input of Freeze-Omni, we used 110,000h of internal speech-text paired ASR data, including both Chinese and English, in stage 1 and stage 2. In stage 3, we used the pairing of speech input and text output of the multi-round Q&A data mentioned above. For the modeling of the speech output of Freeze-Omni, we used about 3,000h of text-speech paired data generated by a zero-shot TTS system in stages 1 and 2. In stage 3, we used the pairing of text input and speech output of the multi-round Q&A data mentioned above. To demonstrate the intelligence of Freeze-Omni, we verified the accuracy of spoken question answering on three sets: LlaMA-Questions (Nachmani et al., 2023), Web Questions (Berant et al., 2013), and Trivia QA (Joshi et al., 2017). moss-003-sft-data: https://huggingface.co/datasets/fnlp/moss-003-sft-data; LlaMA-Questions: https://github.com/google-research-datasets/LLAMA1-Test-Set; Web Questions: https://huggingface.co/datasets/Stanford/web_questions; Trivia QA: https://nlp.cs.washington.edu/triviaqa/. aishell-1 (Bu et al., 2017), test net (Zhang et al., 2022), and test meeting (Zhang et al., 2022) are Mandarin evaluation sets, measured in CER (%), while dev-clean, dev-other, test-clean, and test-other (Panayotov et al., 2015) are English evaluation sets, measured in WER (%). |
| Dataset Splits | No | The paper mentions using specific datasets for training and evaluation, e.g., "60,000 multi-round Q&A data from moss-003-sft-data", "110,000h of internal speech-text paired ASR data", and "1,000 evaluation utterances". However, it does not explicitly provide details about how these datasets were split into training, validation, or test sets for the model's own training process (e.g., percentages or exact counts for splits of their training data). The evaluation sets (LlaMA-Questions, Web Questions, Trivia QA, aishell-1, test net, test meeting, dev-clean, dev-other, test-clean, test-other) are mentioned as external evaluation benchmarks, not internal splits of their training data. |
| Hardware Specification | No | All the experiments were completed on 8 GPUs. |
| Software Dependencies | No | The paper mentions several software components such as Qwen2-7B-Instruct, Ti Codec, the AdamW optimizer, PyTorch, paraformer-zh, and edge-tts. However, it does not provide specific version numbers for these software libraries or frameworks, which are crucial for reproducibility. |
| Experiment Setup | Yes | In the training process, we used the AdamW (Loshchilov & Hutter, 2017) optimizer with a warm-up learning rate scheduler, and different learning rates were used in different stages. The learning rates used in the three stages of the modeling of speech input are 2e-4, 1e-4, and 6e-4 respectively. The learning rates used in stages 2 and 3 of the modeling of speech output are both 5e-5, and the training hyperparameters used in stage 1 are the same as those in Ti Codec. All the experiments were completed on 8 GPUs. Speech Encoder: We used a multi-layer convolution with 4-times downsampling and 24 layers of transformers with a hidden size of 1024. The adapter consists of a multi-convolution layer with 2-times downsampling. The number of parameters for the speech encoder is approximately 350M, with an output frame rate of 12.5 Hz. The input of the speech encoder is the mel-filter bank feature with a 25ms window size and 10ms shift. Speech Decoder: We used Ti Codec (Ren et al., 2023) as the codec model, and we customized the configuration so that the size of the codebook is 1024 with a single codebook and the frequency of the speech token is 40 Hz. For the speech decoder part, the NAR (Prefix) speech decoder and the AR speech decoder are 4-layer Llama decoder layers with a hidden size of 896. The number of parameters for the speech decoder is approximately 120M, and the output sample rate of the codec model is 24000 Hz. Both Freeze-Omni and Qwen2-7B-Instruct use greedy search in the generation stage with zero-shot [...] |
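The Mandarin sets in the table are scored in CER (%) and the English sets in WER (%); both metrics reduce to a Levenshtein edit distance over tokens (characters for CER, words for WER) divided by the reference length. A minimal sketch of the standard computation, not taken from the paper's codebase:

```python
# Minimal word error rate (WER) sketch via Levenshtein edit distance.
# For Mandarin CER, tokenize into characters instead of splitting on spaces.

def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (one-row DP)."""
    dp = list(range(len(hyp) + 1))  # distances for an empty reference
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,        # deletion of r
                dp[j - 1] + 1,    # insertion of h
                prev + (r != h),  # substitution (free if tokens match)
            )
    return dp[len(hyp)]

def wer(reference: str, hypothesis: str) -> float:
    """Edit distance over words, normalized by reference length."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / max(len(ref), 1)
```

For example, `wer("the cat sat", "the cat sit")` counts one substitution out of three reference words, i.e. 33.3%.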
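The Experiment Setup cell states only that AdamW was used with a warm-up learning rate scheduler and stage-specific peak learning rates (e.g., 2e-4 for stage 1 of speech-input modeling); the warm-up length is not reported. A hedged sketch of a linear warm-up schedule, where the warm-up step count is an illustrative assumption:

```python
# Linear warm-up schedule sketch. The paper specifies AdamW + warm-up
# and the peak LRs per stage, but NOT the warm-up step count, which is
# assumed here for illustration.

def warmup_lr(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Ramp the LR linearly from ~0 to peak_lr over warmup_steps, then hold."""
    return peak_lr * min(1.0, (step + 1) / warmup_steps)

# Stage-1 speech-input peak LR from the paper: 2e-4 (warmup_steps assumed).
STAGE1_PEAK_LR = 2e-4
WARMUP_STEPS = 1000  # assumption, not from the paper
```

In a training loop this would typically be wired up as a `LambdaLR`-style scheduler stepped once per optimizer update, holding the peak rate after warm-up.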