High-Fidelity Simultaneous Speech-To-Speech Translation

Authors: Tom Labiausse, Laurent Mazaré, Edouard Grave, Alexandre Défossez, Neil Zeghidour

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On a French-English simultaneous speech translation task, Hibiki demonstrates state-of-the-art performance in translation quality, speaker fidelity and naturalness. Moreover, the simplicity of its inference process makes it compatible with batched translation and even real-time on-device deployment. We provide examples1 as well as models and inference code.2 4. Experiments 4.3. Evaluation metrics and baselines 4.6. Results Table 1. Comparison with offline baselines. We also report performance from a closed-source streaming model (*) as it uses the same evaluation protocol. Table 2. Objective comparison of Hibiki with StreamSpeech (Zhang et al., 2024a) and Seamless (Barrault et al., 2023). Table 3. Human evaluation. Raters report Mean Opinion Scores (MOS) between 1 and 5. Ground-truth is real human interpretation. Table 4. Ablations.
Researcher Affiliation | Industry | Tom Labiausse 1, Laurent Mazaré 1, Edouard Grave 1, Alexandre Défossez 1, Neil Zeghidour 1. 1Kyutai, Paris, France. Correspondence to: Hibiki <EMAIL>.
Pseudocode | No | The paper describes methods and architectural details using mathematical equations and textual explanations, but it does not include a clearly labeled pseudocode block or algorithm.
Open Source Code | Yes | We provide examples1 as well as models and inference code.2 2https://github.com/kyutai-labs/hibiki
Open Datasets | Yes | We will release our code, models, and a high-quality 900-hour synthetic dataset. We evaluate models on the Fr-En task of CVSS (Jia et al., 2022b). While it is the standard benchmark for S2ST and allows comparisons with previous models, we observe that 99% of its sequences are shorter than 10 seconds. We thus extend our evaluation to long-forms. Jia, Y., Ramanovich, M. T., Wang, Q., and Zen, H. CVSS corpus and massively multilingual speech-to-speech translation. In Calzolari, N., Béchet, F., Blache, P., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Isahara, H., Maegaard, B., Mariani, J., Mazo, H., Odijk, J., and Piperidis, S. (eds.), Proceedings of the Thirteenth Language Resources and Evaluation Conference, LREC 2022, pp. 6691–6703. European Language Resources Association, 2022b.
Dataset Splits | Yes | The optimal parameters are cross-validated independently for each dataset using a held-out 8% of Audio-NTREX and the valid split of CVSS-C.
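The held-out split described above can be illustrated with a minimal sketch. This is a hypothetical helper, not the paper's code: the function name, seed, and use of a plain shuffle are assumptions; only the 8% held-out fraction comes from the paper.

```python
import random

def split_held_out(items, held_out_frac=0.08, seed=0):
    """Shuffle the items and hold out a fraction for validation.

    Illustrates holding out 8% of a dataset (as done for Audio-NTREX
    in the paper); the shuffling strategy and seed are assumptions.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_held = max(1, round(held_out_frac * len(shuffled)))
    return shuffled[n_held:], shuffled[:n_held]

train, held_out = split_held_out(range(1000))
print(len(train), len(held_out))  # 920 80
```

The held-out portion is then used to select decoding hyperparameters, while the remainder stays available for training.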
Hardware Specification | Yes | Figure 7 shows that Hibiki remains faster than real-time on an H100 even when processing 320 sequences in parallel (or 160 with classifier-free guidance). Figure 8 shows inference traces of Hibiki-M on an iPhone 16 Pro.
Software Dependencies | No | The paper mentions several tools and models, such as the Whisper (Radford et al., 2023; Louradour, 2023) large-v3 model, PySBD (Sadvilkar & Neumann, 2020), MADLAD-3B (Kudugunta et al., 2023), WavLM (Chen et al., 2022), and XCOMET-XL for COMET scores, but it does not specify explicit software library versions (e.g., Python 3.8, PyTorch 1.9).
Experiment Setup | Yes | We train a French-English speech translation system through the following steps, each with a cosine learning rate schedule and AdamW (Loshchilov & Hutter, 2019), with a weight decay of 0.1, and momentum parameters of (0.9, 0.95). Text pretraining. We first pretrain the Temporal Transformer from scratch on multilingual text-only data using next-token prediction, for 600K steps, with a batch of 1,024 sequences of length 4,096. We use a cosine learning rate schedule, with 2K warmup steps and a maximum value of 4.8 × 10⁻⁴. Audio pretraining. Starting from the pretrained text model, we perform an audio pretraining on non-parallel French and English data with a single stream as done by Défossez et al. (2024). We train for 1,450K steps with a batch size of 144 and a learning rate of 2 × 10⁻⁴. Speech translation training. We train for 150K steps with a batch size of 96, a learning rate of 3 × 10⁻⁵, and compute the loss on both the source and the target streams. Speech translation fine-tuning. We fine-tune for 8K steps with a batch size of 8, a learning rate of 2 × 10⁻⁶, conditional training on the speaker similarity, special EOS tokens, and apply the loss to both streams. The optimal parameters are γ = 3.0, a temperature of 0.8, top-k of 250 for audio tokens and 50 for text tokens for Audio-NTREX. On CVSS, the same configuration is used except for text tokens, which are sampled with a temperature of 0.1.
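The schedule described above (cosine decay with linear warmup) can be sketched as a pure function of the step count. This is a minimal illustration, not the authors' implementation: the decay-to-zero floor and the exact warmup/decay boundaries are assumptions; the peak learning rate (4.8 × 10⁻⁴), warmup length (2K steps), and total steps (600K for text pretraining) come from the paper.

```python
import math

def lr_at_step(step, max_lr=4.8e-4, total_steps=600_000, warmup_steps=2_000):
    """Cosine learning-rate schedule with linear warmup.

    Sketch of the schedule used in each training stage of the paper;
    the assumption here is that the rate decays to zero at total_steps.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to max_lr during warmup.
        return max_lr * step / warmup_steps
    # Cosine decay from max_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))

print(lr_at_step(2_000))  # peak at end of warmup: 0.00048
```

The same function covers the other stages by swapping in their hyperparameters (e.g. max_lr=2e-4, total_steps=1_450_000 for audio pretraining), since each stage uses its own cosine schedule.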