High-Fidelity Simultaneous Speech-To-Speech Translation

Authors: Tom Labiausse, Laurent Mazaré, Edouard Grave, Alexandre Défossez, Neil Zeghidour

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On a French-English simultaneous speech translation task, Hibiki demonstrates state-of-the-art performance in translation quality, speaker fidelity and naturalness. Moreover, the simplicity of its inference process makes it compatible with batched translation and even real-time on-device deployment. We provide examples1 as well as models and inference code.2 4. Experiments 4.3. Evaluation metrics and baselines 4.6. Results Table 1. Comparison with offline baselines. We also report performance from a closed-source streaming model (*) as it uses the same evaluation protocol. Table 2. Objective comparison of Hibiki with StreamSpeech (Zhang et al., 2024a) and Seamless (Barrault et al., 2023). Table 3. Human evaluation. Raters report Mean Opinion Scores (MOS) between 1 and 5. Ground-truth is real human interpretation. Table 4. Ablations.
Researcher Affiliation | Industry | Tom Labiausse 1, Laurent Mazaré 1, Edouard Grave 1, Alexandre Défossez 1, Neil Zeghidour 1. 1Kyutai, Paris, France. Correspondence to: Hibiki <EMAIL>.
Pseudocode | No | The paper describes methods and architectural details using mathematical equations and textual explanations, but it does not include a clearly labeled pseudocode block or algorithm.
Open Source Code | Yes | We provide examples1 as well as models and inference code.2 2https://github.com/kyutai-labs/hibiki
Open Datasets | Yes | We will release our code, models, and a high-quality 900-hour synthetic dataset. We evaluate models on the Fr-En task of CVSS (Jia et al., 2022b). While it is the standard benchmark for S2ST and allows comparisons with previous models, we observe that 99% of its sequences are shorter than 10 seconds. We thus extend our evaluation to long-forms. Jia, Y., Ramanovich, M. T., Wang, Q., and Zen, H. CVSS corpus and massively multilingual speech-to-speech translation. In Calzolari, N., Béchet, F., Blache, P., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Isahara, H., Maegaard, B., Mariani, J., Mazo, H., Odijk, J., and Piperidis, S. (eds.), Proceedings of the Thirteenth Language Resources and Evaluation Conference, LREC 2022, pp. 6691–6703. European Language Resources Association, 2022b.
Dataset Splits | Yes | The optimal parameters are cross-validated independently for each dataset using a held-out 8% of Audio-NTREX and the valid split of CVSS-C.
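The held-out split described above can be illustrated with a minimal sketch. This is a hypothetical helper, not the paper's code: the function name, seed, and use of a plain shuffle are assumptions; only the 8% held-out fraction comes from the paper.

```python
import random

def split_held_out(items, held_out_frac=0.08, seed=0):
    """Shuffle the items and hold out a fraction for validation.

    Illustrates holding out 8% of a dataset (as done for Audio-NTREX
    in the paper); the shuffling strategy and seed are assumptions.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_held = max(1, round(held_out_frac * len(shuffled)))
    return shuffled[n_held:], shuffled[:n_held]

train, held_out = split_held_out(range(1000))
print(len(train), len(held_out))  # 920 80
```

The held-out portion is then used to select decoding hyperparameters, while the remainder stays available for training.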
Hardware Specification | Yes | Figure 7 shows that Hibiki remains faster than real-time on an H100 even when processing 320 sequences in parallel (or 160 with classifier-free guidance). Figure 8 shows inference traces of Hibiki-M on an iPhone 16 Pro.
Software Dependencies | No | The paper mentions several tools and models, such as the Whisper (Radford et al., 2023; Louradour, 2023) large-v3 model, PySBD (Sadvilkar & Neumann, 2020), MADLAD-3B (Kudugunta et al., 2023), WavLM (Chen et al., 2022), and XCOMET-XL for COMET scores, but it does not specify explicit software library versions (e.g., Python 3.8, PyTorch 1.9).
Experiment Setup | Yes | We train a French-English speech translation system through the following steps, each with a cosine learning rate schedule and AdamW (Loshchilov & Hutter, 2019), with a weight decay of 0.1, and momentum parameters of (0.9, 0.95). Text pretraining. We first pretrain the Temporal Transformer from scratch on multilingual text-only data using next-token prediction, for 600K steps, with a batch of 1,024 sequences of length 4,096. We use a cosine learning rate schedule, with 2K warmup steps and a maximum value of 4.8 × 10⁻⁴. Audio pretraining. Starting from the pretrained text model, we perform an audio pretraining on non-parallel French and English data with a single stream as done by Défossez et al. (2024). We train for 1,450K steps with a batch size of 144 and a learning rate of 2 × 10⁻⁴. Speech translation training. We train for 150K steps with a batch size of 96, a learning rate of 3 × 10⁻⁵, and compute the loss on both the source and the target streams. Speech translation fine-tuning. We fine-tune for 8K steps with a batch size of 8, a learning rate of 2 × 10⁻⁶, conditional training on the speaker similarity, special EOS tokens, and apply the loss to both streams. The optimal parameters are γ = 3.0, a temperature of 0.8, top-k of 250 for audio tokens and 50 for text tokens for Audio-NTREX. On CVSS, the same configuration is used except for text tokens, which are sampled with a temperature of 0.1.
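The schedule described above (cosine decay with linear warmup) can be sketched as a pure function of the step count. This is a minimal illustration, not the authors' implementation: the decay-to-zero floor and the exact warmup/decay boundaries are assumptions; the peak learning rate (4.8 × 10⁻⁴), warmup length (2K steps), and total steps (600K for text pretraining) come from the paper.

```python
import math

def lr_at_step(step, max_lr=4.8e-4, total_steps=600_000, warmup_steps=2_000):
    """Cosine learning-rate schedule with linear warmup.

    Sketch of the schedule used in each training stage of the paper;
    the assumption here is that the rate decays to zero at total_steps.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to max_lr during warmup.
        return max_lr * step / warmup_steps
    # Cosine decay from max_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))

print(lr_at_step(2_000))  # peak at end of warmup: 0.00048
```

The same function covers the other stages by swapping in their hyperparameters (e.g. max_lr=2e-4, total_steps=1_450_000 for audio pretraining), since each stage uses its own cosine schedule.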