Simultaneous Speech-to-Speech Translation Without Aligned Data

Abstract. Simultaneous speech translation requires translating source speech into a target language in real time while handling non-monotonic word dependencies. Traditional approaches rely on supervised training with word-level aligned data, which is difficult to collect at scale and thus depends on synthetic alignments built with suboptimal language-specific heuristics. We propose Hibiki-Zero, which eliminates the need for word-level alignments entirely. This fundamentally simplifies the training pipeline and enables seamless scaling to diverse languages with varying grammatical structures, removing the bottleneck of designing language-specific alignment heuristics. We first train on sentence-level aligned data to learn speech translation at high latency, then apply a novel reinforcement learning strategy using GRPO to optimize latency while preserving translation quality. Hibiki-Zero achieves state-of-the-art performance in translation accuracy, latency, voice transfer, and naturalness across five X-to-English tasks. Moreover, we demonstrate that our model can be adapted to support a new input language with less than 1000 hours of speech data. We provide examples and models, and we release a benchmark containing 15 hours of multilingual data for speech translation evaluation.
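As a rough illustration of the second training stage, the sketch below shows the group-relative advantage computation that GRPO is generally built around: several candidate translations are sampled for the same input, each receives a scalar reward, and rewards are normalized within the group. The reward used here (a quality score minus a weighted latency penalty) is purely hypothetical, as are the names `toy_reward` and `latency_weight` and all the numbers; the abstract does not specify the actual reward, so this should not be read as the Hibiki-Zero objective.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same input.

    Samples that score above the group mean get a positive advantage,
    samples below it get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def toy_reward(quality, latency_s, latency_weight=0.1):
    """Hypothetical reward: translation quality minus a latency penalty.

    This trade-off is an assumption made for illustration only.
    """
    return quality - latency_weight * latency_s


# Four made-up candidates for one source utterance: (quality, latency in seconds).
candidates = [(0.80, 6.0), (0.78, 3.0), (0.60, 1.5), (0.79, 4.0)]
rewards = [toy_reward(q, lat) for q, lat in candidates]
advantages = group_relative_advantages(rewards)

for (q, lat), adv in zip(candidates, advantages):
    print(f"quality={q:.2f} latency={lat:.1f}s advantage={adv:+.2f}")
```

Under this toy reward, the second candidate (nearly the best quality at half the latency of the first) receives the largest advantage, which is the kind of trade-off a latency-aware reward is meant to favor.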

In the Wild Examples 🇫🇷🇪🇸🇵🇹🇩🇪

Source: The legendary Paris 2024 Olympic Games of Léon Marchand. - Eurosport France
Source: Biathlon 2025: Franziska Preuß wins her first World Championship. - Eurosport Germany
Source: Australian Open 2026 Final: Carlos Alcaraz vs. Novak Djokovic. - Eurosport España
Source: Iuri Leitao and Rui Oliveira win gold for Portugal at the Paris 2024 Olympics. - Facebook

Multistream Visualization

The source audio (from our long-form evaluation dataset Audio-NTREX-4L) and the translated versions are on separate channels. The volume of the source is reduced so that the translation is easier to hear.
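For readers who want to build a similar two-channel mix from their own recordings, here is a minimal sketch of the idea, assuming two mono signals given as lists of float samples at the same sample rate. The channel assignment (source on the left, translation on the right), the function name `to_stereo`, and the gain of 0.3 are all assumptions made for illustration; the page does not state the values used for the visualizations.

```python
def to_stereo(source, translation, source_gain=0.3):
    """Pair two mono signals into stereo frames, attenuating the source.

    Returns a list of (left, right) tuples with the source on the left
    channel and the translation on the right. The shorter signal is
    padded with silence so both channels have the same length.
    """
    n = max(len(source), len(translation))
    src = list(source) + [0.0] * (n - len(source))
    tgt = list(translation) + [0.0] * (n - len(translation))
    return [(source_gain * s, t) for s, t in zip(src, tgt)]


# Tiny made-up signals: three source samples, two translation samples.
frames = to_stereo([1.0, -1.0, 0.5], [0.2, 0.4])
for left, right in frames:
    print(f"L={left:+.2f} R={right:+.2f}")
```

Keeping the two signals on separate channels, rather than summing them, lets a listener isolate either one by panning or by using a single earphone.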

Short-form Simultaneous Translations

The source audios come from our Europarl-ST evaluation data.

Source language	Source	Hibiki-Zero	Seamless

Long-form Simultaneous Translations

The source audios come from our Audio-NTREX-4L evaluation dataset.

Source language	Source	Hibiki-Zero	Seamless

Short-form Simultaneous Translations from Italian

The source audios come from our Europarl-ST evaluation data. Hibiki-Zero-IT denotes our model adapted for translation from Italian with less than 1000 hours of Italian-to-English data.

Source language	Source	Hibiki-Zero-IT	Seamless

This page was adapted from the SoundStorm project page.