This task focuses on automatic dubbing: translating the speech in a video into a new language such that the new speech sounds natural when overlaid on the original video.

Participants will be given original videos (like this), along with their transcripts (e.g. “Sein Hauptgebiet waren es, romantische und gotische Poesie aus dem Englischen ins Japanische zu übersetzen.” — in English: “His main area was translating romantic and Gothic poetry from English into Japanese.”):

And will be asked to generate new audio (in the target language) for the video which, when overlaid on the original video, looks like this:

Automatic dubbing is a challenging task. A unique aspect of dubbing is isochrony: the property that the speech translation is time-aligned with the original speaker’s video. When the speaker’s mouth is moving, a listener should hear speech; likewise, when their mouth isn’t moving, a listener should not hear speech.
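As a rough illustration, isochrony can be quantified by comparing when each track contains speech. The sketch below is our own toy example (the segment format and function name are assumptions, not part of the task): it measures frame-level agreement between the original and dubbed speech/pause patterns.

```python
# Toy isochrony check (illustrative only): segments are (start, end) times
# in seconds during which a track contains speech.

def speech_overlap(src_segments, dub_segments, step=0.01):
    """Fraction of 10 ms frames where both tracks agree on speech vs. silence."""
    total = max(end for _, end in src_segments + dub_segments)
    n = int(total / step) + 1

    def active(segments, t):
        # True if time t falls inside any speech segment
        return any(s <= t < e for s, e in segments)

    agree = sum(
        active(src_segments, i * step) == active(dub_segments, i * step)
        for i in range(n)
    )
    return agree / n

# Original speaker talks 0-2.1 s, pauses, then talks 2.8-4.1 s;
# the dub starts on time but ends each segment slightly off.
src = [(0.0, 2.1), (2.8, 4.1)]
dub = [(0.0, 2.0), (2.9, 4.1)]
print(round(speech_overlap(src, dub), 3))
```

A perfectly isochronous dub scores 1.0; the small boundary mismatches above cost a few percent.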


Data

We operate only in an unconstrained setting, more in line with real-world conditions, where additional speech and parallel data are likely available. Participants may use any public or private datasets or pre-trained models.

German - English

We follow last year’s (2023) release of training and test data, which can be found here.

The training data for the German-English direction is derived from CoVoST 2 and consists of:

  • Source (German) text
  • Desired target speech durations (e.g. 2.1s of speech, followed by a pause, followed by 1.3s of speech)
  • Target (English) phonemes and durations corresponding to a translation which adheres to the desired timing
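To make the timing constraint concrete, here is a small sketch (the field layout, tolerance, and phoneme symbols are our illustrative assumptions, not the released file format) that checks whether generated phoneme durations add up to the desired per-segment speech durations:

```python
# Hedged sketch: does a phoneme/duration sequence adhere to the desired timing?

def adheres_to_timing(desired, segments, tol=0.1):
    """desired: target speech durations in seconds, e.g. [2.1, 1.3]
    (pauses between them are implicit).
    segments: one list of (phoneme, duration) pairs per speech segment."""
    if len(desired) != len(segments):
        return False
    return all(
        abs(sum(d for _, d in seg) - want) <= tol
        for want, seg in zip(desired, segments)
    )

# 2.1 s of speech, a pause, then 1.3 s of speech (durations are made up)
desired = [2.1, 1.3]
segments = [
    [("DH", 0.08), ("AH", 0.12), ("K", 0.1), ("AE", 0.3), ("T", 1.45)],
    [("S", 0.4), ("AE", 0.5), ("T", 0.45)],
]
print(adheres_to_timing(desired, segments))
```

Both segments here land within 0.05 s of the desired durations, so the check passes.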

The test data consists of videos of native speakers reading individual German sentences from the CoVoST 2 test set.

English - Chinese

This year, we are adding a new language direction for dubbing: English-Chinese. In collaboration with the subtitling task, we will use the English dev set videos described here; the test set is here.


Baseline

For the German-English direction, we provide a complete baseline (used to create the videos in the description above) as a starting point for participants. The baseline uses target factors to keep track of remaining durations and pauses while generating phonemes and durations in the target language. Those phonemes and durations are then passed through a publicly available text-to-speech system.
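The target-factor idea can be illustrated with a toy computation. This is our own sketch: the real baseline learns these factors inside a neural translation model, and the quantization scheme and `<pause>` token here are assumptions. Each phoneme is paired with the time remaining in its speech segment, so a model conditioned on these factors can learn to end segments on schedule.

```python
# Toy illustration of duration target factors (scheme assumed, not official).

def remaining_duration_factors(segments, quant=0.1):
    """segments: list of lists of (phoneme, duration-in-seconds) pairs.
    Returns (phoneme, remaining-time-bin) pairs, with a <pause> token
    between speech segments."""
    factored = []
    for i, seg in enumerate(segments):
        remaining = sum(d for _, d in seg)
        for ph, d in seg:
            # quantize remaining time into coarse bins (here 100 ms each)
            factored.append((ph, round(remaining / quant)))
            remaining -= d
        if i < len(segments) - 1:
            factored.append(("<pause>", 0))
    return factored

seg = [[("HH", 0.1), ("AY", 0.3)], [("B", 0.1), ("AY", 0.4)]]
print(remaining_duration_factors(seg))
```

The factor counts down to zero within each segment, so the decoder always “knows” how much time it has left before the next pause.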

For the English-Chinese direction, baseline systems are TBD.


Submission

Submissions should be uploaded here. Please submit one zip file per team.

Submissions should contain dubbed videos with filenames matching the original (non-dubbed) test set files. If you submit more than one system, please designate your primary submission.


Evaluation

Participant teams will submit English speech for a set of German videos, each containing one sentence from the CoVoST 2 test set. The new audio will be overlaid on the original video, and the resulting video will be judged for its overall quality (both isochrony and translation quality). Exact details are TBD.

For the English-Chinese direction, participants will be asked to generate target speech audio for the subtitling test sets.

We will report automatic metrics for each submission, for both machine translation quality and isochrony adherence. Human evaluations are TBD.
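As one example of what an isochrony-adherence number could look like (an assumption on our part, not the task’s official metric), the sketch below scores the mean absolute shift of speech-segment boundaries between the original and dubbed audio:

```python
# Hypothetical isochrony-adherence metric: average boundary shift in seconds
# (lower is better; 0.0 means segment starts/ends match exactly).

def mean_boundary_shift(src_segments, dub_segments):
    """Each argument: list of (start, end) times in seconds, same length."""
    assert len(src_segments) == len(dub_segments)
    shifts = [
        abs(s1 - s2) + abs(e1 - e2)
        for (s1, e1), (s2, e2) in zip(src_segments, dub_segments)
    ]
    # average over all boundaries (two per segment)
    return sum(shifts) / (2 * len(shifts))

src = [(0.0, 2.1), (2.8, 4.1)]
dub = [(0.0, 2.0), (2.9, 4.1)]
print(round(mean_boundary_shift(src, dub), 3))
```

Translation quality would be scored separately with standard MT metrics over the transcripts of the dubbed speech.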


Organizers

  • Brian Thompson (Amazon)
  • Prashant Mathur (AWS AI Labs)
  • Xing Niu (AWS AI Labs)


Contact

{pramathu,brianjt} AT amazon DOT com