This task focuses on automatic dubbing: translating the speech in a video into a new language such that the new speech sounds natural when overlaid on the original video.
Participants will be given German videos like this, along with their transcripts (e.g. “Sein Hauptgebiet waren es, romantische und gotische Poesie aus dem Englischen ins Japanische zu übersetzen.” — “His main area was translating romantic and Gothic poetry from English into Japanese.”):
And will produce videos like this, dubbed into English:
Automatic dubbing is a very complex task, and for this shared task we focus on the characteristic most unique to dubbing: isochrony. Isochrony refers to the property that the translated speech is time-aligned with the original speaker’s video. When the speaker’s mouth is moving, a listener should hear speech; likewise, when their mouth isn’t moving, a listener should not hear speech.
To make this task accessible for small academic teams with limited training resources, we make some simplifications: First, we assume the input speech has already been converted to text using an ASR system and the desired speech/pause times have been extracted from the input speech. Second, to alleviate the challenges of training a TTS model, the output is defined to be phonemes and their durations. These phonemes and durations will be played through this open-source text-to-speech model to produce the final speech.
To illustrate, here’s an example in which “hallo! wie gehts?” is translated to “hi! how are you?” such that the output fits in the desired target speech durations of 0.4s and 1.3s, with a pause in between:
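As a concrete sketch, the output for this example could be a sequence of (phoneme, duration) pairs per speech segment, where each segment’s durations sum to its target. Note the phoneme symbols and individual durations below are illustrative only, not taken from the official data:

```python
# Hypothetical output format: (phoneme, duration-in-seconds) pairs, one list
# per speech segment. Symbols and durations are illustrative.
segments = [
    [("HH", 0.1), ("AY", 0.3)],                           # "hi!"  -> 0.4s
    [("HH", 0.1), ("AW", 0.2), ("AA", 0.2), ("R", 0.1),
     ("Y", 0.1), ("UW", 0.6)],                            # "how are you?" -> 1.3s
]
targets = [0.4, 1.3]

# Each segment's phoneme durations must sum to the desired segment duration.
for seg, target in zip(segments, targets):
    total = sum(dur for _, dur in seg)
    assert abs(total - target) < 1e-9, (total, target)
```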
Official training and test data can be found here.
The training data is derived from CoVoST2 and consists of:
- Source (German) text
- Desired target speech durations (e.g. 2.1s of speech, followed by a pause, followed by 1.3s of speech)
- Target (English) phonemes and durations corresponding to a translation which adheres to the desired timing
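Given such a record, a natural sanity check is that the per-segment phoneme durations sum to the desired speech durations. The helper below is our own illustration, not part of the official tooling:

```python
def adheres_to_timing(phoneme_segments, target_durations, tol=0.05):
    """Check that each segment's phoneme durations sum to its target
    duration, within `tol` seconds. Illustrative helper only."""
    if len(phoneme_segments) != len(target_durations):
        return False
    return all(
        abs(sum(dur for _, dur in seg) - target) <= tol
        for seg, target in zip(phoneme_segments, target_durations)
    )

# Example: 2.1s of speech, a pause, then 1.3s of speech (as above);
# phoneme symbols and durations are made up for illustration.
segs = [[("P", 1.0), ("AH", 1.1)], [("T", 0.6), ("IY", 0.7)]]
print(adheres_to_timing(segs, [2.1, 1.3]))  # True
```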
The test data consists of videos of native speakers reading individual German sentences from the CoVoST-2 test set.
In order to make the shared task approachable for small academic teams, we will have a constrained setting in which participants may use only the official data listed above and may not use any pretrained models.
Additionally, we will have an unconstrained setting more in line with real-world conditions, where additional speech and parallel data are likely available. In the unconstrained setting, participants may use any public or private datasets or pre-trained models.
We provide a complete baseline (used to create the videos in the description above) as a starting point for participants. The baseline uses target factors to keep track of remaining durations and pauses while generating phonemes and durations in the target language. Those phonemes and durations are then passed through a publicly available text-to-speech system.
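The bookkeeping behind such duration-tracking target factors can be sketched as follows. This is a toy illustration of the signal a target factor might expose to the decoder, not the actual baseline code:

```python
def remaining_duration_factors(phoneme_durations, segment_durations):
    """For each generated phoneme, compute the time remaining in the
    current segment after emitting it -- the kind of countdown a target
    factor can expose to the decoder. Toy illustration only."""
    factors = []
    for seg, total in zip(phoneme_durations, segment_durations):
        remaining = total
        for _, dur in seg:
            remaining -= dur
            factors.append(round(remaining, 3))
    return factors

# One 0.4s segment; the factor counts down to zero as phonemes are emitted.
segs = [[("HH", 0.1), ("AY", 0.3)]]
print(remaining_duration_factors(segs, [0.4]))  # [0.3, 0.0]
```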
Submissions should be uploaded here.
Submissions should be packed in a compressed file with the following naming convention: dubbing-iwslt-2023_[participant-name].tar.gz. The compressed file should contain a root folder with your primary and contrastive system submissions (as folders). Each system folder should include two sub-directories, one per test subset, containing the dubbed English speech.
- Organizers will overlay the English audio over the original German videos.
- The root folder should contain a README with the following information:
- Brief description of each system submitted; if submitting multiple systems, indicate which one the organizers should use as the primary system for evaluation
- Optionally, participants can also report system performance with metrics as mentioned here.
- Training data conditions (constrained, unconstrained)
- List of the data sources used for training the system
- Institution and contact person
- Do you consent to make your submission freely available under MIT license for research purposes and human evaluation? (YES/NO)
- If responding YES to the consent request in the README, include an MIT license file (see sample file)
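For instance, a submission could be packaged with Python’s standard library as below. The team name, folder names, and README contents are placeholders; follow the structure described above:

```python
import os
import tarfile

# Illustrative packaging only: "myteam" and "primary" are placeholder names.
os.makedirs("myteam/primary", exist_ok=True)
with open("myteam/README", "w") as f:
    f.write("System description, data condition, institution, contact, consent\n")

# Archive the root folder under the required naming convention.
with tarfile.open("dubbing-iwslt-2023_myteam.tar.gz", "w:gz") as tar:
    tar.add("myteam")
```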
Participant teams will submit English speech for a set of German videos, each containing one sentence from the CoVoST-2 test set. The new audio will be overlaid on the original video and the resulting video will be judged for its overall quality (both isochrony and translation quality). Exact details TBA.
We will also report automatic metrics for each submission, for both machine translation quality and isochrony adherence. However, human judgements will be the primary evaluation method.
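One simple way to quantify isochrony adherence (our own illustration; the official automatic metrics may differ) is the total overlap between the desired and produced speech intervals:

```python
def interval_overlap(a, b):
    """Total overlap in seconds between two lists of (start, end)
    speech intervals. Illustrative isochrony measure only."""
    return sum(
        max(0.0, min(a_end, b_end) - max(a_start, b_start))
        for a_start, a_end in a
        for b_start, b_end in b
    )

# Desired speech runs 0.0-0.4s and 0.9-2.2s; produced speech is close but
# slightly misaligned. Times are made up for illustration.
desired = [(0.0, 0.4), (0.9, 2.2)]
produced = [(0.0, 0.5), (1.0, 2.2)]
print(interval_overlap(desired, produced))  # ~1.6 seconds of overlap
```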
- Brian Thompson (AWS AI Labs)
- Prashant Mathur (AWS AI Labs)
- Alexandra Chronopoulou (Center for Information and Language Processing, LMU)
- Proyag Pal (University of Edinburgh)
brianjt at amazon dot com