Multilingual Speech Translation

Description

While multilingual translation is an established task, until recently few parallel resources existed for speech translation, and most of those that exist cover only translation from English speech. Multilingual models enable transfer from related tasks, which is particularly important for low-resource languages; however, parallel data between two otherwise high-resource languages is also often scarce, making multilingual and zero-shot translation important across many resource settings.

In addition to parallel speech and translations, many sources of data may be useful for speech translation: monolingual speech and transcripts, parallel text, and data from other languages or language pairs. While cascades of separately trained automatic speech recognition (ASR) and machine translation (MT) models can leverage all of these data sources, how to most effectively do so with end-to-end models remains an open and exciting research question.

Motivated by the above, the multilingual speech translation task will provide data for two conditions: supervised and zero-shot. We will provide speech and transcripts for four languages (Spanish, French, Portuguese, Italian) and translations in a subset of five languages (English, Spanish, French, Portuguese, Italian) as shown below. Zero-shot language pairs will have ASR data released for training but no translations; their target languages may be observed in other language pairs in training. Participants may use the provided resources in any way.

Both constrained submissions (using the provided data only, e.g., no models pretrained on external data) and unconstrained submissions are encouraged and will be evaluated separately. At evaluation time, we will provide speech in the four source languages and ask participants to generate translations in both English and Spanish. Submitting generated transcripts (ASR) for evaluation is not mandatory but strongly encouraged as a useful point of analysis.

We look forward to your creative submissions!

Data

The Multilingual TEDx data is hosted on OpenSLR. The data is derived from TEDx talks and their translations. All provided data is segmented and aligned at the sentence level. The released data contains train, validation, and progress test sets.

Blind evaluation sets for IWSLT2021 have been posted to OpenSLR.

[multilingual speech translation task data image]

Suggested Additional Data for Unconstrained Submissions

These are only suggestions; any publicly available additional data or pretrained models are permitted. We remind participants that use of any resources beyond Multilingual TEDx will make a submission unconstrained.

Baselines

We provide fairseq baselines for all tasks and Kaldi baselines for ASR.

Submission

Submissions should be compressed in a single .tar.gz file and emailed here.
Only translation into en and es is required. We provide test sets for all pairs seen in training, and we will gladly evaluate ASR and translation into the additional pairs for additional analysis if submitted.
Multiple submissions are allowed! If multiple outputs are submitted for one test set, one system must be explicitly marked as primary, or the submission with the latest timestamp will be treated as primary.

File names should use the following structure:
<participant>.<constrained/unconstrained>.<primary/contrastive>.<src>-<tgt>.txt
e.g., jhu.constrained.primary.es-en.txt
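
The naming scheme above can be checked mechanically before submitting. The following is a minimal sketch, not an official validator: the helper names `build_name` and `parse_name` are hypothetical, the `.txt` extension follows the example file name, and the two-letter codes are assumed from the five languages named in the task description.

```python
import re

# Assumed language codes for the five languages named in the task description.
LANGS = {"en", "es", "fr", "pt", "it"}

# <participant>.<constrained/unconstrained>.<primary/contrastive>.<src>-<tgt>.txt
NAME_RE = re.compile(
    r"^(?P<participant>[^.\s]+)\."
    r"(?P<condition>constrained|unconstrained)\."
    r"(?P<rank>primary|contrastive)\."
    r"(?P<src>[a-z]{2})-(?P<tgt>[a-z]{2})\.txt$"
)


def parse_name(name):
    """Return the fields of a submission file name, or None if it does not match."""
    match = NAME_RE.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    if fields["src"] not in LANGS or fields["tgt"] not in LANGS:
        return None
    return fields


def build_name(participant, condition, rank, src, tgt):
    """Assemble a file name and verify that it parses back correctly."""
    name = f"{participant}.{condition}.{rank}.{src}-{tgt}.txt"
    if parse_name(name) is None:
        raise ValueError(f"invalid submission file name: {name}")
    return name


print(build_name("jhu", "constrained", "primary", "es", "en"))
```

Rejecting a name at build time is cheaper than discovering a mislabeled condition or language pair after the archive has been sent.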

Submissions should consist of plaintext files with one sentence per line, pre-formatted for scoring (detokenized).
Participants must specify whether their submission is unconstrained (uses additional data beyond what is provided) or constrained (uses only the TEDx data provided); constrained and unconstrained systems will be scored separately.
For unconstrained systems, additional data or pretrained models should be specified in the submission email.
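
As an illustration of the packaging requirements, here is a minimal sketch that counts the lines of each output file and bundles the files into a single .tar.gz archive. The function names and the optional `expected_lines` check are assumptions for illustration; the number of segments per test set has to come from the released data, and the organizers do not prescribe any particular tooling.

```python
import tarfile
from pathlib import Path


def line_count(path):
    """Count the lines (one sentence per line) in a plaintext output file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def pack(files, archive, expected_lines=None):
    """Bundle output files into one .tar.gz archive.

    If expected_lines is given (hypothetical: the segment count of the
    test set), every file must contain exactly that many lines.
    """
    with tarfile.open(archive, "w:gz") as tar:
        for path in files:
            count = line_count(path)
            if expected_lines is not None and count != expected_lines:
                raise ValueError(
                    f"{path}: {count} lines, expected {expected_lines}"
                )
            # Store only the file name so the archive has a flat layout.
            tar.add(path, arcname=Path(path).name)
    return archive
```

A mismatch between the number of output lines and the number of test segments is a common cause of unscorable submissions, so checking it before packing is worthwhile.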

Evaluation

We will evaluate translation output using BLEU as computed by SacreBLEU and WER for ASR output. Validation and progress test sets have been added to SacreBLEU. WER will be computed on lowercased text with punctuation removed.
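
To make the WER condition concrete, the sketch below lowercases text, removes punctuation, and computes corpus-level word error rate with a word-level edit distance. It is an approximation under stated assumptions, not the organizers' scoring script: punctuation is taken to mean any Unicode character in a punctuation category (so that Spanish ¿ and ¡ are covered), and punctuation is deleted rather than replaced by a space. BLEU is not reproduced here, since it is computed by SacreBLEU.

```python
import unicodedata


def normalize(text):
    """Lowercase, drop Unicode punctuation, and split on whitespace."""
    kept = [
        ch for ch in text.lower()
        if not unicodedata.category(ch).startswith("P")
    ]
    return "".join(kept).split()


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def wer(references, hypotheses):
    """Corpus-level WER: total word errors over total reference words."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses differ in length")
    errors = words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens, hyp_tokens = normalize(ref), normalize(hyp)
        errors += edit_distance(ref_tokens, hyp_tokens)
        words += len(ref_tokens)
    if words == 0:
        raise ValueError("references contain no words")
    return errors / words


print(wer(["Hello, world!"], ["hello world"]))  # 0.0
```

Casing and punctuation differences do not count as errors under this normalization, whereas a missing accent (for example "como" for "cómo") still counts as a substitution.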

Organizers

Elizabeth Salesky (JHU, USA)
Jake Bremerman (UMD, USA)
Jan Niehues (Maastricht University, Netherlands)
Matt Post (JHU, USA)
Matthew Wiesner (JHU, USA)

Contact

Organizers: Email
Discussion: IWSLT google group