Low-Resource Speech Translation
Description
This shared task will focus on the problem of developing speech transcription and translation tools for under-resourced languages, to match the real-world needs of humanitarian organizations. For the vast majority of the world’s languages, little speech-translation parallel data exists at the scale needed to train speech translation models. Instead, in a real-world situation we will have access to limited, disparate resources (e.g. word-level translations, speech recognition data, small parallel text corpora, monolingual text, raw audio, etc.).
This shared task, organized in collaboration with Translators without Borders, invites participants to build the best systems they can for transcribing and/or translating the following language pairs:
- Coastal Swahili (swa) to English (eng)
- Congolese Swahili (swc) to French (fra)
We will provide lists of available resources, including:
- parallel translation text corpora for any combination in {swa,swc}x{eng,fra}
- speech recognition datasets for the Swahili varieties (which will be the source side)
- small speech translation datasets for the two language pairs
- word-level terminologies and lexicons
- monolingual textual corpora in all four languages
We allow the use of any pre-trained machine translation, speech recognition, speech synthesis, or speech translation model. We will allow both constrained and unconstrained submissions (a constrained submission is one that uses only the data we provide/list). The only data that will not be allowed is data that is part of the test set.
We invite participants to explore all possible research directions: pipeline approaches, end-to-end models, offline or online methods, cross-lingual transfer from other languages (check out the shared task on multilingual speech translation, which provides several datasets), data augmentation, and anything else you come up with.
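To make the pipeline option concrete, below is a minimal cascade sketch (ASR followed by MT) using the Hugging Face transformers library. The ASR checkpoint name is a hypothetical placeholder, and the particular MT checkpoint and language codes are illustrative choices on our part, not a recommendation:

```python
# Minimal cascade (pipeline) sketch: ASR -> MT, using Hugging Face transformers.
# The ASR checkpoint below is a hypothetical placeholder; the MT checkpoint
# and language codes are mBART-50 identifiers, used here purely for illustration.
import torch
from transformers import (
    MBart50TokenizerFast,
    MBartForConditionalGeneration,
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

ASR_NAME = "your-org/wav2vec2-large-swahili"  # hypothetical fine-tuned ASR model
MT_NAME = "facebook/mbart-large-50-many-to-many-mmt"

processor = Wav2Vec2Processor.from_pretrained(ASR_NAME)
asr_model = Wav2Vec2ForCTC.from_pretrained(ASR_NAME)
tokenizer = MBart50TokenizerFast.from_pretrained(MT_NAME, src_lang="sw_KE")
mt_model = MBartForConditionalGeneration.from_pretrained(MT_NAME)

def speech_to_translation(waveform, sampling_rate=16000):
    """Transcribe Swahili audio, then translate the transcript into English."""
    # Step 1: ASR (audio -> Swahili transcript) via greedy CTC decoding.
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = asr_model(inputs.input_values).logits
    transcript = processor.batch_decode(torch.argmax(logits, dim=-1))[0]

    # Step 2: MT (Swahili transcript -> English translation).
    batch = tokenizer(transcript, return_tensors="pt")
    generated = mt_model.generate(
        **batch, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"]
    )
    translation = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
    return transcript, translation
```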
We look forward to your creative submissions!
Allowed Training Data
You may use any of the resources listed below or any other resource you deem appropriate. There is exactly one resource you are not allowed to use: the TICO-19 dataset (which will be part of the evaluation set).
Speech-Transcription-Translation data
- We are releasing small datasets (5k instances) for Coastal Swahili speech to English, as well as Congolese Swahili speech to French. These are available for download through this link. The .zip file also includes the validation data for both language pairs. The format of the files follows that of the multilingual task.
- You may also use any of the allowed data for the Offline Speech Translation task or the Multilingual Speech Translation task, i.e. the Multilingual TEDx and CoVoST datasets, Europarl-ST, and MuST-C.
ASR data
Any speech recognition data can be used. We point to some relevant resources:
- ALFFA dataset, hosted in OpenSLR
- Mozilla Common Voice
- Gamayun Swahili speech samples (requires registration but is free)
- IARPA Babel Swahili Language Pack, available through LDC (fee of $25)
Translation data
Any parallel data can be used, except for the TICO-19 dataset. We point to some relevant resources:
- English-Swahili (swa) parallel data on OPUS (select ‘en’ and ‘swa’): MultiCCAligned, CCAligned, JW300, etc.
- French-Congolese Swahili (swc) parallel data on OPUS (select ‘fr’ and ‘swc’): JW300
- Gamayun kit translated by Translators without Borders (requires registration but is free).
Pretrained Models
The use of pre-trained models such as wav2vec 2.0 and mBART is also allowed.
Evaluation
The primary task is translation, and hence submissions will be evaluated using standard automatic translation metrics (e.g. BLEU and chrF++ as computed by SacreBLEU). To ensure fair comparisons in the shared task’s final rankings, though, we will distinguish between systems that use pre-trained models and those that do not.
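For reference, a minimal scoring sketch with the SacreBLEU Python API is shown below (the file names are placeholders; both files are assumed to contain one detokenized sentence per line):

```python
# Minimal scoring sketch using the sacrebleu package.
# "hyp.txt" and "ref.txt" are placeholder file names,
# each with one detokenized sentence per line.
import sacrebleu

with open("hyp.txt") as f:
    hyps = [line.strip() for line in f]
with open("ref.txt") as f:
    refs = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hyps, [refs])
chrf = sacrebleu.corpus_chrf(hyps, [refs], word_order=2)  # word_order=2 -> chrF++
print(f"BLEU: {bleu.score:.2f}  chrF++: {chrf.score:.2f}")
```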
In addition, if a participant’s system also produces transcriptions of the source utterances (as would be the case for pipeline/cascade or multitask systems), we invite their submission as well, and we will evaluate ASR quality using the standard word error rate (WER) metric.
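As an illustration, WER can be computed with the open-source jiwer package (one option among several; not necessarily the official scoring tool):

```python
# Illustrative WER computation with the jiwer package
# (one open-source option; not necessarily the official scorer).
import jiwer

refs = ["habari ya asubuhi"]  # reference transcripts (toy example)
hyps = ["habari za asubuhi"]  # system transcripts (toy example)

# One substitution out of three words -> WER of 1/3.
print(jiwer.wer(refs, hyps))
```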
Blind evaluation sets are available here.
Evaluation Dates
Note that the evaluation for the Low-Resource Shared Task will start a week after the other shared tasks. We expect the test set to be released around April 12th, and we will notify registered participants when the test sets become available.
Submission
Submissions should be compressed into a single .tar.gz file and emailed here, with “IWSLT 2021 Low-Resource Shared Task Submission” in the subject line.
We would like to see outputs for both test sets. If multiple outputs are submitted for one test set, one system must be explicitly marked as primary, or the submission with the latest timestamp will be treated as primary.
File names for translation outputs should follow this structure:
<participant>.st.<primary/contrastive>.<src>-<tgt>.txt
e.g., gmu.st.primary.swa-eng.txt
File names for speech recognition outputs should follow this structure:
<participant>.asr.<primary/contrastive>.<src>.txt
e.g., gmu.asr.primary.swc.txt
Submissions should consist of plaintext files with one sentence per line, following the order of the test set, pre-formatted for scoring (i.e. detokenized). We ask that participants include a (very) short system description in the submission email.
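For instance, a submission covering both language pairs could be packaged as follows (a sketch; “gmu” stands in for your participant ID, and the exact set of output files depends on what your system produces):

```python
# Sketch of packaging a submission; "gmu" is a placeholder participant ID.
import tarfile

outputs = [
    "gmu.st.primary.swa-eng.txt",
    "gmu.st.primary.swc-fra.txt",
    "gmu.asr.primary.swa.txt",  # optional: ASR outputs, if produced
    "gmu.asr.primary.swc.txt",
]

with tarfile.open("gmu_submission.tar.gz", "w:gz") as tar:
    for name in outputs:
        tar.add(name)
```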
Organizers
- Antonios Anastasopoulos (George Mason University, USA)
- Grace Tang (Translators without Borders)
- Will Lewis (University of Washington, USA)
- Sylwia Tur (Appen, USA)
- Rosie Lazar (Appen, USA)
- Marcello Federico (Amazon, USA)
- Alex Waibel (CMU, USA)
Contact
Organizers: We can be reached by email through the IWSLT Google group listed below.
Discussion: IWSLT Google group