Dialectal Speech Translation

News

March 23, 2022: It has come to our attention that the provided segments.txt in LDC2022E02 contains 5 bad segments (e.g. zero duration, no speech) that should be removed from decoding and scoring. Please use this new segments file to decode. For details, see the GitHub instructions. Sorry for any potential hassle!

Description

In some communities, two dialects of the same language are used by speakers under different settings. For example, in the Arabic-speaking world, Modern Standard Arabic (MSA) is used as spoken and written language for formal communications (e.g., news broadcasts, official speeches, religion), whereas informal communication is carried out in local dialects such as Egyptian, Moroccan, and Tunisian. This diglossia phenomenon poses unique challenges to speech translation. Often only the “high” dialect for formal communication has sufficient training data for building strong ASR and MT systems; the “low” dialect for informal communication may not even be commonly written.

The goal of this shared task is to advance dialectal speech translation in diglossic communities. Specifically, we focus on Tunisian-to-English speech translation, with additional ASR and MT resources in Modern Standard Arabic. Participants will be provided with the following datasets:

(a) 160 hours of Tunisian conversational speech, with manual transcripts
(b) 200k lines of manual translations of the above Tunisian transcripts into English, forming three-way parallel data (i.e. aligned audio, transcript, translation) that supports end-to-end speech translation models
(c) 1200 hours of Modern Standard Arabic (MSA) broadcast news with transcripts for ASR, available from MGB-2 (Specifically, MGB-2 contains an estimated 70% MSA, with the rest being a mix of Egyptian, Gulf, Levantine, and North African dialectal Arabic. All of the MGB-2 train data is allowed.)
(d) ~42,000k lines of bitext in MSA-English for MT, available for download from OPUS (OpenSubtitles, UN, QED, TED, GlobalVoices, News-Commentary). For convenience, these six corpora are packaged in a single 2GB tar file here.

Datasets (a) and (b) are new resources developed by the LDC, which will be provided to the IWSLT participants at no cost. The development and test sets (~3 hours each) are also three-way parallel and have the same characteristics. These datasets have been manually segmented at the utterance level. Participants will build end-to-end or cascaded systems that take Tunisian speech as input and generate English text as final output.

Participants can build systems for evaluation in any of these conditions:

Basic condition: train on datasets (a) and (b) only. This uses only Tunisian-English resources; the smaller dataset and simpler setup make this ideal for participants starting out in speech translation research.
Dialect adaptation condition: train on datasets (a), (b), (c), (d). The challenge is to exploit the large MSA datasets for transfer learning while accounting for lexical, morphological, and syntactic differences between dialects. This condition may be an interesting way to explore how multilingual models work in multi-dialectal conditions.
Unconstrained condition: participants may use public or private resources for English and more Arabic dialects besides Tunisian (e.g., CommonVoice, TEDx, NIST OpenMT, MADAR, GALE). Multilingual models beyond Arabic and English are allowed. This condition is cross-listed with the low-resource shared task.

The main evaluation metric will be BLEU on the final English translation; we will also compute WER on Tunisian transcripts for participants who submit cascade systems. This new dataset from LDC is conversational in nature (similar in style to Spanish Fisher/CALLHOME), and should be interesting for both ASR and MT researchers.
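For participants who want a rough internal check of ASR output, WER is conventionally the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The following is a minimal sketch under stated assumptions: the function name is hypothetical, it scores a single utterance pair, and it applies no text normalization, so its numbers will not necessarily match the official scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: whitespace tokenization, no normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

A corpus-level figure is usually obtained by summing edit distances and reference lengths over all utterances before dividing, rather than averaging per-utterance rates.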

The ultimate goal of this shared task is to explore how transfer learning between “high” and “low” dialects can enable speech translation in diglossic communities. Diglossia is a common phenomenon in the world. Besides Arabic vs. its dialects, other examples include Mandarin Chinese vs. Cantonese/Shanghainese/Taiwanese/etc., Bahasa Indonesia vs. Javanese/Sundanese/Balinese/etc., Standard German vs. Swiss German, and Katharevousa vs. Demotic Greek. We imagine that techniques from multilingual speech translation and low-resource speech translation shared tasks will be relevant; we also hope that new techniques that specifically exploit the characteristics of diglossia will be explored.

Obtaining Data

IWSLT participants may obtain the Tunisian-English speech translation data at no cost from LDC. Please sign this form and email it to ldc@ldc.upenn.edu. This three-way parallel data corresponds to datasets (a) and (b) in the Description section above, and includes 160 hours of audio and 200k lines of aligned Tunisian transcripts and English translations.

After you obtain the Tunisian-English speech translation data from LDC, please follow these instructions to generate data splits. For the Basic condition, please see the resulting train files for training, dev files for development, and test1 files for internal unofficial evaluation. A new blind test2 file will be released for official evaluation.

For the Dialect adaptation condition, you may add any of the MGB-2 data and OPUS bitext referenced above. For the Unconstrained condition, feel free to use any resource.

Baseline Models

Feel free to build upon the baseline models in ESPnet provided by CMU WAVLab. Here are the recipes for the basic condition: ASR model and ST model. The models are also downloadable from Hugging Face.

If you would like to share your baseline models here for other colleagues to use during the evaluation campaign, please contact Kevin Duh.

Submission

Participants will receive email from LDC with instructions for downloading the evaluation set. The evaluation set will include a segments.txt file (one utterance per line, with file-ids and start/end times), and the translation outputs in each submission should follow the same order. (Update 3/23/2022: Please use this new segments file, which removes 5 bad lines.) Submissions should be compressed in a single .tar.gz file and emailed to x@cs.jhu.edu (where x=kevinduh), with “IWSLT 2022 Dialect Shared Task Submission” in the title; you will receive a confirmation of receipt within a day. If multiple outputs are submitted for one test set, one system must be explicitly marked as primary, or the submission with the latest timestamp will be treated as primary.
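Because outputs must line up one-to-one with the segments file, a simple sanity check before packaging can catch dropped or extra lines. The sketch below is only an illustration: the function names and paths are hypothetical, and it assumes nothing about the column layout of the segments file beyond one utterance per line. It verifies line counts and then writes a single .tar.gz archive using Python's standard tarfile module.

```python
import tarfile
from pathlib import Path


def count_lines(path):
    """Count lines in a UTF-8 text file (empty lines included)."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def package_submission(segments_path, output_paths, archive_path):
    """Check each output against the segments file, then build a .tar.gz.

    Hypothetical helper: raises ValueError if any output file does not
    have exactly as many lines as the segments file.
    """
    n_segments = count_lines(segments_path)
    for out in output_paths:
        n_lines = count_lines(out)
        if n_lines != n_segments:
            raise ValueError(
                f"{out}: {n_lines} lines, expected {n_segments} "
                f"(one per line of {segments_path})"
            )
    # "w:gz" writes a gzip-compressed tar archive.
    with tarfile.open(archive_path, "w:gz") as archive:
        for out in output_paths:
            # Store files flat, without local directory prefixes.
            archive.add(out, arcname=Path(out).name)
    return archive_path
```

A line-count check does not prove the ordering is right; it only rules out the most common failure, where a decoder silently skips an utterance.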

File names for translation outputs should use the following structure:
<participant>.st.<condition>.<primary/contrastive1/contrastive2>.<src>-<tgt>.txt
e.g., gmu.st.basic.primary.aeb-eng.txt for translation outputs.

File names for speech recognition outputs should use the following structure:
<participant>.asr.<condition>.<primary/contrastive1/contrastive2>.<src>.txt
e.g., gmu.asr.basic.primary.aeb.txt for ASR outputs.
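The two naming patterns above can be generated programmatically to avoid typos. This is a hedged sketch, not an official tool: the function name is hypothetical, the default language codes aeb and eng are taken from the examples above, and the set of system tags is limited to the three listed in the pattern (whether further contrastive tags are accepted is not stated here).

```python
CONDITIONS = {"basic", "adaptation", "unconstrained"}
SYSTEM_TAGS = {"primary", "contrastive1", "contrastive2"}


def submission_filename(participant, task, condition, system,
                        src="aeb", tgt="eng"):
    """Build an output file name for an ST or ASR submission.

    task is "st" (translation) or "asr" (speech recognition).
    """
    if task not in ("st", "asr"):
        raise ValueError(f"unknown task: {task!r}")
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition: {condition!r}")
    if system not in SYSTEM_TAGS:
        raise ValueError(f"unknown system tag: {system!r}")
    # ST outputs carry a source-target pair; ASR outputs only the source.
    langs = f"{src}-{tgt}" if task == "st" else src
    return f"{participant}.{task}.{condition}.{system}.{langs}.txt"
```

For example, submission_filename("gmu", "st", "basic", "primary") reproduces the translation example given above.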

The <condition> tag should be one of the following: “basic” for basic condition, “adaptation” for dialect adaptation condition, and “unconstrained” for unconstrained condition. Submissions should consist of plaintext files with one sentence per line, following the order of the test set segment file, pre-formatted for scoring (detokenized). The official BLEU score will use lower-case and no punctuation, following the “norm” files in the setup instructions. We ask that participants include a (very) short system description in the submission email.
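To get a feel for what lower-cased, punctuation-free scoring does to a hypothesis during internal evaluation, something like the following can be used. It is only an approximation under stated assumptions: the official normalization is whatever the setup instructions produce in the “norm” files, and this hypothetical function simply lower-cases text, drops characters in Unicode punctuation categories, and collapses whitespace.

```python
import unicodedata


def normalize_for_scoring(line: str) -> str:
    """Approximate lower-case, no-punctuation normalization.

    Assumption: "punctuation" means Unicode general categories
    starting with "P"; the official rules may differ.
    """
    lowered = line.lower()
    kept = "".join(
        ch for ch in lowered
        if not unicodedata.category(ch).startswith("P")
    )
    # Collapse runs of whitespace left behind by removed characters.
    return " ".join(kept.split())
```

This is meant for inspecting internal dev and test1 scores only; submitted files should still be ordinary detokenized text as described above.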

Organizers

Kevin Duh (Johns Hopkins University)
Paul McNamee (Johns Hopkins University)
Kenton Murray (Johns Hopkins University)

Please contact iwslt-evaluation-campaign@googlegroups.com for questions and clarifications about this task.