Dialectal and Low-resource track

Description

The goal of this shared task is to benchmark and promote speech translation technology for a diverse range of dialects and low-resource languages. While significant research progress has been demonstrated recently on popular datasets, many of the world’s dialects and low-resource languages lack the parallel data at scale needed for standard supervised learning. We will likely require creative approaches in leveraging disparate resources.

For example, to translate dialectal speech such as Tunisian Arabic, one may leverage existing speech and text resources in Modern Standard Arabic. Or, to translate a low-resource language such as Tamasheq, one may need to leverage word-level translation resources and raw audio.

We will provide training and evaluation data for 8 typologically diverse language-pairs. Participants are free to participate in any number of language-pairs in this track, but we highly encourage participation in as many as possible. We welcome both dedicated systems that are designed for a single language-pair and general recipes aimed at improving speech translation broadly for a wide typology of languages.

General Information for All Language-Pairs

The submission format will be standardized across all language-pairs. Participants can submit systems under two conditions:

Constrained condition: systems are trained only on the datasets provided by the organizers (listed below)
Unconstrained condition: systems can be trained with any resource, including pre-trained models. Multilingual models are allowed.

Information about data and baselines is provided in the sections specific to each language pair.

Data and Baselines

Dialectal Arabic to English (ara-eng)

This language pair will focus on evaluating performance on two Arabic vernaculars:

Tunisian (ISO-3 code: aeb)
North Levantine (ISO-3 code: apc)

We point the participants to training data across different Arabic varieties:

The aeb-eng training data are the same as those used in the IWSLT 2022 and 2023 tracks: https://iwslt.org/2022/dialect. We suggest you follow the train/dev/test1 split instructions on the linked webpage.
The apc-eng validation/testing data (with transcriptions) can be found here. Participants are provided with about 120k lines of multi-parallel North Levantine-MSA-English textual data, which can be downloaded from the LINDAT/CLARIAH-CZ Repository. For the speech data, we recommend two LDC resources: BBN/AUB DARPA Babylon Levantine corpus (Speech + Transcript) and the Levantine Arabic QT Training Data Set 5 corpus (Speech + Transcript).

[Feb 8 Update:] We have worked with LDC to make sure participants can access the below-mentioned Tunisian-English data! See below. IWSLT participants may obtain the Tunisian-English speech translation data for no cost from LDC. Please sign this form and email it to ldc@ldc.upenn.edu (see instructions in the form itself). This 3-way parallel data corresponds to 160 hours and 200k lines worth of aligned audio in Tunisian speech, Tunisian transcripts, and English translations. All datasets have been manually segmented at the utterance level.

We also provide links to speech recognition datasets that include Arabic data:

OpenSLR Resource SLR46
OpenSLR Resource SLR48
OpenSLR Resource SLR108
OpenSLR Resource SLR132

[April 2 Update:] THE TEST DATA FOR 2024 IS NOW AVAILABLE HERE

Bemba to English (bem-eng)

Bemba is a Bantu language, spoken by over 10 million people in Zambia and other parts of Africa.

Data are based on the corpus described in this paper, providing 180 hours of Bemba speech, along with transcriptions and translations in English. They are available for download in this Github link.

Additional Bemba speech data (with transcriptions) are available here:

BembaSpeech data paper
ZambeziVoice data paper

[April 2 Update:] THE TEST DATA FOR 2024 IS NOW AVAILABLE HERE

Bhojpuri to Hindi (bho-hin)

Bhojpuri belongs to the Indo-Aryan language group. It is predominantly spoken in the western part of Bihar, the north-western part of Jharkhand, and the Purvanchal region of Uttar Pradesh in India. As per the 2011 Census of India, it has around 50.58 million speakers. Bhojpuri is spoken not just in India but also in other countries such as Nepal, Trinidad, Mauritius, Guyana, Suriname, and Fiji. Since Bhojpuri was considered a dialect of Hindi for a long time, it did not attract much attention from linguists and hence remains among the many lesser-known and less-resourced languages of India.

IWSLT participants may obtain the Bhojpuri-Hindi speech translation data without any cost. Please sign this form and email it to info@panlingua.co.in. This corpus consists of 25 hours of audio speech data from the news domain and translations into Hindi text.

We point participants to additional Bhojpuri audio data (with transcriptions), parallel and monolingual corpora from here:

[April 2 Update:] THE TEST DATA FOR 2024 WILL BE SENT TO REGISTERED PARTICIPANTS ONLY. IF YOU HAVE NOT REGISTERED, PLEASE SEND AN EMAIL TO info@panlingua.co.in

Irish to English (gle-eng)

Irish (also known as Gaeilge) has around 170,000 L1 speakers and “1.85 million (37%) people across the island (of Ireland) claim to be at least somewhat proficient with the language”. In the Republic of Ireland, it is the national and first official language. It is also one of the official languages of the European Union and a recognized minority language in Northern Ireland.

IWSLT participants may obtain the Irish-English speech translation data from here. Please sign this form to get access credentials. This corpus consists of 11 hours of audio speech data and translations into English text.

[April 2 Update:] THE TEST DATA FOR 2024 IS NOW AVAILABLE HERE

Maltese to English (mlt-eng)

[Update Feb 1]: The data and form are available!

Maltese is a Semitic language, with a heavy influence from Italian and English. It is spoken mostly in Malta, but also in migrant communities abroad, most notably in Australia and parts of America and Canada. The data release for this shared task consists of over 14 hours (split into dev and train) of audio data, together with their transcription in Maltese and translation into English.

To obtain the data, please fill out this form.

We also point participants to additional Maltese data here:

text corpus used to train BERTu, a Maltese BERT model
MASRI Data speech recognition data
Maltese Language Resource Server

[April 2 Update:] THE TEST DATA FOR 2024 CAN BE DOWNLOADED HERE

Marathi to Hindi (mar-hin)

Marathi is an Indo-Aryan language predominantly spoken in India’s Maharashtra state. It is one of the 22 scheduled languages of India and the official language of Maharashtra and Goa. As per the 2011 Census of India, it has around 83 million speakers, which is 6.86% of the country’s total population. Marathi ranks third by number of speakers among the languages spoken in India.

IWSLT participants may obtain the Marathi-Hindi speech translation data without any cost. Please sign this form and email it to info@panlingua.co.in. This corpus consists of 30 hours of audio speech data from the news domain and translations into Hindi text.

We point participants to additional Marathi audio data (with transcriptions) from here:

[April 2 Update:] THE TEST DATA FOR 2024 IS NOW AVAILABLE HERE

Quechua to Spanish (que-spa)

Quechua is an indigenous language spoken by more than 8 million people in South America. It is mainly spoken in Peru, Ecuador, and Bolivia, where the official high-resource language is Spanish. It is a highly inflected, agglutinative language that relies on suffixes, and in this respect it has been found to be similar to languages such as Finnish. The average number of morphemes per word (synthesis) is about twice that of English: English typically has around 1.5 morphemes per word, whereas Quechua has about 3.

There are two main region divisions of Quechua known as Quechua I and Quechua II. This data set consists of two main types of Quechua spoken in Ayacucho, Peru (Quechua Chanka ISO:quy) and Cusco, Peru (Quechua Collao ISO:quz) which are both part of Quechua II and, thus, considered “southern” languages. We label the data set with que - the ISO code for Quechua II mixtures.

IWSLT participants may obtain the public Quechua-Spanish speech translation dataset along with the additional parallel (text-only) data for the constrained task at no cost here: IWSLT 2024 QUE-SPA Data set. IWSLT participants should also feel free to use any publicly available data for the unconstrained task. This includes a data set of nearly 50 hours of fully transcribed Quechua audio from previous shared tasks. For assistance with the data sets, please email j.ortega@northeastern.edu and rodolfojoel.zevallos@upf.edu.

[April 2 Update:] THE TEST DATA FOR 2024 IS NOW AVAILABLE HERE

[April 18 Update:] THE TEST SCRIPT (BLEU AND CHRF) FOR 2024 IS NOW AVAILABLE HERE

Tamasheq to French (tmh-fra)

Tamasheq is a variety of Tuareg, a Berber macro-language spoken by nomadic tribes across North Africa in Algeria, Mali, Niger and Burkina Faso. It accounts for approximately 500,000 native speakers, being mostly spoken in Mali and Niger. This task is about translating spoken Tamasheq into written French. Almost 20 hours of spoken Tamasheq with French translation are freely provided by the organizers. A major challenge is that no Tamasheq transcription is provided.

Speech-to-translation parallel data: here
Additional audio data (see description in the above Github page): here
The corpus is described in this paper
Baseline systems are available as a SpeechBrain recipe here. The best baseline system gets a BLEU score of 13.89 on the validation data

[April 2 Update:] THE TEST DATA FOR 2024 IS NOW AVAILABLE HERE

Baselines

We provide various baselines:

For Arabic, feel free to build upon the baseline models in ESPnet provided by CMU WAVLab. Here are the recipes for the basic condition: ASR model and ST model. You may also find it helpful to refer to the system description papers in 2022 from CMU, JHU, and ON-TRAC, or the 2022 findings paper.
A baseline system for Tamasheq is available as a SpeechBrain recipe here. This system is the one which got the best result during the IWSLT22 edition with a BLEU score of 5.7.
We also direct the participants to the IWSLT 2023 findings paper for best practices based on last year’s shared task. Papers describing last year’s submitted systems are listed here.

Submission

Participants will submit their final predictions in the following format for all language pairs.

We will primarily focus on speech translation results (“st”), but participants are welcome to share intermediate speech recognition outputs as well (“asr”).

We ask participants to identify their primary submission (which will be used for the final ranking). We will also allow up to two contrastive submissions (“contrastive1”, “contrastive2”).

Please name all files as follows:

[team_name].[task].[type].[label].[language-pair].txt

where:

“team_name” is the name of the team
“task” is one of “st” and “asr”
“type” is one of “constrained” and “unconstrained”
“label” is one of “primary”, “contrastive1”, or “contrastive2”
“language-pair” uses the three-letter ISO codes defined above (e.g. que-spa for Quechua to Spanish)
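The naming scheme above can be sketched as a small helper that assembles and checks a file name. This is only an illustration, not an official tool: the function name `submission_filename`, the restriction that a team name contains no dots or whitespace, and the `xxx-yyy` shape check on the language pair are all assumptions made here.

```python
import re

# Allowed values, taken from the field descriptions above.
TASKS = ("st", "asr")
TYPES = ("constrained", "unconstrained")
LABELS = ("primary", "contrastive1", "contrastive2")

# Assumption: a language pair is two three-letter lowercase codes joined by "-".
PAIR_PATTERN = re.compile(r"[a-z]{3}-[a-z]{3}")


def submission_filename(team_name, task, system_type, label, language_pair):
    """Build a name of the form team.task.type.label.pair.txt (hypothetical helper)."""
    # Assumption: dots and whitespace in the team name would make the
    # dot-separated fields ambiguous, so they are rejected here.
    if not team_name or "." in team_name or any(c.isspace() for c in team_name):
        raise ValueError(f"unsuitable team name: {team_name!r}")
    if task not in TASKS:
        raise ValueError(f"task must be one of {TASKS}, got {task!r}")
    if system_type not in TYPES:
        raise ValueError(f"type must be one of {TYPES}, got {system_type!r}")
    if label not in LABELS:
        raise ValueError(f"label must be one of {LABELS}, got {label!r}")
    if not PAIR_PATTERN.fullmatch(language_pair):
        raise ValueError(f"unexpected language pair: {language_pair!r}")
    return f"{team_name}.{task}.{system_type}.{label}.{language_pair}.txt"


if __name__ == "__main__":
    print(submission_filename("myteam", "st", "constrained", "primary", "que-spa"))
```

The team name "myteam" is a placeholder; the remaining values come from the lists above.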

If participants do not have both a constrained and an unconstrained system, or do not have all of primary, contrastive1, and contrastive2, they should submit only the files that they have. Please do NOT repeat submissions.

Submission files should contain one translation (or transcription) per line, in the same order as the segments file for the corresponding test data split.
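As a rough sketch of the one-per-line requirement, the helper below turns a list of system outputs into file content and checks that the number of lines matches the number of test segments. The name `format_submission` and the choice to collapse internal whitespace are assumptions for illustration; the segments file for each language pair remains the reference for ordering.

```python
def format_submission(hypotheses, num_segments):
    """Return file content with exactly one hypothesis per line (hypothetical helper).

    `hypotheses` must already be in the same order as the segments file.
    """
    if len(hypotheses) != num_segments:
        raise ValueError(
            f"expected {num_segments} hypotheses, got {len(hypotheses)}"
        )
    lines = []
    for hyp in hypotheses:
        # Collapse any internal whitespace, including newlines, so that each
        # hypothesis occupies exactly one line. An empty hypothesis is kept
        # as an empty line so that later lines stay aligned with their segments.
        lines.append(" ".join(hyp.split()))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    content = format_submission(["first output", "second\noutput"], num_segments=2)
    print(content, end="")
```

Keeping empty outputs as empty lines, rather than dropping them, is a design choice made here so that line N always corresponds to segment N.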

[April 14 Update:] ANNOUNCING A 4 DAY EXTENSION – FINAL DEADLINE IS APRIL 19 EOD

[April 15 Update:] Submission Information – please see below

We ask participants to email their submissions for all language pairs to the organizers in the following email address:

iwslt.2024.lowres.submissions@gmail.com

If submitting a system for Quechua, please cc John Ortega: j.ortega@northeastern.edu
If submitting a system for Bhojpuri, Marathi, or Irish, please cc Atul K. Ojha: atulkumar.ojha@insight-centre.org

Evaluation

The official BLEU score will use lower-case and no punctuation, following the “norm” files in the setup instructions.
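To give a sense of what lower-casing and punctuation removal might look like before scoring, here is a minimal sketch. It is not the official normalization: the “norm” files in the setup instructions are authoritative, and the function name `normalize` and the rule of dropping every character in a Unicode punctuation category are assumptions made for this illustration.

```python
import unicodedata


def normalize(text):
    """Lower-case text and drop punctuation (assumed approximation, not official)."""
    lowered = text.lower()
    # Drop every character whose Unicode general category starts with "P"
    # (punctuation). This also removes apostrophes and hyphens, which the
    # official normalization may treat differently.
    kept = "".join(
        ch for ch in lowered if not unicodedata.category(ch).startswith("P")
    )
    # Collapse runs of whitespace left behind by removed characters.
    return " ".join(kept.split())


if __name__ == "__main__":
    print(normalize("Hello, World!"))
    print(normalize("¿Cómo estás, amigo?"))
```

Applying the same normalization to both hypotheses and references before computing BLEU keeps the comparison consistent.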

We will also aim for a human evaluation of the translation outputs.

Organizers

Arabic:

Kenton Murray, Johns Hopkins University
Mateusz Krubiński, Institute of Formal and Applied Linguistics, Charles University (krubinski [email symbol] ufal.mff.cuni.cz)
Pavel Pecina, Institute of Formal and Applied Linguistics, Charles University (pecina [email symbol] ufal.mff.cuni.cz)

Bemba:

Antonios Anastasopoulos, George Mason University (antonis [email symbol] gmu.edu)
Claytone Sikasote, University of Zambia (claytone.sikasote [email symbol] cs.unza.zm)

Bhojpuri, Irish, Marathi:

Atul Kr. Ojha - University of Galway (atulkumar.ojha [email symbol] insight-centre.org)
John P. McCrae - University of Galway

Maltese:

Claudia Borg, University of Malta (claudia.borg [email symbol] um.edu.mt)
Rishu Kumar, Charles University (kumarri [email symbol] student.cuni.cz)

Quechua:

John E. Ortega - Northeastern University (j.ortega [email symbol] northeastern.edu)
Rodolfo Zevallos - Universitat Pompeu Fabra (rodolfojoel.zevallos [email symbol] upf.edu)
William Chen - Carnegie Mellon University (wc4 [email symbol] andrew.cmu.edu)
Ibrahim Ahmed - Northeastern University (i.ahmad [email symbol] northeastern.edu)

Tamasheq:

Yannick Estève - Avignon University (yannick.esteve [email symbol] univ-avignon.fr)

Contact

Chair: Antonios Anastasopoulos, George Mason University

Discussion: iwslt-evaluation-campaign@googlegroups.com

Please use the tag [LowRes] in your email title when emailing the above googlegroup.