Low-resource track

Description

The goal of this shared task is to benchmark and promote speech translation technology for a diverse range of dialects and low-resource languages. While significant research progress has been demonstrated recently on popular datasets, many of the world’s dialects and low-resource languages lack the parallel data at scale needed for standard supervised learning. We will likely require creative approaches in leveraging disparate resources.

The low-resource shared task will, for the first time, involve two tracks:

Track 1: A "traditional" speech-to-text translation track focusing on XX typologically diverse language-pairs.
Track 2: A data track, inviting participants to provide open-sourced speech translation datasets for under-resourced languages.

Participants are free to participate in any number of language-pairs in any of the tracks, but we highly encourage participation in as many as possible. We welcome both dedicated systems that are designed to a single language-pair, as well as general recipes aimed at improving speech translation broadly for a wide typology of languages.

Track 1: Speech-Text Speech Translation

General Information for All Language-Pairs

The submission format will be standardized across all language-pairs. Participants can submit systems under two conditions:

Constrained condition: systems are trained only on the datasets provided by the organizers (listed below)
Unconstrained condition: systems can be trained with any resource, including pre-trained models. Multilingual models are allowed.

Information about data and baselines are provided in the sections specific to each language pair.

Data for each Language Pair

Please see below for data pointers for each language pair.

(North) Levantine Dialectal Arabic to English (pca-eng)

This language pair will focus on evaluating performance on one Arabic vernacular:

North Levantine (ISO-3 code: apc)

We point the participants to training data across different Arabic varieties:

The apc-eng validation/testing data (with transcriptions) can be found here. Participants are provided with about 120k lines of multi-parallel North Levantine-MSA-English textual data, that can be downloaded from the LINDAT/CLARIAH-CZ Repository. For the speech data, we recommend two LDC resources: BBN/AUB DARPA Babylon Levantine corpus (Speech + Transcript) and the Levantine Arabic QT Training Data Set 5 corpus (Speech + Transcript). For validation, please use the validation and test splits of IWSLT 2024. We will provide a new test set for the evaluation period.

We also provide links to speech recognition datasets that include Arabic data:

OpenSLR Resource SLR46
OpenSLR Resource SLR48
OpenSLR Resource SLR108
OpenSLR Resource SLR132

(New April 11) The link to the GitHub repo (with updated section for 2025) is here provides the test set!

Tunisian Arabic Dialect to English (aeb-eng)

This shared task is intended to advance the state of the art of speech transcription and translation for the Tunisian dialect.

Participants will be provided with the following datasets:

(1) 323.73 hours of Tunisian Conversational Telephone Speech (CTS), with manual transcripts;
(2) 167.48 hours of the above data manually translated into English for end-to-end speech translation models;
(3) 08 hours of Tunisian dialect Conversations in train stations with manual transcripts.

Datasets (1) and (2) are made available to the IWSLT participants at no cost by LDC. The development and test sets (~3 hours each) are also three-way Tunisian Conversational Telephone Speech from LDC. Participants will be evaluated using CTS (similar to LDC datasets) test sets for:

(1) Speech Transcription that takes Tunisian speech as input and generates Tunisian text as output;
(2) Speech Translation that takes Tunisian speech as input and generates English text as output.

Participants can build systems for evaluation in any of these conditions:

Constrained: train using only Tunisian-English resources from LDC;
Unconstrained condition: participants may use any additional public or private resources.

Obtaining Data

IWSLT participants may obtain the Tunisian-English speech translation data for no cost from LDC. Please sign this and email it to ldc@ldc.upenn.edu. This 3-way parallel data corresponds to datasets (1) and (2) mentioned in the above Description section. TARIC data set is available by fill out the form available at https://demo-lia.univ-avignon.fr/taric-dataset/ of TARIC

After you obtain the data sets please use the files ids available here https://github.com/fbougares/iwslt25_aeb-eng to generate your dev and internal test sets. The official test set files will be released later for official evaluation.

Baseline Models

Stay tuned, we will be releasing baseline models.

Submission

Participants will receive email from LDC with instructions for downloading the evaluation set. The evaluation set will include a segments.txt (one utterance per line, with file-ids and start/end times) and the submission of translation outputs should be ordered in the same way. Submissions should be compressed in a single .tar.gz file and emailed to fethi DOT bougares AT elyadata.com, with “IWSLT 2025 Dialect Shared Task Submission” in the title; you will receive a confirmation of receipt within a day. If multiple outputs are submitted for one test set, one system must be explicitly marked as primary, or the submission with the latest timestamp will be treated as primary.

File names pattern for translation outputs should follow the following structure:

(1) File names for speech recognition outputs should follow the following structure:
<participant>.asr.<condition>.<primary/contrastive1/contrastive2>.<src>.txt
e.g., lia.asr.constrained.primary.aeb.txt

(1) File names for speech translation outputs should follow the following structure:
<participant>.st.<condition>.<primary/contrastive1/contrastive2>.<src>-<tgt>.txt
e.g., lia.st.unconstrained.primary.aeb-eng.txt

Submissions should consist of plaintext files with one sentence per line, following the order of the test set segment file, pre-formatted for scoring (detokenized). We will be using BLEU and CHRF scores for official evaluation. Submissions are supposed to be lower-cased and without punctuation.

Participants are requested to include a short system description in the submission email.

Organizers

Fethi Bougares (Head of Research @ Elyadata / Associate member @LIA)
Yannick Estève (Full Professor @LIA)
Salima Mdhaffar (Researcher @LIA)
Haroun Elleuch (Phd student Elyadata/LIA)

For questions and clarifications about this task, please write to

fethi DOT bougares @ elyadata.com
yannick DOT esteve @ univ-avignon.fr

Bemba to English (bem-eng)

Bemba is a Bantu language, spoken by over 10 million people in Zambia and other parts of Africa.

Data are based on the corpus described in this paper, providing 180 hours of Bemba speech, along with transcriptions and translations in English. They are available for download in this Github link.

Additional Bemba speech data (with transcriptions) are available here:

BembaSpeech data paper
ZambeziVoice data paper

(NEW April 3) The test set can be accessed here.

Fongbe to French (fon-fra)

Fongbe, a tonal African language, is the most spoken dialect of Benin, by more than 50% of Benin’s population, including 8 million speakers. Fongbe is also spoken in Nigeria and Togo.

This task involves translating spoken Fongbe into written French. The organizers provides nearly 57 hours of spoken Fongbe with corresponding French translations. The data used for this shared task is an extension of the FFTSC corpus The use of this data is restricted to participation in this shared task. To get it, please sign this form and email it to fortune.kponou@imsp-uac.org

Evaluation data are now accessible to the registered participants on the private HF repo dedicated to this fon-fra task

Irish to English (gle-eng)

Irish (also known as Gaeilge) has around 170,000 L1 speakers and “1.85 million (37%) people across the island (of Ireland) claim to be at least somewhat proficient with the language”. In the Republic of Ireland, it is the national and first official language. It is also one of the official languages of the European Union and a recognized minority language in Northern Ireland.

IWSLT participants may obtain the Irish-English speech translation data from here.

(NEW April 5) The test set can be accessed here.

Bhojpuri to Hindi (bho-hin)

Bhojpuri belongs to the Indo-Aryan language group. It is dominantly spoken in India’s western part of Bihar, the north-western part of Jharkhand, and the Purvanchal region of Uttar Pradesh. As per the 2011 Census of India, it has around 50.58 million speakers. Bhojpuri is spoken not just in India but also in other countries such as Nepal, Trinidad, Mauritius, Guyana, Suriname, and Fiji. Since Bhojpuri was considered a dialect of Hindi for a long time, it did not attract much attention from linguists and hence remains among the many lesser-known and less-resourced languages of India.

IWSLT participants may obtain the Bhojpuri-Hindi speech translation data without any cost. This corpus consists of 25 hours of audio speech data from the news domain and translations into Hindi text.

We point participants to additional Bhojpuri audio data (with transcriptions), parallel and monolingual corpora from here:

(NEW April 5) The test set can be accessed here.

Estonian to English (est-eng)

Training and dev data for IWSLT 2025 Estonian-English speech translation task are available here

Training data contains 581647 utterances (1258 hours), and the development set 1601 utterances (3.6 hours). Training data originates from the TalTech Estonian Speech Dataset 1.0 which is a manually transcribed dataset of mostly broadcast data created for training ASR models. All the speech data consists of long-form speech and has been manually transcribed and time-aligned with speech at an utterance level. In the dataset provided here, the long-form recordings have been split up into utterances. The transcripts have been automatically translated to English using Google Translate. Development data contains data from government and municipal press conferences, TV news and TV talk shows and has been manually translated to English. Both original Estonian transcriptions as well as English translations are provided for all utterances.

(NEW April 10) The test set can be accessed here.

Maltese to English (mlt-eng)

[Update Feb 1]: The data and form are available! Maltese is a Semitic language, with a heavy influence from Italian and English. It is spoken mostly in Malta, but also in migrant communities abroad, most notably in Australia and parts of America and Canada. The data release for this shared task consists of over 14 hours (split into dev and train) of audio data, together with their transcription in Maltese and translation into English.

To obtain the data, please fill out this form.

We also point participants to additional Maltese data here:

text corpus used to train BERTu, a Maltese BERT model
MASRI Data speech recognition data
Maltese Language Resource Server

(New April 14) Test data are available here. The link provides pointers to two test sets. We ask participants to submit predictions for both test sets, marking them in the file name appropriately.

Marathi to Hindi (mar-hin)

Marathi is an Indo-Aryan language dominantly spoken in India’s Maharashtra state. It is one of the 22 scheduled languages of India and the official language of Maharashtra and Goa. As per the 2011 Census of India, it has around 83 million speakers which covers 6.86% of the country’s total population. Marathi speakers rank third amongst the languages that are spoken in India.

IWSLT participants may obtain the Marathi-Hindi speech translation data without any cost. This corpus consists of 30 hours of audio speech data from the news domain and translations into Hindi text.

We point participants to additional Marathi audio data (with transcriptions) from here:

(NEW April 7) The test set can be accessed here.

Quechua to Spanish (que-spa)

Quechua is an indigenous language spoken by more than 8 million people in South America. It is mainly spoken in Peru, Ecuador, and Bolivia where the official high-resource language is Spanish. It is a highly inflective language based on its suffixes which agglutinate and found to be similar to other languages like Finnish. The average number of morphemes per word (synthesis) is about two times larger than English. English typically has around 1.5 morphemes per word and Quechua has about 3 morphemes per word.

There are two main region divisions of Quechua known as Quechua I and Quechua II. This data set consists of two main types of Quechua spoken in Ayacucho, Peru (Quechua Chanka ISO:quy) and Cusco, Peru (Quechua Collao ISO:quz) which are both part of Quechua II and, thus, considered “southern” languages. We label the data set with que - the ISO code for Quechua II mixtures.

IWSLT participants may obtain the public Quechua-Spanish speech translation dataset along with the additonal parallel (text-only) data for the unconstrained task at no cost here: IWSLT 2025 QUE-SPA Data set. IWSLT particpants should feel free to use any publicly available data for the unconstrained task. This includes a data set of nearly 50 hours of fully transcribed Quechua audio from previous shared tasks along with the introduction of a new data set this year which is about 8 hours of synthetic (post-edited) translations. For assistance with the data sets, please email j.ortega@northeastern.edu and rodolfojoel.zevallos@upf.edu.

(NEW April 10) The test set can be accessed here.

Submission

(New April 13) Concrete details for the submissions are available!

We will primarily focus on speech translation results (“st”), but participants are welcome to share intermediate speech recognition outputs as well (“asr”).

We ask participants to identify their primary submission (which will be used for the final ranking). We will also allow up to two contrastive submissions (“contrastive1”, “contrastive2”).

Please name all files as follows:

[team_name].[task].[type].[label].[language-pair].txt

where:

“team_name” is the name of the team
“task” is one of “st” and “asr”
“type” is one of “constrained” and “unconstrained”
“label” is one of “primary”, “contrastive1”, or “contrastive2”
“language-pair” uses the three-letter ISO codes defined above (e.g. que-spa for Quechua to Spanish)

If participants do not have a constrained/unconstrained system or primary, constrastive1, constrastive2 they should submit only the files that they have, please do NOT repeat submissions.

Submission files should contain translations (or transcriptions) in the format of one per line following the format of the segments file (in sequence) corresponding to the test data splits.

We ask participants to email their submissions for all language pairs to the organizers in the following email address:

iwslt.2025.lowres.submissions@gmail.com

If submitting a system for Quechua, please cc John Ortega: j.ortega@northeastern.edu If submitting a system for Bhojpuri, Marathi, or Irish, please cc Atul K. Ojha: atulkumar.ojha@insight-centre.org

Ideally, your email body should include a brief description of your system (which the organizers can use/modify for describing your submission in the Findings paper), and a brief explanation if you include multiple files per language.

[April 14 Update:] ANNOUNCING A 4 DAY EXTENSION – FINAL DEADLINE IS APRIL 19 EOD

Evaluation

The official scoring will be based on automatic metrics (COMET, BLEU, chrF++) over lower-cased outputs after removing punctuation. Please follow the “norm” files in the setup instructions.

We will also aim for a human evaluation of the translation outputs for the more competitive systems.

Track 2: Training and Evaluation Data Track

This track aims to empower language communities to contribute to key datasets. These datasets are essential for expanding the reach of spoken language technology to more languages and varieties.

Progress made in translation quality has largely been directed at high-resource languages. Recently, focus has started to shift to under-served languages, and foundational datasets such as FLORES and NTREX have made it easier to develop and evaluate MT models for an increasing amount of languages. The high impact of these components left some in the research community wondering: how do we add more languages to these existing open-source datasets?

The goal of this shared task track is to expand open datasets to more languages. In particular, we are soliciting contributions to Speech Translation Training and Evaluation Datasets, either on text-to-speech or speech-to-speech formats.

To describe and publicise their contributions, tracj participants will be asked to submit a 2-4 page paper to be presented at IWSLT 2025, similar to other Shared Task papers.

Data Submission Requirements

We highly encourage participants to get creative, however we also want to ensure data quality. Please see notes below:

Translations should be performed, wherever possible, by qualified, native speakers of the target language. We strongly encourage verification of the data by at least one additional native speaker.
Dataset card: dataset cards should be attached to new data submissions, detailing precise language information and the translation workflow that was employed. In particular, we ask participants to identify the language with both an ISO 639-3 individual language tag and a Glottocode. The script should be identified with an ISO 15924 script code.
License: We highly encourage new contributions to be released under CC BY-SA 4.0 or other similarly permissive licenses. By contributing data to this shared task, participants agree to have this data released under these terms. At a minimum, data should be made available for research use.
Use of automatic translation or LLMs for data generation: while post-editing of automatic output is allowed, we require that any data submitted for the shared task are 100% verified by humans, if not directly created by humans. Raw, unverified machine translated outputs are not allowed. If using MT, you must ensure that the terms of service of the model you use allow re-using its outputs to train other machine translation models (as an example, popular commercial systems such as DeepL, Google Translate and ChatGPT disallow this).

Paper Submission Requirements

Please follow the general submission requirements, ensuring that your paper includes all necessary details for reproducing your system.

Organizers

Contact

Chair(s):

General Overview:
- Antonios Anastasopoulos, George Mason University
- Kenton Murray, Johns Hopkins University
Bemba:
- Claytone Sikasote, University of Zambia
Arabic:
- Pavel Pecina, Institute of Formal and Applied Linguistics, Charles University (pecina [email symbol] ufal.mff.cuni.cz)
- Fethi bougares, University of Le Mans
Fongbe:
- Yannick Estève, Avignon University (yannick.esteve [email symbol] univ-avignon.fr)
- Fethi Bougares, University of Le Mans
- Salima Mdhaffar, Avignon University
Irish, Bhojpuri, Marathi:
- Atul Kr. Ojha, University of Galway (atulkumar.ojha [email symbol] insight-centre.org)
- John P. McCae, University of Galway
- Yasmin Moslem, Bering Lab (only for Irish language pair)
Estonian:
- Tanel Alumäe, Tallinn University of Technology
- Mark Fishel, University of Tartu
Maltese:
- Claudia Borg, University of Malta (claudia.borg [email symbol] um.edu.mt)
Quechua:
- John E. Ortega, Northeastern University (j.ortega [email symbol] northeastern.edu)
- Rodolfo Zevallos, Universitat Pompeu Fabra (rodolfojoel.zevallos [email symbol] upf.edu)

Discussion: iwslt-evaluation-campaign@googlegroups.com