Description

The advent of large language models (LLMs) offers unprecedented opportunities to address traditional natural language processing (NLP) tasks in real-world scenarios and under diverse data conditions. Spoken Language Translation (SLT), which involves automatically translating spoken audio into text in a different language, is no exception, thanks to the possibility of fine-tuning powerful LLMs for specific tasks, domains, and languages, or of employing them in zero-shot settings when suitable adaptation data is unavailable.

The goal of the Offline Speech Translation Task at IWSLT, the one with the longest-standing tradition at the conference, is to provide a stable evaluation framework for tracking technological advancements in SLT, with a focus on unconstrained speech translation—free from the temporal and structural constraints imposed by tasks such as simultaneous translation or subtitling. To this end, while maintaining the overall task formulation is essential, over the years the emphasis has shifted towards incrementally raising the task’s difficulty to better reflect real-world needs, including the translation of new and diverse languages, domains, and speaking styles.

In this spirit, this year’s edition aims to:

  • include a new and challenging language, Arabic (the full list of the new language directions will be made available soon);
  • offer a varied scenario in terms of domains (news, physical training sessions, and TV series), speaking styles, and recording conditions (e.g., single speakers, multiple overlapping speakers, background noise, accent data);
  • promote the development and use of flexible systems capable of operating in this multi-domain scenario, without resorting to ad-hoc, domain-specialized models.

Similar to last year, the task will provide the opportunity to submit custom extensions to standard offline test sets. These sets are designed to focus on specific aspects of the SLT output that are typically overlooked by traditional evaluation methods.

Systems will be evaluated on their ability to produce translations that are similar to the target-language references. This similarity will be measured in terms of multiple automatic metrics: COMET, BLEURT, BLEU, TER, and characTER. The submitted runs will be ranked based on the COMET score computed on the test set after automatic resegmentation of the hypotheses against the reference translations with mwerSegmenter. The detailed evaluation script can be found in the SLT.KIT. Moreover, to meet the requests of last year’s participants, a human evaluation will be performed on the best-performing submission of each participant.
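
The official ranking is computed through SLT.KIT after resegmentation with mwerSegmenter. As a purely illustrative sketch (not the official pipeline), COMET, BLEU, and TER can be computed on already-resegmented output with the unbabel-comet and sacrebleu packages; the file names and the COMET checkpoint below are assumptions, not the official configuration:

# Illustrative scoring sketch, not the official SLT.KIT pipeline.
# Assumes hypotheses were already resegmented against the references
# with mwerSegmenter, one segment per line (file names are hypothetical).
# pip install unbabel-comet sacrebleu
import sacrebleu
from comet import download_model, load_from_checkpoint

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

src = read_lines("tst.src.txt")   # source transcripts
hyp = read_lines("tst.hyp.txt")   # resegmented system output
ref = read_lines("tst.ref.txt")   # reference translations

# COMET (checkpoint name is an assumption; the task may pin a different model)
comet = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
comet_out = comet.predict(
    [{"src": s, "mt": h, "ref": r} for s, h, r in zip(src, hyp, ref)],
    batch_size=8, gpus=0,
)
print("COMET:", comet_out.system_score)

# BLEU and TER via sacrebleu (case-sensitive, punctuation included)
print("BLEU:", sacrebleu.corpus_bleu(hyp, [ref]).score)
print("TER:", sacrebleu.corpus_ter(hyp, [ref]).score)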

Evaluation Conditions

Both cascade and end-to-end models will be evaluated. We kindly ask each participant to specify at submission time whether a cascade or an end-to-end model has been used.

In this task, we use the following definition of end-to-end model:

  • No intermediate discrete representations (e.g., source-language transcripts as in cascade systems, or target-language transcripts as in ROVER-style combination)
  • All parameters/parts used during decoding need to be trained on the end-to-end task (they may also be trained on other tasks, so multitasking is OK, but LM rescoring is not)

All the systems will be evaluated on the combination of the different test sets (depending on the language directions), as well as on each specific test set. It is important to note that all the test sets will be released together, but specific information to identify the different test sets will be associated with the data. Each audio file will have a clear identifier of the type of data, e.g. TEDtalk_1.wav, ACL_1.wav, Press_1.wav. More detailed information will be released with the test sets.
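
For illustration only, a minimal sketch of how the released audio could be grouped into its per-test-set subsets from these file-name identifiers (the directory name is hypothetical and the prefixes follow the examples above):

# Group released wav files by test-set prefix (e.g. TEDtalk, ACL, Press).
# Directory name and prefix convention are assumptions based on the examples.
from collections import defaultdict
from pathlib import Path

def group_by_test_set(wav_dir):
    groups = defaultdict(list)
    for wav in sorted(Path(wav_dir).glob("*.wav")):
        prefix = wav.stem.rsplit("_", 1)[0]   # "TEDtalk_1" -> "TEDtalk"
        groups[prefix].append(wav)
    return groups

for test_set, files in group_by_test_set("test_release/wavs").items():
    print(test_set, len(files), "files")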

Test Data

Coming Soon!

Past Editions Development Data

The development data is not segmented using the reference transcript. The archives contain a segmentation into sentence-like segments generated with automatic tools; however, participants may also use a different segmentation. The data is provided as an archive with the following files ($set e.g. IWSLT.TED.dev2010):

  • $set.en-de.en.xml: Reference transcript (will not be provided for evaluation data)
  • $set.en-de.de.xml: Reference translation (will not be provided for evaluation data)
  • CTM_LIST: Ordered file list containing the ASR output CTM files (will not be provided for evaluation data; generated by ASR systems that use additional data)
  • FILE_ORDER: Ordered file list containing the wav files
  • $set.yaml: This file contains the time steps for sentence-like segments. It is generated by the LIUM Speaker Diarization tool.
  • $set.h5: This file contains the 40-dimensional Filterbank features for each sentence-like segment of the test data created by XNMT.
  • The last two files are created by the following command: python -m xnmt.xnmt_run_experiments /opt/SLT.KIT/scripts/xnmt/config.las-pyramidal-preproc.yaml
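
As an illustration, a minimal sketch for inspecting the last two files (the YAML field names wav/offset/duration follow the usual IWSLT/MuST-C convention, and the HDF5 layout with one dataset per segment is an assumption; check the released archives):

# Inspect the automatic segmentation and the precomputed features.
# pip install pyyaml h5py
import yaml
import h5py

set_name = "IWSLT.TED.dev2010"   # example set name from above

with open(f"{set_name}.yaml", encoding="utf-8") as f:
    segments = yaml.safe_load(f)   # list of sentence-like segments

for i, seg in enumerate(segments[:3]):
    # field names (wav, offset, duration) are an assumption
    print(i, seg.get("wav"), seg.get("offset"), seg.get("duration"))

with h5py.File(f"{set_name}.h5", "r") as feats:
    first_key = sorted(feats.keys())[0]
    print(first_key, feats[first_key].shape)   # expected (frames, 40) filterbanks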

Training Data and Data Conditions

Coming Soon

Submission Guidelines

  • Multiple run submissions are allowed, but participants must explicitly indicate one PRIMARY run for each track. All other run submissions are treated as CONTRASTIVE runs. In the case that none of the runs is marked as PRIMARY, the latest submission (according to the file time-stamp) for the respective track will be used as the PRIMARY run.
  • Submissions must be packaged as a gzipped TAR archive (see format below) and sent as an email attachment to iwslt_offline_task_submission@fbk.eu.
  • The TAR archive should include in the file name the type of system (cascade/end-to-end) used to generate the submission.
  • Each run has to be stored as a plain text file with one sentence per line.
  • Scoring will be case-sensitive and will include punctuation. Submissions have to be in UTF-8. Tags such as applause, laughing, etc. are not considered during the evaluation.

TAR archive file structure:

<UserID>/<Set>.<LangDir>.<Task>.<UserID>.primary.txt
        /<Set>.<LangDir>.<Task>.<UserID>.contrastive1.txt
        /<Set>.<LangDir>.<Task>.<UserID>.contrastive2.txt
        /...

where:
<UserID> = user ID of the participant: use the short name chosen in the registration form (e.g., the name of your institution)
<Set> = IWSLT21.SLT.tst2021
<LangDir> = en-de/zh/ja, using language identifiers (LIDs) as given by ISO 639-1 codes
<Task> = OfflineTask
For example: FBK/IWSLT21.SLT.tst2021.en-de.OfflineTask.FBK.primary.txt
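
As an illustration, a minimal packaging sketch following this naming scheme (the local run paths, the set and language values, and the way the system type is encoded in the archive name are assumptions to be adapted to your submission):

# Package runs as a gzipped TAR following the naming scheme above.
import tarfile

user_id = "FBK"                    # example from the text; use your registered UserID
set_name = "IWSLT21.SLT.tst2021"
lang_dir = "en-de"
task = "OfflineTask"
system_type = "cascade"            # or "end-to-end"

runs = {
    "primary": "runs/primary.txt",            # hypothetical local paths,
    "contrastive1": "runs/contrastive1.txt",  # one sentence per line, UTF-8
}

archive = f"{user_id}.{system_type}.tgz"
with tarfile.open(archive, "w:gz") as tar:
    for run_label, local_path in runs.items():
        arcname = f"{user_id}/{set_name}.{lang_dir}.{task}.{user_id}.{run_label}.txt"
        tar.add(local_path, arcname=arcname)
print("wrote", archive)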

All the submissions should be sent to this address: iwslt_offline_task_submission@fbk.eu

The email should include the following information:

  • Institute:
  • Contact Person:
  • Email:
  • Data condition: Constrained/Unconstrained
  • Segmentation: Own/Given
  • Brief abstract about the system:
  • Multilingual: Yes/No
  • Do you want to make your submissions freely available for research purposes? (yes/no)

Contacts

Chairs: Matteo Negri (FBK, Italy), Marco Turchi (Zoom, Germany)

Discussion: iwslt-evaluation-campaign@googlegroups.com

Organizers

Sebastian Stüker (Zoom, Germany)
Jan Niehues (KIT, Germany)
Tsz Kin Lam (The University of Edinburgh, the United Kingdom)
Barry Haddow (The University of Edinburgh, the United Kingdom)