Offline ST track
Description
The advent of large language models (LLMs) offers unprecedented opportunities to address traditional natural language processing (NLP) tasks in real-world scenarios and under diverse data conditions. Spoken Language Translation (SLT), which involves automatically translating spoken audio into text in a different language, is no exception, thanks to the possibility of fine-tuning powerful LLMs for specific tasks, domains, and languages, or of employing them in zero-shot settings when suitable adaptation data is unavailable.
The goal of the Offline Speech Translation Task at IWSLT, the task with the longest-standing tradition at the conference, is to provide a stable evaluation framework for tracking technological advancements in SLT, with a focus on unconstrained speech translation, free from the temporal and structural constraints imposed by tasks such as simultaneous translation or subtitling. To this end, while maintaining the overall task formulation is essential, the emphasis has shifted over the years towards incrementally raising the task's difficulty to better reflect real-world needs, including the translation of new and diverse languages, domains, and speaking styles.
In this spirit, this year’s edition aims to:
- include a new and challenging language, Japanese;
- offer a varied scenario in terms of domains (news, business news, and TV series), speaking styles, and recording conditions (e.g., single speakers, multiple overlapping speakers, background noise, accent data);
- promote the development and use of flexible systems capable of operating in this multi-domain scenario, without resorting to ad-hoc, domain-specialized models;
- explore systems' ability to operate in a "source language agnostic" scenario (a newly introduced track, see below), where the input language is unknown.
Systems' performance will be evaluated with respect to their capability to produce translations similar to the target-language references. Such similarity will be measured with multiple automatic metrics: COMET, BLEURT, BLEU, TER, and characTER. As in previous editions of the campaign, the submitted runs will be ranked based on COMET, calculated on the test set after automatically resegmenting the hypotheses against the reference translations with mwerSegmenter. The detailed evaluation script can be found in the SLT.KIT. Moreover, as a complement to automatic evaluation, human evaluation will be performed on each participant's best-performing submission.
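As a rough illustration of the scoring pipeline, the sketch below computes COMET, BLEU, and TER on hypotheses that have already been resegmented against the references with mwerSegmenter. The checkpoint and file names are illustrative assumptions, not the official evaluation configuration (which is defined by the SLT.KIT scripts); BLEURT and characTER are computed with their own tools.

```python
# Minimal scoring sketch (not the official SLT.KIT pipeline).
# Assumes hypotheses were already resegmented with mwerSegmenter and that
# the src/hyp/ref files are line-aligned; file names are placeholders.
from comet import download_model, load_from_checkpoint  # pip install unbabel-comet
import sacrebleu                                         # pip install sacrebleu

srcs = [line.strip() for line in open("tst.src")]
hyps = [line.strip() for line in open("tst.hyp.resegmented")]
refs = [line.strip() for line in open("tst.ref")]

# COMET (the primary ranking metric); this checkpoint choice is an assumption.
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(srcs, hyps, refs)]
print("COMET:", model.predict(data, batch_size=8, gpus=0).system_score)

# Complementary string-based metrics.
print("BLEU:", sacrebleu.corpus_bleu(hyps, [refs]).score)
print("TER:", sacrebleu.metrics.TER().corpus_score(hyps, [refs]).score)
```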
Tracks
For this round of the Offline Speech Translation Task, we propose two tracks: language-aware and language-agnostic.
Language-aware: This track follows the traditional format of previous rounds, where participants are challenged with test sets covering a predefined list of language directions. Submissions may be made for any of the following directions:
- English -> German: TV series, scientific presentations, call center two-person conversations, YouTube, business news, and accent challenge data.
- English -> Chinese: TV series, scientific presentations, call center two-person conversations, YouTube, and business news.
- English -> Japanese: TV series, scientific presentations, call center two-person conversations, YouTube, and business news.
- English -> Arabic: business news.
For more information about the data, please refer to the subtitling task page. The description of the call center two-person conversations data can be found here.
Language-agnostic: This is a newly introduced track designed to test a system's ability to translate speech when the source language is unknown. By removing the requirement for pre-defined source language labels, the track aims to catalyze the development of truly universal models capable of frictionless, human-like understanding, adapting to the speaker regardless of the language they speak.
Evaluation Conditions
Both cascade and end-to-end models will be evaluated. We kindly ask each participant to specify at submission time whether a cascade or an end-to-end model has been used.
In continuity with past rounds, we use the following definition of end-to-end model:
- No intermediate discrete representations (e.g., source-language transcripts as in cascade systems, or target-language hypotheses as in ROVER-style system combination)
- All parameters/parts that are used during decoding need to be trained on the end-to-end task (they may also be trained on other tasks, so multitasking is allowed, while LM rescoring is not)
All the systems will be evaluated using a combination of the different test sets (depending on the language directions) and each specific test suite, if any. It is important to note that all the test sets will be released together, but specific information to identify the different test sets will be associated with the data. Each audio file will have a clear identifier of the type of data, e.g., News_1.wav, ACL_1.wav, Press_1.wav, as illustrated in the sketch below. More detailed information will be released with the test sets.
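Since all domains arrive in a single release, a submission pipeline can route files to domain-specific handling by the identifier prefix in the file name. The sketch below is illustrative only; the directory name and the exact naming scheme are assumptions pending the official test-set release.

```python
# Illustrative routing of the jointly released test files by domain prefix
# (e.g., News_1.wav -> "News"); paths and the naming scheme are assumptions.
from collections import defaultdict
from pathlib import Path

by_domain = defaultdict(list)
for wav in sorted(Path("test_release").glob("*.wav")):
    domain = wav.stem.rsplit("_", 1)[0]  # "News_1" -> "News"
    by_domain[domain].append(wav)

for domain, files in sorted(by_domain.items()):
    print(f"{domain}: {len(files)} files")
```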
Test Data
Past Editions Development Data
The development data is not segmented using the reference transcript. The archives contain a segmentation into sentence-like segments produced by automatic tools. However, participants may also use a different segmentation. The data is provided as an archive with the following files:
- $set.en-de.en.xml: Reference transcript (will not be provided for evaluation data)
- $set.en-de.de.xml: Reference translation (will not be provided for evaluation data)
- CTM_LIST: Ordered file list containing the ASR output CTM files, generated by ASR systems trained on additional data (will not be provided for evaluation data)
- FILE_ORDER: Ordered file list containing the wav files
- $set.yaml: This file contains the time steps for the sentence-like segments, generated by the LIUM Speaker Diarization tool (see the sketch after this list).
- $set.h5: This file contains the 40-dimensional Filterbank features for each sentence-like segment of the test data created by XNMT.
- The last two files are created by the following command: `python -m xnmt.xnmt_run_experiments /opt/SLT.KIT/scripts/xnmt/config.las-pyramidal-preproc.yaml`
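The sketch below shows how the provided segmentation can be applied to the audio, assuming $set.yaml follows the usual IWSLT layout (one entry per segment with wav, offset, and duration in seconds). File names are placeholders, and the .h5 inspection only lists keys, since the exact feature layout is not specified here.

```python
# Slice the release audio into the provided sentence-like segments.
# Assumes the usual IWSLT yaml layout: [{wav, offset, duration, speaker_id}, ...]
import yaml             # pip install pyyaml
import soundfile as sf  # pip install soundfile
import h5py             # pip install h5py

segments = yaml.safe_load(open("IWSLT.tst.en-de.yaml"))  # placeholder name
for seg in segments:
    sr = sf.info(seg["wav"]).samplerate  # wav path may need a directory prefix
    start = int(seg["offset"] * sr)
    stop = start + int(seg["duration"] * sr)
    audio, _ = sf.read(seg["wav"], start=start, stop=stop)
    # `audio` now holds one sentence-like segment ready for the ST system

# Peek at the precomputed 40-dimensional filterbank features.
with h5py.File("IWSLT.tst.h5", "r") as f:  # placeholder name
    print(list(f.keys())[:3])
```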
Training Data and Data Conditions
A “constrained” setup is proposed as the official training data condition, in which the allowed training data is limited to a medium-sized framework in order to keep the training time and resource requirements manageable. To allow participants to leverage large language models alongside medium-sized resources, we also propose a “constrained with large language models” condition, where participants can use the training data allowed in the constrained condition plus any additional LLMs, as long as they are released under a permissive license. Finally, to allow the participation of teams equipped with high computational power and effective in-house solutions built on additional resources, an “unconstrained” setup without data restrictions is also proposed.
- Constrained training: Under this condition, the allowed training resources are the following ones (note that the list does not include any pre-trained language model):
| Data type | src lang | tgt lang | Training corpus (URL) | Version | Comment |
|---|---|---|---|---|---|
| speech | en | – | LibriSpeech ASR corpus | v12 | includes translations into pt, not to be used |
| speech | en | – | How2 | na | |
| speech | en | – | Mozilla Common Voice | v11.0 | |
| speech | en | – | Vox Populi | na | only translation, no transcription |
| speech-to-text-parallel | en | de, ar, zh, ja | CoVoST | v2 | |
| speech-to-text-parallel | en | de | Europarl-ST | v1.1 | |
| text-parallel | en | ar | UNPC | v1.0 | |
| text-parallel | en | de | Europarl | v10 | |
| text-parallel | en | ar, de, ja | Tanzil | v1 | |
| text-parallel | en | zh, de, ar, ja | NewsCommentary | v18 | |
| text-parallel | en | ar | GlobalVoices | v2018q4 | |
| text-parallel | en | ar, zh, de, ja | OpenSubtitles | v2018 | |
| text-parallel | en | de | OpenSubtitles | v2018 apptek | partially re-aligned, filtered, with document meta-information on genre |
| text-parallel | en | ar, zh, de, ja | Tatoeba | v2023-04-12 | |
| text-parallel | en | ja | JParaCrawl | na | |
| text-parallel | en | ar | ELRC_2922 | v1 | |
| text-parallel | en | de | ELRC-CORDIS_News | v1 | |
| text-monolingual | – | de | OpenSubtitles with subtitle breaks | v2018-apptek | superset of parallel data, with subtitle breaks and document meta-info on genre, automatically predicted line breaks |
Note: this list is identical to the one available in the subtitling task. Some training data are specific to the subtitling task, including subtitle boundaries (<eob> and <eol>).
- Constrained with Large Language Models training: Under this condition, all the constrained resources plus freely accessible large language models released under a permissive license are allowed.
- Unconstrained training: any resource, pre-trained language models included, can be used, with the exception of the evaluation sets.
Submission Guidelines
The evaluation will be performed using the Meetween SPEECHM Evaluation Server. More information will be available in March.
Contacts
Chairs: Matteo Negri (FBK, Italy), Marco Turchi (Zoom, Germany)
Discussion: iwslt-evaluation-campaign@googlegroups.com
Organizers
Sebastian Stüker (Zoom, Germany)
Jan Niehues (KIT, Germany)