Offline track

Description

Recent advances in deep learning make it possible to address traditional NLP tasks in a new and completely different manner. One of these tasks is spoken language translation (SLT). For years, SLT has been addressed by cascading an automatic speech recognition (ASR) system and a machine translation (MT) system. Recent trends rely on using a single neural network to directly translate the input audio signal in one language into text in a different language without intermediate symbolic representations (e.g., transcriptions).

The goal of the Offline Speech Translation Task, the one with the longest tradition at IWSLT, is to examine automatic methods for translating audio speech in one language into text in the target language. This has to be done by exploiting either cascaded solutions or end-to-end approaches. Although the results of the last few editions have confirmed that the performance of end-to-end models is approaching that of cascade solutions, it is currently not clear which of the two technologies is more effective. Moreover, all the recent evaluations have been based on test sets extracted from TED talks, which represent a relatively simple application scenario compared to the variety of potential real-world deployments of SLT technology. In this controlled scenario, a single speaker delivers a prepared speech without background noise or interaction with other speakers.

Finally, last year’s edition showed that introducing complexity to the scenario (e.g., including spontaneous speech, terminology, and dialogues) resulted in a clear degradation of the performance of both technologies compared to the use of the classic TED talk test set.

In addition to answering the question of whether the cascade solution is still the dominant technology, this year we will address an additional research question in the evaluation:

Is the current spoken language translation technology able to deal with more complex scenarios (e.g., spontaneous speech, terminology, different accents, background noise, and dialogues)? In addition to the classic TED talk test set from English into German, the task introduces three more test sets that cover more challenging scenarios:
- TV series: in this scenario, multiple persons interact in different scenarios. The speech translation system needs to deal with overlapping speakers, different accents, and background noise.
- Physical training sessions: in this scenario, persons are speaking while practicing in the gym. The speech translation system needs to deal with background noise and an informal speaking style.
- Accent challenge data: accent, a specific pronunciation used by people in a particular area, country, or social group, is a characteristic unique to the speech modality. Even when people speak the same language, such specific pronunciation can hinder communication between different groups. However, accent is rarely examined in SLT systems. Each submitted system, whether or not it is under “constrained” data conditions, will be evaluated on this extra test set. Participants are also welcome to adapt their systems for this robustness challenge using accent-related data, e.g., the VCTK corpus, the LibriTTS corpus, and the ACL 60/60 evaluation sets.

Similarly to last year, three language directions are proposed in the offline task. Each language direction will be tested in different evaluation scenarios:

English -> German: TED talks, TV series, physical training sessions, and accent challenge data.
English -> Japanese: TED talks.
English -> Chinese: TED talks.

The system’s performance will be evaluated with respect to its capability to produce translations similar to the target-language references. Such similarity will be measured in terms of multiple automatic metrics: COMET, BLEURT, BLEU, TER, and characTER. The submitted runs will be ranked based on the COMET score calculated on the test set, after the hypothesis has been automatically resegmented against the reference translation with mwerSegmenter. The detailed evaluation script can be found in SLT.KIT. Moreover, to meet the requests of last year’s participants, a human evaluation will be performed on the best-performing submission of each participant.

In addition to evaluating the submitted systems on the official test sets, in this edition the organizers offer the possibility to submit additional test suites. The goal of a test suite is to evaluate an SLT system on specific aspects that are generally hidden by the classic evaluation frameworks. More information is available in the Test Suite section. This means that each participant will translate both the official test sets and the test suites. While the official evaluation will be based only on the official test sets, the test suites will make it possible to identify specific and challenging aspects that affect SLT performance.

🆕 Novelties in a nutshell:

New test data (accent challenge data)
Test suite evaluation
Novel primary metric: COMET
Test suites

Evaluation Conditions

Both cascade and end-to-end models will be evaluated. We kindly ask each participant to specify at submission time if a cascade or an end-to-end model has been used.

In this task, we use the following definition of end-to-end model:

No intermediate discrete representations (e.g., source-language transcripts as in cascade systems, or target-language hypotheses as in ROVER)
All parameters/parts that are used during decoding need to be trained on the end-to-end task (they may also be trained on other tasks: multitasking is OK, LM rescoring is not OK)

All the systems will be evaluated on the combination of the different test sets (depending on the language direction) and on each specific test set. It is important to note that all the test sets will be released together, but specific information to identify the different test sets will be associated with the data. Each audio file will have a clear identifier of the type of data, e.g., TEDtalk_1.wav, ACL_1.wav, Press_1.wav. More detailed information will be released with the test sets.
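
Such identifiers make it possible to score each test set separately. The following is a minimal Python sketch, assuming that the data type is everything before the last underscore in the file name; the final naming scheme will only be known once the test sets are released, and `group_by_test_set` is a hypothetical helper, not part of any official tooling.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_test_set(filenames):
    """Group audio file names by the data-type prefix before the last underscore.

    Assumption: names look like '<Type>_<index>.wav', as in the examples above.
    """
    groups = defaultdict(list)
    for name in filenames:
        stem = PurePosixPath(name).stem            # 'TEDtalk_1.wav' -> 'TEDtalk_1'
        data_type, sep, _index = stem.rpartition("_")
        if not sep:                                # no underscore: keep the file visible
            data_type = "unknown"
        groups[data_type].append(name)
    return dict(groups)


files = ["TEDtalk_1.wav", "ACL_1.wav", "TEDtalk_2.wav", "Press_1.wav"]
groups = group_by_test_set(files)
for data_type, names in sorted(groups.items()):
    print(data_type, names)
```

Files whose names do not follow the assumed pattern are collected under `unknown` rather than silently dropped.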

Test Data

The test data includes the official offline task data plus the test suite data (see below).

You can download it here:

tst2024

Test Suite

Test suites are custom extensions to the standard offline test sets, constructed so that they can focus on particular aspects of the SLT output. The goal of a test suite is to investigate specific aspects that are generally omitted by the classic evaluation strategies. Each test suite also evaluates these aspects in its own custom way. The composition of a test suite and its evaluation are entirely up to the test suite provider.

If you are interested in submitting a test suite, please send us a link to the data including the audio and a textual file describing the goal of the test suite. The format of the audio files is similar to the format of the test audio in the previous editions: a folder with the WAV files and a textual file containing the order in which the audio files will be processed. To share the test suite link, please use the following email: iwslt_offline_task_submission@fbk.eu
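
Before sharing a test suite, it may help to check that the folder and the order file agree. The following is a minimal Python sketch, assuming that the order file sits inside the folder, lists one WAV file name per line, and is called `FILE_ORDER` (the name used in the development data); `check_test_suite` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def check_test_suite(folder, order_file="FILE_ORDER"):
    """Return (missing, unlisted) WAV names for a test suite folder.

    Assumptions: the order file is inside the folder and lists one WAV
    file name per line; blank lines are ignored.
    """
    folder = Path(folder)
    text = (folder / order_file).read_text(encoding="utf-8")
    listed = [line.strip() for line in text.splitlines() if line.strip()]
    present = {p.name for p in folder.glob("*.wav")}
    missing = [name for name in listed if name not in present]   # listed, not on disk
    unlisted = sorted(present - set(listed))                     # on disk, not listed
    return missing, unlisted


# Small demonstration on a temporary folder with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    suite = Path(tmp)
    (suite / "a.wav").write_bytes(b"")
    (suite / "b.wav").write_bytes(b"")
    (suite / "FILE_ORDER").write_text("a.wav\nc.wav\n", encoding="utf-8")
    result = check_test_suite(suite)

print(result)
```

Both returned lists should be empty for a consistent test suite.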

All the test suites will then be merged and made available to the participants in the test set section. Once the translations are received, they will be split according to the test suites and forwarded to the owners of the test suites. Each owner is expected to complete the evaluation in time for it to be included in the findings paper.

Important date:

The test suite should be submitted by the 1st of March.

For more information about the test suite: iwslt-evaluation-campaign@googlegroups.com

Past Editions Development Data

The development data is not segmented using the reference transcript. The archives contain a segmentation into sentence-like segments produced by automatic tools. However, participants may also use a different segmentation. The data is provided as an archive with the following files ($set is, e.g., IWSLT.TED.dev2010):

$set.en-de.en.xml: Reference transcript (will not be provided for evaluation data)
$set.en-de.de.xml: Reference translation (will not be provided for evaluation data)
CTM_LIST: Ordered file list containing the ASR Output CTM Files (will not be provided for evaluation data) (Generated by ASR systems that use more data)
FILE_ORDER: Ordered file list containing the wav files
$set.yaml: This file contains the time steps for sentence-like segments. It is generated by the LIUM Speaker Diarization tool.
$set.h5: This file contains the 40-dimensional Filterbank features for each sentence-like segment of the test data created by XNMT.
The last two files are created by the following command: python -m xnmt.xnmt_run_experiments /opt/SLT.KIT/scripts/xnmt/config.las-pyramidal-preproc.yaml

Development data:

IWST.OfflineTask

Training Data and Data Conditions

A “constrained” setup is proposed as the official training data condition, in which the allowed training data is limited to a medium-sized framework in order to keep the training time and resource requirements manageable. In order to allow participants to leverage large language models and medium-sized resources, we propose a “constrained with large language models” condition, where a specific set of language models is allowed. In order to allow the participation of teams equipped with high computational power and effective in-house solutions built on additional resources, an “unconstrained” setup without data restrictions is also proposed.

Constrained training: Under this condition, the allowed training resources are the following ones (note that the list does not include any pre-trained language model):

Data type	src lang	tgt lang	Training corpus (URL)	Version	Comment
speech	en	–	LibriSpeech ASR corpus	v12	includes translations into pt, not to be used
speech	en	–	How2	na
speech	en	–	Mozilla Common Voice	v11.0
speech	en	–	TED LIUM	v2/v3
speech	en	–	Vox Populi	na
speech-to-text-parallel	en	de	MUST-C	v1.2/v2.0/v3.0
speech-to-text-parallel	en	ja, zh	MUST-C	v2.0
speech-to-text-parallel	en	de	MUST-Cinema	v1.0	with subtitle and line breaks
speech-to-text-parallel	en	de	Speech Translation TED corpus	na
speech-to-text-parallel	en	de, ja, zh	CoVoST	v2
speech-to-text-parallel	en	de	Europarl-ST	v1.1
text-parallel	en	de	Europarl	v10
text-parallel	en	es	Europarl	v8
text-parallel	en	es, zh, de, ja	NewsCommentary	v16
text-parallel	en	es, zh, de, ja	OpenSubtitles	v2018
text-parallel	en	de	OpenSubtitles	v2018 apptek	partially re-aligned, filtered, with document meta-information on genre
text-parallel	en	es	OpenSubtitles	v2018 apptek	partially re-aligned, filtered, with document meta-information on genre
text-parallel	en	ja	JParaCrawl
text-parallel	en	de	TED2020	v1
text-parallel	en	es	TED2020	v1
text-parallel	en	es, zh, de, ja	Tatoeba	v2022-03-03
text-parallel	en	es	ELRC-CORDIS_News	v1
text-parallel	en	de	ELRC-CORDIS_News	v1
text-monolingual	–	de	OpenSubtitles with subtitle breaks	v2018-apptek	superset of parallel data, with subtitle breaks and document meta-info on genre, automatically predicted line breaks
text-monolingual	–	es	OpenSubtitles with subtitle breaks	v2018-apptek	superset of parallel data, with subtitle breaks and document meta-info on genre, automatically predicted line breaks

Note: this list is identical to the one available in the subtitle task. Some training data are specific for the subtitling task including subtitle boundaries (<eob> and <eol>).

Constrained with Large Language Models training: Under this condition, all the constrained resources plus a restricted selection of large language models are allowed. The following pre-trained language models are considered parts of the training data and freely usable to build the SLT systems:

Unconstrained training: Any resource, pre-trained language models included, can be used, with the exception of evaluation sets.

Submission Guidelines

Multiple run submissions are allowed, but participants must explicitly indicate one PRIMARY run for each track. All other run submissions are treated as CONTRASTIVE runs. In the case that none of the runs is marked as PRIMARY, the latest submission (according to the file time-stamp) for the respective track will be used as the PRIMARY run.
Submissions must be packaged as a gzipped TAR archive (see format below) and sent as an email attachment to iwslt_offline_task_submission@fbk.eu.
The file name of the TAR archive should include the type of system (cascade/end-to-end) used to generate the submission.
Each run has to be stored as a plain text file with one sentence per line.
Scoring will be case-sensitive and will include punctuation. Submissions have to be in UTF-8. Tags such as applause, laughing, etc. are not considered during the evaluation.
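
The PRIMARY fallback rule above can be sketched in Python as follows. This is only an illustration of the rule, not the organizers' actual tooling: it assumes that a run is marked PRIMARY by `.primary.` in its file name, and the `Run` class and `select_primary` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Run:
    filename: str
    timestamp: float  # file time-stamp, e.g. seconds since the epoch


def select_primary(runs):
    """Pick the PRIMARY run among the runs submitted for one track.

    Assumption: a run is marked PRIMARY by '.primary.' in its file name.
    If no run is marked, the latest one by time-stamp is used.
    """
    marked = [r for r in runs if ".primary." in r.filename]
    if len(marked) > 1:
        raise ValueError("more than one run is marked PRIMARY")
    if marked:
        return marked[0]
    return max(runs, key=lambda r: r.timestamp)


# No run is marked PRIMARY here, so the most recent one is chosen.
runs = [
    Run("IWSLT21.SLT.tst2021.en-de.OfflineTask.FBK.contrastive1.txt", 100.0),
    Run("IWSLT21.SLT.tst2021.en-de.OfflineTask.FBK.contrastive2.txt", 200.0),
]
chosen = select_primary(runs)
print(chosen.filename)
```

In this example the second run is selected because it has the later time-stamp.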

TAR archive file structure:

< UserID >/< Set >.< LangDir >.< Task >.< UserID >.primary.txt  
  /< Set >.< LangDir >.< Task >.< UserID >.contrastive1.txt  
  /< Set >.< LangDir >.< Task >.< UserID >.contrastive2.txt  
  /...  

where:
< UserID > = user ID of the participant, i.e., the short name chosen in the registration form (e.g. the name of your institution)
< Set > = IWSLT21.SLT.tst2021
< LangDir > = en-de/zh/ja, using language identifiers (LIDs) as given by ISO 639-1 codes
< Task > = OfflineTask.
For example, FBK/IWSLT21.SLT.tst2021.en-de.OfflineTask.FBK.primary.txt
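
The naming scheme and archive layout above can be reproduced with Python's standard `tarfile` module. This is a minimal sketch under stated assumptions: the helper names `run_path` and `pack_runs` are hypothetical, and so is the archive name `FBK_cascade.tar.gz` (the guidelines only require that the archive name includes the system type).

```python
import io
import os
import tarfile
import tempfile


def run_path(user_id, lang_dir, label,
             set_name="IWSLT21.SLT.tst2021", task="OfflineTask"):
    """Build '<UserID>/<Set>.<LangDir>.<Task>.<UserID>.<label>.txt'."""
    return f"{user_id}/{set_name}.{lang_dir}.{task}.{user_id}.{label}.txt"


def pack_runs(archive_path, user_id, lang_dir, runs):
    """Write runs ({label: text}) into a gzipped TAR archive as UTF-8 files."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for label, text in runs.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=run_path(user_id, lang_dir, label))
            info.size = len(data)              # required when adding from memory
            tar.addfile(info, io.BytesIO(data))


# Demonstration with a made-up one-sentence run.
with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "FBK_cascade.tar.gz")  # hypothetical archive name
    pack_runs(archive, "FBK", "en-de", {"primary": "Hallo Welt.\n"})
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()

print(names)
```

Listing the archive members before sending the email is a quick way to confirm that each path matches the required pattern.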

All the submissions should be sent to this address: iwslt_offline_task_submission@fbk.eu

The email should include the following information:

Institute:
Contact Person:
Email:
Data condition: Constrained/Unconstrained
Segmentation: Own/Given
Brief abstract about the system:
Multilingual: Yes/No
Do you want to make your submissions freely available for research purposes? (yes/no)

Contacts

Chairs: Matteo Negri (FBK, Italy), Marco Turchi (Zoom, Germany)

Discussion: iwslt-evaluation-campaign@googlegroups.com

Organizers

Sebastian Stüker (Zoom, Germany)
Jan Niehues (KIT, Germany)
Roldano Cattoni (FBK, Italy)
Tsz Kin Lam (The University of Edinburgh, the United Kingdom)
Barry Haddow (The University of Edinburgh, the United Kingdom)
Marcely Zanon Boito (NAVER LABS Europe, France)
