Recent advances in deep learning make it possible to address traditional NLP tasks in a completely new way. One of these tasks is spoken language translation (SLT). For years, SLT has been addressed by cascading an automatic speech recognition (ASR) system and a machine translation (MT) system. Recent approaches use a single neural network to translate the input audio signal in one language directly into text in a different language, without intermediate symbolic representations such as transcriptions.

The goal of the Offline Speech Translation Task is to examine automatic methods for translating audio speech in one language into text in the target language. This can be done with either cascaded solutions or end-to-end approaches. The results of the last editions have confirmed that the performance of end-to-end models is approaching that of cascade solutions, but they have not identified the best-performing technology. Moreover, all recent evaluations have been based on test sets extracted from TED talks. In this controlled scenario, a single speaker delivers a prepared speech without background noise or interaction with other speakers.

In addition to the question of whether the cascade solution is still the dominant technology, this year's evaluation will address a further research question:

  • Is the current spoken language translation technology able to deal with more complex scenarios (e.g. spontaneous speech, terminology, and dialogues)? In addition to the classic TED talk test set from English into German, the task introduces two more test sets that face more challenging scenarios:
    • ACL presentations: a single speaker is presenting on a stage. Although this is similar to the TED talk scenario, the speech translation system needs to deal with non-native speakers, different accents, varying recording quality, terminology, and controlled interaction with a second speaker.
    • Press conferences and interviews: in this scenario, two people interact on different topics. The speech translation system needs to deal with non-native speakers, different accents, controlled interaction with a second speaker, and spontaneous speech.

Similarly to last year, three language directions are proposed in the offline task. Each language direction will be tested in different evaluation scenarios:

  • English -> German: TED talks, ACL presentations, and press conferences and interviews.
  • English -> Japanese: TED talks and ACL presentations.
  • English -> Chinese: TED talks and ACL presentations.

System performance will be evaluated with respect to the capability to produce translations similar to the target-language references. This similarity will be measured with multiple automatic metrics: BLEU, TER, BEER and characTER. The submitted runs will be ranked by BLEU computed on the test set, after automatic resegmentation of the hypotheses against the reference translations with mwerSegmenter. The detailed evaluation script can be found in SLT.KIT. Moreover, to meet the requests of last year’s participants, a human evaluation will be performed on the best-performing submission of each participant.

Evaluation Conditions

Both cascade and end-to-end models will be evaluated. We kindly ask each participant to specify at submission time if a cascade or an end-to-end model has been used.

In this task, we use the following definition of end-to-end model:

  • No intermediate discrete representations (in the source language as in cascade systems, or in the target language as in ROVER)
  • All parameters and components used during decoding need to be trained on the end-to-end task (they may also be trained on other tasks, so multitask training is allowed, but LM rescoring is not)

All the systems will be evaluated on the combination of the different test sets (depending on the language direction) and on each specific test set. It is important to note that all the test sets will be released together, but specific information to identify the different test sets will be associated with the data. Each audio file will have a clear identifier of the type of data, e.g. TEDtalk_1.wav, ACL_1.wav, Press_1.wav. More detailed information will be released with the test sets.

Test Data

Past Editions Development Data

The development data is not segmented using the reference transcript. The archives contain a segmentation into sentence-like segments produced by automatic tools, but participants may also use a different segmentation. The data is provided as an archive with the following files ($set is e.g. IWSLT.TED.dev2010):

  • $set.en-de.en.xml: Reference transcript (will not be provided for evaluation data)
  • $set.en-de.de.xml: Reference translation (will not be provided for evaluation data)
  • CTM_LIST: Ordered file list containing the ASR Output CTM Files (will not be provided for evaluation data) (Generated by ASR systems that use more data)
  • FILE_ORDER: Ordered file list containing the wav files
  • $set.yaml: This file contains the time steps for sentence-like segments. It is generated by the LIUM Speaker Diarization tool.
  • $set.h5: This file contains the 40-dimensional Filterbank features for each sentence-like segment of the test data created by XNMT.
  • The last two files are created by the following command: python -m xnmt.xnmt_run_experiments /opt/SLT.KIT/scripts/xnmt/config.las-pyramidal-preproc.yaml

Development data:

(Please note that the systems that generated the provided ASR outputs use more training data than allowed for this year’s evaluations)

Training Data and Data Conditions

A “constrained” setup is proposed as the official training data condition, in which the allowed training data is limited to a medium-sized set of resources in order to keep the training time and resource requirements manageable. To allow participants to leverage large language models together with medium-sized resources, we propose a “constrained with large language models” condition, in which a specific set of language models is allowed. To also allow the participation of teams equipped with high computational power and effective in-house solutions built on additional resources, an “unconstrained” setup without data restrictions is proposed as well.

  • Constrained training: Under this condition, the allowed training resources are the following ones (note that the list does not include any pre-trained language model):
| Data type | src lang | tgt lang | Training corpus (URL) | Version | Comment |
|---|---|---|---|---|---|
| speech | en | | LibriSpeech ASR corpus | v12 | includes translations into pt, not to be used |
| speech | en | | How2 | na | |
| speech | en | | Mozilla Common Voice | v11.0 | |
| speech | en | | TED LIUM | v2/v3 | |
| speech | en | | Vox Populi | na | |
| speech-to-text-parallel | en | de | MUST-C | v1.2/v2.0/v3.0 | A new version of MuST-C en-de has been released!! please check it out! |
| speech-to-text-parallel | en | ja, zh | MUST-C | v2.0 | |
| speech-to-text-parallel | en | de, es | MUST-Cinema | v1.0 | with subtitle and line breaks |
| speech-to-text-parallel | en | es | MUST-C | v1.2 | same as MUST-Cinema below but without subtitle breaks |
| speech-to-text-parallel | en | de | Speech Translation TED corpus | na | |
| speech-to-text-parallel | en | de, ja, zh | CoVoST | v2 | |
| speech-to-text-parallel | en | de, es | Europarl-ST | v1.1 | |
| text-parallel | en | de | Europarl | v10 | |
| text-parallel | en | es | Europarl | v8 | |
| text-parallel | en | es, zh, de, ja | NewsCommentary | v16 | |
| text-parallel | en | es, zh, de, ja | OpenSubtitles | v2018 | |
| text-parallel | en | de | OpenSubtitles | v2018 apptek | partially re-aligned, filtered, with document meta-information on genre |
| text-parallel | en | es | OpenSubtitles | v2018 apptek | partially re-aligned, filtered, with document meta-information on genre |
| text-parallel | en | ja | JParaCrawl | | |
| text-parallel | en | de | TED2020 | v1 | |
| text-parallel | en | es | TED2020 | v1 | |
| text-parallel | en | es, zh, de, ja | Tatoeba | v2022-03-03 | |
| text-parallel | en | es | ELRC-CORDIS_News | v1 | |
| text-parallel | en | de | ELRC-CORDIS_News | v1 | |
| text-monolingual | | de | OpenSubtitles with subtitle breaks | v2018-apptek | superset of parallel data, with subtitle breaks and document meta-info on genre, automatically predicted line breaks |
| text-monolingual | | es | OpenSubtitles with subtitle breaks | v2018-apptek | superset of parallel data, with subtitle breaks and document meta-info on genre, automatically predicted line breaks |

Note: this list is identical to the one available in the subtitling task. Some of the training data is specific to the subtitling task and includes subtitle boundaries (<eob> and <eol>).

Submission Guidelines

  • Multiple run submissions are allowed, but participants must explicitly indicate one PRIMARY run for each track. All other run submissions are treated as CONTRASTIVE runs. In the case that none of the runs is marked as PRIMARY, the latest submission (according to the file time-stamp) for the respective track will be used as the PRIMARY run.
  • Submissions have to be submitted as a gzipped TAR archive (see format below) and sent as an email attachment to
  • The TAR archive should include in the file name the type of system (cascade/end-to-end) used to generate the submission
  • Each run has to be stored in a plain text file with one sentence per line
  • Scoring will be case-sensitive and will include punctuation. Submissions have to be in UTF-8. Tags such as applause, laughing, etc. are not considered during the evaluation.
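
Before packaging a run, the format requirements above (plain text, UTF-8, one sentence per line) can be checked locally. The following is a minimal sketch and not an official validator: the function name `check_run_file` and the optional `expected_lines` argument are hypothetical, and the number of lines is only fixed when the given segmentation is used.

```python
from typing import Optional


def check_run_file(path: str, expected_lines: Optional[int] = None) -> int:
    """Check that a run file is UTF-8 with one segment per line.

    Returns the number of lines. Raises UnicodeDecodeError for non-UTF-8
    input and ValueError for stray carriage returns or a line-count mismatch.
    """
    with open(path, "rb") as handle:
        raw = handle.read()
    text = raw.decode("utf-8")  # fails loudly on any other encoding
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # ignore the newline that terminates the last line
    for number, line in enumerate(lines, start=1):
        if "\r" in line:
            raise ValueError(f"line {number}: carriage return found")
    if expected_lines is not None and len(lines) != expected_lines:
        raise ValueError(f"expected {expected_lines} lines, found {len(lines)}")
    return len(lines)
```

Empty lines are deliberately accepted, since a system may produce an empty hypothesis for a segment.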

TAR archive file structure:

< UserID >/< Set >.< LangDir >.< Task >.< UserID >.primary.txt  
  /< Set >.< LangDir >.< Task >.< UserID >.contrastive1.txt  
  /< Set >.< LangDir >.< Task >.< UserID >.contrastive2.txt  

< UserID > = user ID of the participant, i.e. the short name chosen in the registration form (e.g. the name of your institution)
< Set > = IWSLT21.SLT.tst2021
< LangDir > = en-de, en-zh or en-ja, using language identifiers (LIDs) as given by ISO 639-1 codes
< Task > = OfflineTask
For example, FBK/IWSLT21.SLT.tst2021.en-de.OfflineTask.FBK.primary.txt
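
The naming scheme above can be applied programmatically when packaging a submission. The following is a minimal sketch using Python's standard `tarfile` module. The helper names `run_file_name` and `build_archive` are hypothetical, and so is the archive name pattern `<UserID>.<system type>.tar.gz`: the guidelines only require that the archive name includes the system type.

```python
import tarfile
from pathlib import Path

SET_NAME = "IWSLT21.SLT.tst2021"  # value of < Set > given above
TASK = "OfflineTask"              # value of < Task > given above


def run_file_name(user_id: str, lang_dir: str, run: str) -> str:
    """Build '<Set>.<LangDir>.<Task>.<UserID>.<run>.txt' (run: 'primary', 'contrastive1', ...)."""
    return f"{SET_NAME}.{lang_dir}.{TASK}.{user_id}.{run}.txt"


def build_archive(user_id: str, system_type: str, runs: dict, out_dir: str) -> Path:
    """Pack run files into a gzipped TAR archive with a top-level <UserID>/ directory.

    `runs` maps (lang_dir, run) pairs, such as ("en-de", "primary"), to the
    local path of the corresponding hypothesis file.
    """
    # Assumed archive name; it carries the system type (cascade/end-to-end).
    archive = Path(out_dir) / f"{user_id}.{system_type}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for (lang_dir, run), local_path in runs.items():
            arcname = f"{user_id}/{run_file_name(user_id, lang_dir, run)}"
            tar.add(local_path, arcname=arcname)
    return archive
```

For instance, `run_file_name("FBK", "en-de", "primary")` yields the file name used in the example above.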

All the submissions should be sent to this address:

The email should include the following information:

  • Institute:
  • Contact Person:
  • Email:
  • Data condition: Constrained/Unconstrained
  • Segmentation: Own/Given
  • Brief abstract about the system:
  • Multilingual: Yes/No
  • Do you want to make your submissions freely available for research purposes? (yes/no)


Chairs: Marco Turchi (Zoom, Germany) and Matteo Negri (FBK, Italy)



Sebastian Stüker (Zoom, Germany)
Jan Niehues (KIT, Germany)
Roldano Cattoni (FBK, Italy)