📢 Announcements 📢
Deadline extension: We extend the deadline for the simultaneous translation track to ‼️Friday, April 17, 23:59 (AoE)‼️
Blind Test Sets Published: We have released the official dev sets and test sets; see below.
OmniSTEval version: use version >=0.1.7 to avoid issues with recording name matching in some datasets.
Asharq Business with Bloomberg: new test set version for SimulST with different segmentation.

Description

Simultaneous translation (also known as real-time or streaming translation) is the task of generating translations incrementally given partial input only. Simultaneous systems are typically evaluated with respect to quality and latency.

This year, there is one main track and one sub-track:

  • Speech-to-Text: simultaneously translating speech in the source language into text in the target language.
  • Speech-to-Text with Extra Context: same as above, but the systems can also leverage extra context (e.g., content of the presented ACL paper).

in the following language directions:

  • English -> German
  • English -> Chinese
  • English -> Italian
  • Czech -> English

We have three focus areas this year:

  • long-form speech: our evaluation will be conducted on unsegmented speech
  • large language models: participants are allowed to use LLMs (details will be announced later)
  • extra context: a sub-track that allows participants to use additional context. This year, we provide the ACL paper PDFs associated with the ACL talks being translated as extra context.

The test set domains are subsets of those used in the offline track:

  • English -> German: ACL talks and accent challenge data
  • English -> Chinese: ACL talks
  • English -> Italian: ACL talks
  • Czech -> English: political conference talks

Training Data and Data Conditions

We follow the same data conditions as in the offline track (see here). Additionally, for the Docker submission, we require the system to be runnable on a single H100 with 80GB of memory.

The data condition for this task is “constrained with large language models (LLMs)”. Any open-weight model with a permissive license is acceptable for use. In addition, pretrained speech encoders and ASR models may be employed. We also encourage participants to submit systems leveraging closed-source models/LLMs for evaluation, but such systems will be evaluated separately and will not be eligible for the main ranking.

English-to-X

Our English-to-X training data condition follows that of the offline track; the full list of datasets is presented below. All listed datasets can be automatically translated with the models allowed in the Constrained with Large Language Models setting. MCIF is the official development data. A derived version including audio, references, YAML files with the audio information (useful for metric computation), and PDFs (useful for the Speech-to-Text with Extra Context track) can be found here.

| Data type | src lang | tgt lang | Training corpus (URL) | Version | Comment |
|---|---|---|---|---|---|
| speech | en | en | LibriSpeech ASR corpus | v12 | includes translations into pt, not to be used |
| speech | en | en | How2 | n/a | |
| speech | en | en | Mozilla Common Voice | v24 | |
| speech | en | en | VoxPopuli | n/a | |
| speech-to-text-parallel | en | de, zh | CoVoST | v2 | |
| speech-to-text-parallel | en | de, it | Europarl-ST | v1.1 | |
| speech-to-text-parallel | en | en | MOSEL | v1, v2 | |
| text-parallel | en | de, it | Europarl | v10 | |
| text-parallel | en | de, it, zh | NewsCommentary | v18 | |
| text-parallel | en | de, it, zh | OpenSubtitles | v2024 | |
| text-parallel | en | de | OpenSubtitles | v2018 apptek | partially re-aligned, filtered, with document meta-information on genre |
| text-parallel | en | de, it, zh | Tatoeba | v2023-04-12 | |
| text-parallel | en | de | ELRC-CORDIS_News | v1 | |

Czech-to-English

  • ParCzech 3.0 (ASR):
    • Allowed data: parczech-3.0-asr-train-20*.tar.gz
    • https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3631?show=full
  • VoxPopuli (ST)
    • Unlabeled data: cs_v2
    • Translated data (cs → en)
    • Speech-to-speech data (cs → en)
    • https://github.com/facebookresearch/voxpopuli
  • Common Voice Corpus 20.0 (ASR)
    • Czech ASR data
    • CV version: 20.0
    • https://commonvoice.mozilla.org/en/datasets
  • CzEng 2.0 (MT)
    • https://ufal.mff.cuni.cz/czeng
  • OpenSubtitles v2018 (MT)
    • https://opus.nlpl.eu/OpenSubtitles/cs&en/v2018/OpenSubtitles
  • Europarl (MT)
    • https://www.statmt.org/europarl/
  • MOSEL (transcripts only)
    • automatic transcripts for unlabeled VoxPopuli audio
    • https://huggingface.co/datasets/FBK-MT/mosel
  • 2025 Dev Set (ST)
  • 2026 Dev Set (ST)

Baselines

Last year's baselines for each language pair can be found here (GitHub).

Baseline implementations for the Speech-to-Text with Extra Context sub-track can be found here. Additional baselines for this year will be provided soon.

Submission

The evaluation implementation will use the latest SimulStream toolkit (see paper here).

For the Speech-to-Text with Extra Context track, participants will also be given a file containing the paths to the PDF files of the ACL papers like this:

/path/to/paper1.pdf
/path/to/paper2.pdf
/path/to/paper3.pdf

Participants are allowed to preprocess the PDF files before running the simultaneous translation system.
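For illustration, a minimal preprocessing step might read this path list and extract plain text from each PDF before the simultaneous run. This is only a sketch: the file name paper_paths.txt is hypothetical, and the pypdf library is just one option among many PDF-to-text tools.

```python
# Sketch: extract plain text from the ACL paper PDFs listed in the provided
# path file, for use as extra context. The file name "paper_paths.txt" is
# hypothetical; any PDF-to-text tool can replace pypdf (pip install pypdf).
from pathlib import Path

from pypdf import PdfReader


def extract_paper_texts(path_list_file: str) -> dict[str, str]:
    """Map each listed PDF path to its extracted plain text."""
    texts: dict[str, str] = {}
    for line in Path(path_list_file).read_text().splitlines():
        pdf_path = line.strip()
        if not pdf_path:
            continue
        reader = PdfReader(pdf_path)
        texts[pdf_path] = "\n".join(page.extract_text() or "" for page in reader.pages)
    return texts


if __name__ == "__main__":
    for path, text in extract_paper_texts("paper_paths.txt").items():
        print(path, len(text.split()), "words")
```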

Participants have two options for the submission: a Docker image containing the complete system, or log files produced by running the system on the test sets (see Dev and Test Sets below).

Systems submitted via Docker image are expected to run on a single NVIDIA H100 GPU with 80 GB of HBM. Additionally, participants must include a README with instructions on how to run the system for each track and language direction. To enable communication between evaluators and participants, a point of contact and email address should be provided in the README in case of issues during evaluation.

Regardless of the submission type (Docker or log), participants must also submit results on the development set (i.e., MCIF or the dedicated Czech-to-English dev set) to determine the latency regime of their submission.

Submission link: Dropbox Folder

Participants will be allowed to update their submissions during the evaluation period. If you have specific questions regarding your submission to the simultaneous shared task, please reach out via e-mail at agostinv@oregonstate.edu.

Evaluation

Metrics

The system’s performance will be evaluated in two ways:

  • Quality:
    • XCOMET-XL (Unbabel/XCOMET-XL)
    • Additional results using other metrics (chrF, BLEURT, …)
  • Latency:
    • For the main ranking, we will use LongYAAL, implemented within OmniSTEval.
    • For consistency with the previous year, we will also include StreamLAAL.

For latency measurement, we will contrast computation-aware and non-computation-aware latency metrics.
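For reference, segment-level quality scoring with XCOMET-XL can be reproduced locally with Unbabel's comet package, roughly as sketched below. The sample sentences are placeholders, and the official pipeline (including resegmentation of long-form output before scoring) is defined by the organizers' tooling.

```python
# Sketch: scoring with XCOMET-XL via Unbabel's `comet` package
# (pip install unbabel-comet). The sample data is a placeholder; the
# official evaluation also resegments long-form output, not shown here.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/XCOMET-XL")  # large model, needs a GPU
model = load_from_checkpoint(model_path)

data = [
    {
        "src": "The weather is nice today.",
        "mt": "Das Wetter ist heute schön.",
        "ref": "Heute ist das Wetter schön.",
    }
]
output = model.predict(data, batch_size=8, gpus=1)
print(output.system_score)  # corpus-level quality score in [0, 1]
```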

Ranking

The systems will be ranked by translation quality within the latency constraints. A system's latency regime (low/high) is determined from the submitted development set logs.

This year, we have two latency regimes, low and high. The latency constraints are shared across all language pairs and are measured by non-computation-aware LongYAAL (a small illustration follows the list):

  • Low: 0-2 seconds,
  • High: 2-4 seconds.
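Concretely, regime assignment from a dev-set latency value could look like the following; this helper is purely illustrative, as the official assignment is performed by the organizers from the submitted logs.

```python
# Illustrative only: maps a non-computation-aware LongYAAL value (seconds)
# measured on the dev set to one of the two latency regimes defined above.
# The official regime assignment is done by the organizers from the logs.
def latency_regime(longyaal_seconds: float) -> str:
    if 0.0 <= longyaal_seconds <= 2.0:
        return "low"
    if longyaal_seconds <= 4.0:
        return "high"
    return "out of range"  # above 4 s: outside both regimes


assert latency_regime(1.3) == "low"
assert latency_regime(2.5) == "high"
```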

Human Evaluation

Human evaluation will be conducted for primary submissions.

Dev and Test Sets

This section describes the dev and test sets for the simultaneous track.

The dev sets will be used to determine the latency regime of the submissions, and dev-set logs are a mandatory part of ALL submissions. The test sets are the same for all submissions, but output logs for them need to be generated only for log-based submissions. For participants submitting Docker images, the evaluation will be conducted on the same test sets, but the organizers will run the submitted Docker images to allow for the comparison of computation-aware latency.

Participants are asked to provide SimulStream log files with the translation outputs and the timestamps of the generated translations for the test sets described below. The test sets consist of long-form audio recordings of talks, which are unsegmented (up to 2.5 hours in duration):

  • ACL Talks presented at the ACL conferences, accompanied by the corresponding ACL paper PDFs, which can be used as extra context for the Speech-to-Text with Extra Context sub-track. The talks are in English and the translations are into German, Chinese, and Italian.
  • Political conference talks for Czech to English.
  • Optional Evaluation Domains:
    • Asharq-Bloomberg news, for English to: Chinese and German.
    • YODAS YouTube dataset, for English to: Chinese and German.

Audio-visual documents of development and evaluation sets are provided in MP4 format (ACL Talks and Asharq-Bloomberg) and WAV format (YODAS and Political conference talks). The translation log files should contain the translations of the audio recordings, along with the timestamps of the generated translations. The log format should follow one of the following:

  • SimulStream format (preferred) - mandatory with the Docker submission,
  • Log-based submissions are allowed to use the legacy SimulEval JSONL format.

See the OmniSTEval and SimulStream documentation for more details on the expected log format.
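As a sanity check before submitting a log-based run, the JSONL file can be validated with a few lines of Python. The field names below ("prediction" for the hypothesis text, "delays" for per-token emission timestamps) follow the legacy SimulEval instances.log convention; verify them against the toolkit documentation for your version.

```python
# Sanity check for a legacy SimulEval-style JSONL log before submission.
# Assumption: each line is a JSON object with "prediction" and "delays"
# fields, as in SimulEval's instances.log; verify for your toolkit version.
import json
import sys


def check_log(path: str) -> None:
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            record = json.loads(line)
            assert "prediction" in record, f"line {lineno}: missing prediction"
            delays = record.get("delays", [])
            assert delays == sorted(delays), f"line {lineno}: delays not monotonic"


if __name__ == "__main__":
    check_log(sys.argv[1])
    print("Log looks well-formed.")
```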

  • Main Evaluation Domain En-to-{German, Chinese, and Italian}: ACL Talks are a collection of talks presented at the ACL conferences. The talks cover a wide range of topics in natural language processing and computational linguistics, and they are accompanied by the corresponding ACL paper PDFs, which can be used as extra context for the Speech-to-Text with Extra Context sub-track. The talks are in English and the translations are into German, Chinese, and Italian:
    • Dev Set: MCIF is the official development set for this track. You can download a derived version including audio, references, YAML files with the audio information (for the quality and latency evaluation), and PDFs for the speech-to-text with extra context track here.
    • Test Set: the audio and optionally the PDFs (for the Speech-to-Text with Extra Context sub-track) can be downloaded from here.
  • Optional Evaluation Domain En-to-{Chinese, German}: Asharq Business with Bloomberg is part of SRMG, the largest integrated media group in the MENA (Middle East and North Africa) region. An exclusive content agreement with ‘Bloomberg Media’ powers this distinguished business news multi-platform, drawing on Bloomberg’s comprehensive coverage from more than 2,700 journalists and analysts globally. Asharq Business with Bloomberg is a leading source for Arabic economic news, rich in context and content, with unparalleled market data, delivered through a TV channel and across digital and social media platforms. Professional human reference translations into Chinese and German have been created by AppTek.

    • The test2026 set can be downloaded from here; it consists of one single recording lasting approximately two hours. The archive contains a README file with important information, audio files, and YAML files which provide the audio segments for which translations must be created. Note: this is an updated version for SimulST participants.
  • Optional Evaluation Domain En-to-{Chinese, German}: YODAS (YouTube-Oriented Dataset for Audio and Speech) is “a large-scale, multilingual dataset comprising currently over 500k hours of speech data in more than 100 languages, sourced from both labeled and unlabeled YouTube speech datasets.” Refer to this paper for more details.
    IMPORTANT NOTE: the “en003” partition of the YODAS dataset is used for selecting dev/test data and is therefore not permitted for training (e.g. for an auxiliary ASR task). This partition had also been used to select a speech recognition benchmarking test set by the creators of the Loquacious dataset and thus is a natural held-out choice. Professional human reference translations into Chinese, Japanese, and German have been created by AppTek.
    • The test2026 set can be downloaded from here; it consists of five audio recordings, each lasting approximately 10 to 30 minutes.
  • Main Evaluation Domain Czech-to-English: The development and test sets for Czech-to-English consist of long-form audio recordings of talks from political conferences.
  • Main Evaluation Domain Czech-to-English with Extra Context: the Czech-to-English test set for the Speech-to-Text with Extra Context sub-track consists of selected recordings of Linguistic Mondays Seminars at Charles University in Prague. The recordings can be downloaded from here. The archive contains the audio recordings and the corresponding PDFs of the presentations, which can be used as extra context for the Speech-to-Text with Extra Context sub-track. There is no dev set for this sub-track. To determine the latency regime of the submissions for this sub-track, we will use the same dev set as for the main track (i.e., the IWSLT26 Czech-to-English Dev Set, see above).

Organizers

  • Peter Polák (chair, Charles University)
  • Siqi Ouyang (co-chair for the Context Subtrack, Carnegie Mellon University)
  • Victor Agostinelli (Oregon State University)
  • Ondřej Bojar (Charles University)
  • Lizhong Chen (Oregon State University)
  • David Javorský (Charles University)
  • Nam Hoang Luu (Charles University)
  • Sara Papi (FBK)
  • Katsuhito Sudoh (Nara Women’s University)

Contact

Discussion: iwslt-evaluation-campaign@googlegroups.com

  • Peter Polák: [surname]@ufal.mff.cuni.cz
  • Siqi Ouyang: siqiouya@andrew.cmu.edu
  • Victor Agostinelli: agostinv@oregonstate.edu
  • Lizhong Chen: chenliz@oregonstate.edu