Description

Simultaneous translation (also known as real-time or streaming translation) is the task of generating a translation incrementally, given only partial input. It enables applications such as automatic simultaneous interpretation and live translation at international conferences. Simultaneous systems are typically evaluated with respect to both translation quality and latency.
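
To make the incremental read/write behavior concrete, the following toy Python sketch implements a wait-k policy, in which the system stays k source tokens ahead of its output before committing each target token. The translate_prefix function is a hypothetical stand-in for any incremental MT model and is not part of any toolkit.

    EOS = "</s>"

    def waitk_translate(source_stream, translate_prefix, k=3):
        """Toy wait-k policy: stay k source tokens ahead before each write.

        source_stream: iterator yielding source tokens as they arrive.
        translate_prefix(src_tokens, num_target): hypothetical model call
            returning the first num_target target tokens for the given
            source prefix, ending with EOS once the translation is complete.
        Yields target tokens as soon as they are committed.
        """
        source, target = [], []
        source_done = False
        stream = iter(source_stream)
        while True:
            # READ: consume source tokens until we are k ahead of the output,
            # or the source is exhausted.
            while not source_done and len(source) - len(target) < k:
                try:
                    source.append(next(stream))
                except StopIteration:
                    source_done = True
            # WRITE: commit one target token conditioned on the partial source only.
            next_token = translate_prefix(source, len(target) + 1)[-1]
            if next_token == EOS:
                break
            target.append(next_token)
            yield next_token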

There will be two tracks:

  • Text-to-Text: simultaneously translating text in the source language into text in the target language
  • Speech-to-Text: simultaneously translating speech in the source language into text in the target language

Both tracks will be offered in the following language directions (more details will be made available soon):

  • English -> German
  • English -> Arabic (new)
  • English -> Chinese
  • English -> Japanese
  • Czech -> English

We have two focuses this year:

  • long-form speech: our evaluation will be conducted on unsegmented speech
  • large language models: participants are allowed to use LLMs (details will be announced later)

The test set domains are subsets of those used in the offline track:

  • English -> German: ACL 60/60 and accent challenge data
  • English -> Arabic: business news
  • English -> Chinese: ACL 60/60
  • English -> Japanese: ACL 60/60
  • Czech -> English: to be announced

Data

The data condition for this task is “constrained with large language models (LLMs)”.

English-to-X

The English-to-X data condition follows that of the offline task. The list is available here.

Czech-to-English

Details will be available later.

Baselines

Baselines will be provided later, including automatic speech segmentation for long-form speech.

Submission

The evaluation will use the latest version of the SimulEval toolkit. Participants have two submission options:

  • Docker image submission: the organizers will run the system, so computation-aware latency can be compared across systems
  • System log submission: computation-aware latency cannot be compared directly across systems, but it will be reported together with the hardware used

Details will be provided later.
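
For orientation, the sketch below shows what a minimal SimulEval text-to-text agent looks like, adapted from the toolkit's public example. Exact class and attribute names may differ in the SimulEval version used for the evaluation, and the prediction step is a placeholder for a real model call.

    from simuleval.utils import entrypoint
    from simuleval.agents import TextToTextAgent
    from simuleval.agents.actions import ReadAction, WriteAction

    @entrypoint
    class WaitkTextAgent(TextToTextAgent):
        """Dummy wait-k agent: read waitk source tokens ahead before each write."""

        waitk = 3

        def policy(self):
            # How far the consumed source currently runs ahead of the emitted target.
            lagging = len(self.states.source) - len(self.states.target)
            if lagging >= self.waitk or self.states.source_finished:
                # Placeholder prediction; a real system would query its MT model here.
                prediction = f"token_{len(self.states.target)}"
                return WriteAction(prediction, finished=self.states.source_finished)
            return ReadAction()

Such an agent is typically run with a command along the lines of simuleval --agent agent.py --source source.txt --target reference.txt; consult the SimulEval documentation for the exact invocation.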

Evaluation

Metrics

The system’s performance will be evaluated in two ways:

  • Translation quality metrics:
    • BLEU
    • Additional results using neural metrics (COMET, BLEURT, …)
  • Translation latency:
    • Average Lagging
    • Length Adaptive Average Lagging
    • Average Token Delay

For latency measurement, we will contrast computation-aware and non-computation-aware latency metrics. See the SimulEval description for how these metrics are defined. Note that the definition of Average Lagging has been modified from its original definition (see Section 3.2 of the SimulEval description).
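
As a reference point, the sketch below computes the original Average Lagging of Ma et al. (2019) from per-token delays; the modified definition used by SimulEval differs as noted above.

    def average_lagging(delays, src_len, tgt_len):
        """Original Average Lagging (Ma et al., 2019).

        delays[i]: number of source tokens read before target token i+1 was written.
        src_len:   total number of source tokens.
        tgt_len:   total number of target tokens.
        """
        gamma = tgt_len / src_len  # target-to-source length ratio
        # tau: 1-indexed position of the first target token written after
        # the entire source had been read.
        tau = next((i + 1 for i, d in enumerate(delays) if d >= src_len), tgt_len)
        return sum(delays[i] - i / gamma for i in range(tau)) / tau

For example, a wait-3 system that reads a 6-token source and writes 6 target tokens has delays [3, 4, 5, 6, 6, 6], giving an Average Lagging of 3 source tokens.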

Ranking

Systems will be ranked by translation quality within a latency constraint. The detailed constraint for each track will be announced later.

Organizers

  • Victor Agostinelli (Oregon State University)
  • Lizhong Chen (Oregon State University)
  • Sara Papi (FBK)
  • Peter Polák (Charles University)
  • Katsuhito Sudoh (Nara Women’s University)

Contact

Chair(s): Katsuhito Sudoh (Nara Women’s University)

Discussion: iwslt-evaluation-campaign@googlegroups.com