Simultaneous Speech Translation

Task Description

Simultaneous machine translation has become an increasingly popular topic in recent years. In particular, simultaneous speech translation (SST) enables interesting applications such as subtitle translation for a live event or real-time video-call translation. The goal of this challenge is to examine systems that translate speech in the source language into text in the target language, taking into account both translation quality and latency.

We encourage participants to submit systems based on either cascaded (ASR + MT) or end-to-end approaches. This year, participants will be evaluated on translating TED talks from English into German. Two parallel tracks are offered:

  • Text-to-Text: translating ground-truth transcripts in real time.
  • Speech-to-Text: directly translating speech into text in real time.

We encourage participants to enter both tracks when possible.

Evaluating a simultaneous system is not trivial: unlike offline translation tasks, we cannot simply release the test data. Instead, participants will be required to implement a provided API to read the input and write the translation, and to upload their system as a Docker image so that it can be evaluated under controlled conditions. We will provide an example implementation, which will also serve as a baseline system.
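The concrete API will be distributed together with the example implementation. As a rough, purely hypothetical sketch of the kind of interface a client might have to implement (none of the names below are the actual API), a system essentially has to answer three questions: how to consume new source input, when to read versus write, and what to emit next:

```python
# Hypothetical sketch only: the real API is defined by the provided example
# implementation, and its actual class and method names may differ.

class SimultaneousAgent:
    """A client that incrementally reads source input and emits translation."""

    def receive(self, segment):
        """Called whenever a new source segment becomes available
        (a transcript token in the text track, or an audio chunk in the
        speech track)."""
        raise NotImplementedError

    def decide(self):
        """Return 'READ' to request more source input, or 'WRITE' to emit
        the next portion of the translation."""
        raise NotImplementedError

    def emit(self):
        """Return the next portion of the translation to send back to the
        evaluation server."""
        raise NotImplementedError
```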

The system's performance will be evaluated in two ways:

  • Translation quality: we will use several standard metrics: BLEU, TER, and chrF.
  • Translation latency: we will use recently developed metrics for simultaneous machine translation, including average proportion (AP), average lagging (AL), and differentiable average lagging (DAL).

In addition, we will report timestamps for informational purposes. We will provide an example of computing these metrics together with an example Docker image.
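For orientation, the sketch below computes AP, AL, and DAL at the token level from the delays g(t), i.e., the number of source tokens read before target token t is emitted, following the published definitions; the official evaluation script may differ in detail, e.g., by measuring delays in time rather than tokens for the speech track.

```python
def average_proportion(delays, src_len, tgt_len):
    """AP (Cho & Esipova, 2016): mean delay normalized by source and target
    length. delays[t] is the number of source tokens read before emitting
    target token t (0-indexed here)."""
    return sum(delays) / (src_len * tgt_len)


def average_lagging(delays, src_len, tgt_len):
    """AL (Ma et al., 2019): average number of source tokens the system lags
    behind an ideal wait-0 translator, averaged up to the first target token
    emitted after the full source has been read."""
    gamma = tgt_len / src_len
    # tau = 1-based index of the first target token emitted with the full
    # source read (fall back to the last token if that never happens).
    tau = next((t for t, g in enumerate(delays) if g >= src_len),
               len(delays) - 1) + 1
    return sum(delays[t] - t / gamma for t in range(tau)) / tau


def differentiable_average_lagging(delays, src_len, tgt_len):
    """DAL (Cherry & Foster, 2019): like AL, but each delay is forced to be
    at least 1/gamma larger than the adjusted previous one, and the average
    runs over all target tokens."""
    gamma = tgt_len / src_len
    adjusted = []
    for t, g in enumerate(delays):
        if t == 0:
            adjusted.append(g)
        else:
            adjusted.append(max(g, adjusted[-1] + 1 / gamma))
    return sum(gp - t / gamma for t, gp in enumerate(adjusted)) / tgt_len
```

As a sanity check, for a wait-k policy with equal source and target lengths (e.g., delays [3, 4, 5, 6, 6, 6] with src_len = tgt_len = 6), both AL and DAL evaluate to k = 3.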

Training and Development Data

You may use the same training and development data available for the Offline Speech Translation task. Specifically, please refer to the “Allowed Training Data” and the “Past Editions Development Data” sections.

Evaluation Server

In this shared task, we provide an evaluation server that reads the raw data and sends the source input to the client incrementally. Participants are required to adapt their model to work as a client, using our provided API, in order to receive inputs from the server and return translations as they are produced.
An evaluation script running on the server will automatically compute the system's performance in terms of both quality and latency. You can refer to this directory for the client/server implementation.

Here, we provide an example of using the evaluation script, as well as a simple baseline implemented in Fairseq.
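As an illustration only (this is not the provided Fairseq baseline), the sketch below plugs a simple wait-k policy into the hypothetical client interface sketched earlier: read k source segments first, then alternate between emitting one target token and reading one more segment. The call `translate_prefix` is a made-up stand-in for whatever incremental decoding routine a participant's model exposes.

```python
# Illustrative wait-k client; SimultaneousAgent, translate_prefix, and the
# READ/WRITE protocol below are hypothetical stand-ins for the actual API.

class WaitKAgent(SimultaneousAgent):
    def __init__(self, k, model):
        self.k = k
        self.model = model            # any incrementally decodable MT model
        self.source = []              # source segments received so far
        self.target = []              # target tokens emitted so far
        self.source_finished = False

    def receive(self, segment):
        if segment is None:           # assume None signals end of source
            self.source_finished = True
        else:
            self.source.append(segment)

    def decide(self):
        # Read until k source segments are available (or the source ended),
        # then emit one target token per additional source segment.
        if not self.source_finished and len(self.source) - len(self.target) < self.k:
            return "READ"
        return "WRITE"

    def emit(self):
        # Decode one more target token given the current source prefix and
        # the target tokens produced so far.
        token = self.model.translate_prefix(self.source, self.target)
        self.target.append(token)
        return token
```

A real client would additionally handle end-of-translation detection and, in the speech track, segmentation of the audio stream; both are omitted here for brevity.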

Evaluation

We will report BLEU, AP, AL, and DAL. Systems will be ranked by BLEU score within several maximum latency thresholds (for example, all systems with a DAL below a given threshold will be ranked together) and by latency within several BLEU thresholds. We encourage participants to submit multiple systems covering different latency/quality regimes.

Submission Guidelines

This year, participants need to submit their systems as Docker images that contain both the model and the evaluation script mentioned above. We will soon provide a short tutorial on packing everything into a Docker image.

Please pack your system and upload the Docker image through this link.

Cloud Credits Application

Update: applications are closed as of January 31, 2020. Participants in this task may have access to a limited amount of cloud credits to train their systems. Please apply by filling out this very short form.

Contacts

Chair: Jiatao Gu (Facebook, USA)
Discussion: iwslt-evaluation-campaign@googlegroups.com

Organizers

  • Jiatao Gu (Facebook)
  • Juan Pino (Facebook)
  • Changhan Wang (Facebook)
  • Xutai Ma (JHU)
  • Fahim Dalvi (QCRI)
  • Nadir Durrani (QCRI)