Description

Note: this is a draft, and feedback on the task setup is welcome, either via iwslt-evaluation-campaign@googlegroups.com or on Twitter (@iwslt).

Simultaneous translation (also known as real-time or streaming translation) is the task of generating translations incrementally given partial input only. Simultaneous translation enables interesting applications such as automatic simultaneous interpretation or international conference translations. Simultaneous systems are typically evaluated with respect to quality and latency. This year, we will have 2 tracks and 3 language pairs:

  • Text-to-Text: translating the output of a streaming ASR system in real-time from English to German, English to Japanese, and English to Mandarin Chinese.
  • Speech-to-Text: translating speech into text in real-time from English to German, English to Japanese, and English to Mandarin Chinese.

We want to highlight the differences with respect to the last edition:

  • for the text-to-text track, we will use the output of a streaming ASR system as input instead of the gold transcript. As a result, both text-to-text and speech-to-text systems will be ranked together for a given language pair.
  • we are adding Mandarin Chinese as a target language.
  • we are adding an experimental manual evaluation for English-to-German real-time translation.
  • we are adding a human interpretation benchmark for English-to-German speech translation.
  • in order to reduce the number of conditions, we use only segmented input; the manual evaluation will run on reconstructed full documents.

We encourage participants to enter all tracks when possible. We also encourage participants to contrast cascaded and end-to-end solutions for the Speech-to-Text track.

Evaluation

This year, we will use an automatic evaluation very similar to last year's, and we will trial a manual evaluation for the English-to-German track.

Automatic Evaluation

We will use an evaluation system very similar to last year's. System performance will be evaluated in two ways:

  • Translation quality: we will use multiple standard metrics: BLEU, TER, and METEOR.
  • Translation latency: we will use standard metrics for simultaneous machine translation including average proportion (AP), average lagging (AL) and differentiable average lagging (DAL).

Like last year, the evaluation implementation will use the SimulEval toolkit. For latency measurement, we will contrast computation-aware and non-computation-aware latency metrics. See the SimulEval description for how those metrics are defined. Note that the definition of average lagging has been modified from the original definition (see section 3.2 in the SimulEval description). Latency is calculated at the word level for En-De systems and at the character level for En-Ja and En-Zh systems.
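To make the three latency metrics concrete, here is a minimal sketch for the text case, not the SimulEval implementation. It assumes that `delays[i]` is the number of source tokens read when target token i is written, and it follows the commonly published definitions of AP, AL and DAL. The function names and the optional `ref_len` argument are hypothetical; `ref_len` is meant to illustrate the kind of modification mentioned above (using the reference length rather than the hypothesis length in the AL ratio), and the authoritative definitions remain those in the SimulEval description.

```python
from typing import Optional, Sequence


def average_proportion(delays: Sequence[float], src_len: float) -> float:
    """AP: sum of delays normalised by source length times target length."""
    return sum(delays) / (src_len * len(delays))


def average_lagging(
    delays: Sequence[float], src_len: float, ref_len: Optional[int] = None
) -> float:
    """AL: average lag behind an ideal system, up to the first target token
    written after the full source has been read.

    Assumption: if ref_len is given, it replaces the hypothesis length when
    computing the target-to-source ratio gamma.
    """
    tgt_len = ref_len if ref_len is not None else len(delays)
    gamma = tgt_len / src_len
    total, tau = 0.0, 0
    for i, d in enumerate(delays):  # i is 0-based, i.e. (i - 1) in the 1-based formula
        total += d - i / gamma
        tau += 1
        if d >= src_len:  # the whole source has been read: stop here
            break
    return total / tau


def differentiable_average_lagging(delays: Sequence[float], src_len: float) -> float:
    """DAL: like AL, but each delay is forced to grow by at least 1/gamma
    per target token, and the average runs over all target tokens."""
    gamma = len(delays) / src_len
    total, prev = 0.0, None
    for i, d in enumerate(delays):
        adjusted = d if prev is None else max(d, prev + 1 / gamma)
        total += adjusted - i / gamma
        prev = adjusted
    return total / len(delays)


# A wait-3 policy on a 6-token source producing 6 target tokens.
wait3_delays = [3, 4, 5, 6, 6, 6]
print(average_lagging(wait3_delays, src_len=6))                 # 3.0
print(differentiable_average_lagging(wait3_delays, src_len=6))  # 3.0
print(average_proportion(wait3_delays, src_len=6))              # 30/36
```

For this wait-3 example both AL and DAL equal 3, which matches the intuition that the system lags three tokens behind the source. The functions assume a non-empty list of delays.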

Participants will submit a Docker image (see below for an example), and the organizers will run the image in a controlled environment, specifically a p3.2xlarge AWS instance (see details at https://aws.amazon.com/ec2/instance-types/p3/).

Ranking for Automatic Evaluation

We will evaluate translation quality with detokenized BLEU and latency with AP, AL and DAL. Systems will be ranked by translation quality within different latency regimes. Three regimes (low, medium and high) will be evaluated, each defined by a maximum latency threshold. The thresholds are expressed in AL, which measures the delay relative to a perfect real-time system (in milliseconds for speech and in number of words for text), but all three latency metrics (AL, DAL and AP) will be reported. Based on an analysis of the quality-latency tradeoffs of the baseline systems, the thresholds are set as follows:

Speech Translation (English-German):

  • Low Latency: AL <= 1000
  • Medium Latency: AL <= 2000
  • High Latency: AL <= 4000

Speech Translation (English-Mandarin):

  • Low Latency: AL <= TBD
  • Medium Latency: AL <= TBD
  • High Latency: AL <= TBD

Text Translation (English-German):

  • Low Latency: AL <= 3
  • Medium Latency: AL <= 6
  • High Latency: AL <= 15

Text Translation (English-Japanese):

  • Low Latency: AL <= 8
  • Medium Latency: AL <= 12
  • High Latency: AL <= 16

Text Translation (English-Mandarin):

  • Low Latency: AL <= TBD
  • Medium Latency: AL <= TBD
  • High Latency: AL <= TBD

The submitted systems will be categorized into regimes based on the AL calculated on the MuST-C test sets (tst-COMMON) for English-German and English-Mandarin, or on the IWSLT21 dev set for English-Japanese, while translation quality will be calculated on the blind test set. We require participants to submit at least one system for each latency regime. Participants are encouraged to submit multiple systems for each regime in order to provide more data points for latency-quality tradeoff analyses. If multiple systems are submitted for a regime, we will keep the one with the best translation quality for ranking. In addition, within each latency regime, we will also measure computation-aware AL and rank systems accordingly. Finally, we will report latency-quality trade-off curves for non-computation-aware AL and for computation-aware AL in the findings paper.
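The categorization and selection described above can be sketched as follows. This is an illustration only, not the organizers' ranking script: the function names and the example systems are hypothetical, the thresholds still marked TBD are left out, and the sketch assumes that each system is assigned to the tightest regime whose threshold its AL satisfies.

```python
from typing import Dict, List, Optional, Tuple

# Thresholds copied from the lists above (TBD entries omitted).
THRESHOLDS: Dict[Tuple[str, str], Dict[str, float]] = {
    ("speech", "en-de"): {"low": 1000, "medium": 2000, "high": 4000},  # milliseconds
    ("text", "en-de"): {"low": 3, "medium": 6, "high": 15},
    ("text", "en-ja"): {"low": 8, "medium": 12, "high": 16},
}


def latency_regime(al: float, thresholds: Dict[str, float]) -> Optional[str]:
    """Return the tightest regime whose AL threshold is met, or None."""
    for name in ("low", "medium", "high"):
        if al <= thresholds[name]:
            return name
    return None


def best_per_regime(
    systems: List[Tuple[str, float, float]], thresholds: Dict[str, float]
) -> Dict[str, Tuple[str, float]]:
    """Keep, for each regime, the system with the highest BLEU.

    Each system is (name, AL measured on the public set, BLEU on the blind set).
    """
    best: Dict[str, Tuple[str, float]] = {}
    for name, al, bleu in systems:
        regime = latency_regime(al, thresholds)
        if regime is None:
            continue  # slower than the high-latency threshold
        if regime not in best or bleu > best[regime][1]:
            best[regime] = (name, bleu)
    return best


# Hypothetical submissions for English-German speech translation.
submissions = [("sys_a", 900.0, 12.0), ("sys_b", 950.0, 13.5), ("sys_c", 1800.0, 15.0)]
print(best_per_regime(submissions, THRESHOLDS[("speech", "en-de")]))
```

In this made-up example, `sys_a` and `sys_b` both fall into the low-latency regime and only `sys_b`, the one with the higher BLEU, is kept for ranking there.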

Note that we will use release v2.0 of MuST-C for English-German and release v1.2 of MuST-C for English-Mandarin.

Manual Evaluation

The English-to-German track will include a manual evaluation of simultaneous speech translation for at least one variant of the submitted system from each participating team (selected by the team).

The evaluation will consist of playing the source sound/video with live text captions to evaluators who are fluent in the source language (English) and native speakers of the target language (German), and collecting a “continuous ranking”. This method is described in Section 3.1.1 (page 22) of the master's thesis by Dávid Javorský.

As a benchmark, human interpretations, presented in the exact same form of live text captions, will be scored in the same setting.

Training and Development Data

English-German, English-Mandarin

You may use the same training and development data available for the Offline Speech Translation task. Specifically, please refer to the Allowed Training Data and the Past Editions Development Data sections.

English-Japanese

We provide a version of MuST-C prepared for this shared task (MuST-C v2.0) for training and development. For training, you may also use the parallel and monolingual data available for the English-Japanese WMT20 news task.

Baseline Implementation and Example

English-to-German Speech-to-Text Translation

You can find a baseline and instructions on how to reproduce it here. Our final evaluation will be run inside Docker. To run an evaluation with Docker, first build a Docker image from the Dockerfile. Here is an example Dockerfile for the baseline:

FROM ubuntu:20.04

LABEL maintainer="Juan Pino (juancarabina@fb.com)"

# System packages needed to build and run fairseq and SimulEval
RUN apt-get update && apt-get install -y build-essential git python3 python3-pip libsndfile1
RUN git clone https://github.com/pytorch/fairseq.git /fairseq
RUN pip3 install torch torchaudio vizseq soundfile sentencepiece sacrebleu=="1.5.1"
WORKDIR /fairseq
RUN pip3 install -e .
RUN git clone https://github.com/facebookresearch/SimulEval.git /SimulEval
WORKDIR /SimulEval
RUN pip3 install -e .
# Make `python` resolve to python3
RUN ln -s /usr/bin/python3 /usr/bin/python

# Shell-form ENTRYPOINT so that the variables passed with `docker run -e` are expanded at run time
ENTRYPOINT simuleval \
    --agent $AGENT \
    --source $SRC_FILE \
    --target $TGT_FILE \
    --output $OUTPUT \
    --scores $EXTRA_AGENT_ARGS

Assuming your current directory contains the Dockerfile above, you can run the following commands to run the baseline evaluation inside Docker:

FAIRSEQ=<PATH TO fairseq>
AGENT=fairseq/examples/speech_to_text/simultaneous_translation/agents/fairseq_simul_st_agent.py
WORKDIR=<WORKING DIRECTORY> # `input` contains source/target files and wav files, `models` contains databin and the checkpoint
SRC_FILE=input/dev.wav_list
TGT_FILE=input/dev.de
DATA=models/databin
CONFIG_YAML=config_st.yaml
MODEL=models/convtransformer_wait5_pre7
EXTRA_AGENT_ARGS="--data-bin $DATA --config $CONFIG_YAML --model-path $MODEL"
OUTPUT=output

docker build -t iwslt2021_simulst_baseline:latest .
docker run -e AGENT="$AGENT" -e SRC_FILE="$SRC_FILE" -e TGT_FILE="$TGT_FILE" -e EXTRA_AGENT_ARGS="$EXTRA_AGENT_ARGS" -e OUTPUT="$OUTPUT" -v $FAIRSEQ:/SimulEval/fairseq -v $WORKDIR/input:/SimulEval/input -v $WORKDIR/models:/SimulEval/models -it iwslt2021_simulst_baseline

If you encounter a bus error similar to this issue, you can try adding --shm-size 8G to the docker run command.

When submitting your system, please make sure it works for the MuST-C dev and test sets. During the official evaluation, we will run the submitted system with the blind set.

English-to-Japanese Text-to-Text Translation

A baseline will be provided later in January 2022.

System Submission

Participants are required to run the evaluation on the English-German dev and tst-COMMON MuST-C sets for the English-German and English-Mandarin tracks, and on the IWSLT21 dev set (TODO: missing speech) for the English-Japanese track, and to report the results as part of the submission. This ensures that the submitted systems work, so that the organizers can run them as well. The submission files should be packed into a zip or tar.gz file and uploaded to Dropbox prior to the deadline (TBD, anywhere on earth). The submission should include instructions on how to run the system in a README or README.md file, as well as all the files the organizers need to run the system (Docker image, checkpoints, vocabulary, etc.).
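As a convenience, the packaging step can be scripted. The sketch below is one possible way to do it with the Python standard library and is not an official submission tool: the function name `pack_submission` and the directory and archive names are hypothetical. It only checks that a README or README.md is present and then writes a tar.gz archive; it does not verify that the Docker image, checkpoints or vocabulary are included.

```python
import tarfile
from pathlib import Path


def pack_submission(submission_dir: str, archive_path: str) -> Path:
    """Pack a submission directory into a .tar.gz archive.

    Raises FileNotFoundError if the directory has no README or README.md.
    The archive should be written outside the submission directory.
    """
    root = Path(submission_dir)
    if not any((root / name).is_file() for name in ("README", "README.md")):
        raise FileNotFoundError(
            "the submission needs a README or README.md with run instructions"
        )
    archive = Path(archive_path)
    with tarfile.open(archive, "w:gz") as tar:
        # Store files under the directory name rather than the absolute path.
        tar.add(root, arcname=root.name)
    return archive


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python pack_submission.py my_system my_system.tar.gz
    if len(sys.argv) == 3:
        print(pack_submission(sys.argv[1], sys.argv[2]))
```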

Contacts

Discussion: iwslt-evaluation-campaign@googlegroups.com

Organizers

  • Katsuhito Sudoh (NAIST)
  • Satoshi Nakamura (NAIST)
  • Ondřej Bojar, Věra Kloudová, Dávid Javorský (Charles University)
  • Barry Haddow (University of Edinburgh)
  • Jiatong Shi (CMU)
  • Shinji Watanabe (CMU)
  • Xutai Ma (Johns Hopkins University, Meta)
  • Maha Elbayad (Meta)
  • Changhan Wang (Meta)
  • Hongyu Gong (Meta)
  • Juan Pino (Meta)