Video Speech Translation

Task Description

We live in a multimodal world in which we see objects, hear sounds, feel textures, smell odors, and so on. The purpose of this shared task is to open up new possibilities for multimodal machine translation by examining methods for combining video and audio sources as input to translation models. In addition to generally advancing the state of the art, our specific goals are:

  • to thoroughly investigate the challenges in translating videos and to identify promising applications
  • to create a public benchmark for the video translation task
  • to study translation errors and new approaches for evaluating video translation outputs
  • to study the performance of video translation in realistic scenarios

Similar to the WMT evaluations, there are two evaluation tracks, both focusing on the Chinese-English and English-Russian directions.

  • Constrained submission: You may only use the datasets listed in the Data section.
  • Unconstrained submission: You may also use additional datasets. If you do so, please list all the additional data sources used in your system.



  1. English: all data provided by the offline and simultaneous speech translation challenges
  2. Chinese:

All data sets from the OPUS project and the WMT evaluations are eligible. Additionally, participants can use the following data.

  1. Chinese-English:
  1. English-Russian:

Currently, no video corpora are publicly available for the focus language directions, Chinese-English and English-Russian. However, we think that the video information in the following corpora might be helpful for multimodal MT.

  1. How2: educational-domain videos from YouTube, with English-(Portuguese, German) translations
  2. VATEX: human-action videos, very short (about 10 seconds), with English-Chinese captions

Dev & Test

We will provide dev and test sets drawn from e-commerce live shows. In particular, we will provide Chinese video clips to be translated into English, and English video clips to be translated into Russian. These dev and test sets contain video, manual transcriptions, and human translations. The unseen test set will be released when the evaluation is due.

Chinese-English dev set:


- Constrained track

  • ASR: The IWSLT organizers provide an English engine, and we can provide a Kaldi-based Chinese system.
  • MT: We provide Transformer-based systems for Chinese-(English, Russian, Japanese) and English-(Russian, Vietnamese).

- Unconstrained track: Participants are encouraged to use any resources available to build video translation systems. We will provide ASR and MT outputs from online systems as baselines.


Evaluation will be carried out both automatically and manually. Automatic evaluation will use standard machine translation metrics, such as BLEU. Native speakers of each target language will manually check translation quality for a small sample of the submissions. We also expect participants to support us in the manual evaluation, in proportion to the number of submissions they make.
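
For participants who want a rough idea of how a BLEU-style score is computed on the dev set, the following is a minimal sketch only. It is not the official scoring script: it assumes a single reference per segment, plain whitespace tokenization, and no smoothing, and the function names (ngrams, corpus_bleu) are illustrative. Scores from it may therefore differ from those of the tool used in the actual evaluation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale.

    Assumptions (not taken from the task description): one reference per
    segment, whitespace tokenization, no smoothing.
    """
    matched = [0] * max_n  # clipped n-gram matches, per order
    total = [0] * max_n    # hypothesis n-gram counts, per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Counter intersection clips each hypothesis n-gram count
            # by its count in the reference.
            matched[n - 1] += sum((hyp_counts & ref_counts).values())
            total[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matched) == 0:
        return 0.0

    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty: penalize output shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    hyps = ["the cat sat on the mat"]
    refs = ["the cat sat on the red mat"]
    print(f"BLEU = {corpus_bleu(hyps, refs):.1f}")
```

Because the counts are accumulated over all segments before the precisions are computed, this is a corpus-level score rather than an average of per-segment scores. For Chinese or Russian output, whitespace splitting is a poor tokenizer, so treat this only as an illustration of the metric.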


Nguyen Bach (Alibaba)
Wei Luo (Alibaba)
Boxing Chen (Alibaba)
Fei Huang (Alibaba)