=== Task Description ===
  
Simultaneous machine translation has become an increasingly popular topic in recent years. In particular, **simultaneous speech translation (SST)** enables interesting applications such as subtitle translation for a live event or real-time video-call translation. The goal of this challenge is to examine systems for translating audio in the source language into text in the target language, with consideration of both translation quality and latency.
  
We encourage participants to submit systems based on either **cascaded (ASR + MT)** or **end-to-end** approaches. This year, participants will be evaluated on translating TED talks from **English into German**. They will be given two parallel tracks to enter:
We encourage participants to enter both tracks when possible.
  
Evaluating a simultaneous system is not trivial, as we cannot release the test data the way offline translation tasks do. Instead, participants will be required to implement a provided API to read the input and write the translation, and to upload their system as a Docker file so that it can be evaluated under controlled conditions. We provide an [[https://github.com/pytorch/fairseq/tree/simulastsharedtask/examples/simultaneous_translation|example implementation and a baseline system]].
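
The exact interface is specified in the example implementation linked above. As a purely illustrative sketch (the class name ''WaitKTextAgent'' and the ''policy''/''predict'' methods below are hypothetical, not the actual API), a simultaneous agent alternates between reading more source input and writing target words, for example with a wait-k policy:

<code python>
# Hypothetical sketch of a simultaneous agent; the real interface is defined
# in the example implementation linked above.

READ_ACTION = "read"
WRITE_ACTION = "write"

class WaitKTextAgent:
    """Wait-k policy: read k source words first, then alternate write and read."""

    def __init__(self, k=3):
        self.k = k

    def policy(self, num_source_read, num_target_written, source_finished):
        # Keep reading until the source is k words ahead of the output;
        # once the source is finished, only writing remains.
        if not source_finished and num_source_read - num_target_written < self.k:
            return READ_ACTION
        return WRITE_ACTION

    def predict(self, source_prefix):
        # Placeholder for a call to an actual incremental translation model.
        return "<next-target-word>"
</code>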
  
The system's performance will be evaluated in two ways:
  
  * **Translation quality**: we will use multiple standard metrics: BLEU, TER, and METEOR.
  * **Translation latency**: we will make use of recently developed metrics for simultaneous machine translation, including average proportion (AP), average lagging (AL) and differentiable average lagging (DAL); a reference sketch of these metrics is given below.
  
In addition, we will report timestamps for informational purposes.
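
As a reference, AP, AL and DAL can be computed from the per-token delays g(t), i.e. the amount of source (words for text, milliseconds for speech) consumed when target token t is emitted. The following is a minimal sketch following their published definitions, not the official evaluation script; the names ''delays'' and ''latency_metrics'' are ours.

<code python>
# Minimal sketch of AP, AL and DAL from per-token delays; the official scoring
# uses the scripts shipped with the example implementation.

def latency_metrics(delays, src_len):
    """delays[t] = source consumed (words or ms) when target token t+1 is emitted."""
    tgt_len = len(delays)
    gamma = tgt_len / src_len  # target-to-source length ratio

    # Average Proportion (AP): mean proportion of source consumed per target token.
    ap = sum(delays) / (src_len * tgt_len)

    # Average Lagging (AL): average lag behind an ideal simultaneous system,
    # summed up to the first token emitted after the whole source was read.
    tau = next((t + 1 for t, d in enumerate(delays) if d >= src_len), tgt_len)
    al = sum(delays[t] - t / gamma for t in range(tau)) / tau

    # Differentiable Average Lagging (DAL): like AL, but every token counts and
    # each one is charged a minimum delay of 1/gamma after the previous one.
    d_prev, dal = 0.0, 0.0
    for t, d in enumerate(delays):
        d_prev = d if t == 0 else max(d, d_prev + 1 / gamma)
        dal += d_prev - t / gamma
    return ap, al, dal / tgt_len
</code>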
  
=== Training and Development Data ===
  
=== Evaluation ===
We will evaluate translation quality with detokenized BLEU and latency with [[https://arxiv.org/abs/1906.00048|AP, AL and DAL]]. Systems will be ranked by translation quality within different latency regimes. Three regimes (low, medium and high) will be evaluated. Each regime is determined by a maximum latency threshold. The thresholds are defined in terms of AL, which represents the delay relative to a perfect real-time system (in milliseconds for speech and in number of words for text), but all three latency metrics (AL, DAL and AP) will be reported. Based on an analysis of the quality-latency tradeoffs of the baseline systems, the thresholds are set as follows:
  
Speech Translation (AL in milliseconds):
  * Low latency: AL <= 1000
  * Medium latency: AL <= 2000
  * High latency: AL <= 4000
Text Translation (AL in words):
  * Low latency: AL <= 3
  * Medium latency: AL <= 6
  * High latency: AL <= 15
The submitted systems will be categorized into the different regimes based on the AL calculated on the MuST-C English-German test set, while the translation quality will be calculated on the blind test set. We require participants to submit at least one system for each latency regime. Participants are encouraged to submit multiple systems for each regime in order to provide more data points for latency-quality tradeoff analyses. If multiple systems are submitted, we will keep the one with the best translation quality for ranking. Besides the three latency metrics, we will also calculate the total decoding time under the server-client evaluation scheme for each system.
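
For illustration only, a submitted system could be mapped to its latency regime from the AL thresholds above as in the following sketch (the function and constant names are ours, not part of the evaluation tooling):

<code python>
# Sketch: assign a latency regime from an AL score, using the thresholds above
# (AL is measured in milliseconds for speech and in words for text).

SPEECH_THRESHOLDS = [("low", 1000), ("medium", 2000), ("high", 4000)]
TEXT_THRESHOLDS = [("low", 3), ("medium", 6), ("high", 15)]

def latency_regime(al, track="speech"):
    thresholds = SPEECH_THRESHOLDS if track == "speech" else TEXT_THRESHOLDS
    for name, max_al in thresholds:
        if al <= max_al:
            return name
    return None  # exceeds the high-latency threshold
</code>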
  
=== Submission Guidelines ===
  
Participants need to submit their systems as a Docker file along with the necessary model files. We provide [[https://github.com/pytorch/fairseq/blob/simulastsharedtask/examples/simultaneous_translation/docs/baseline.md#final-evaluation-with-docker|here]] an example Dockerfile together with the baseline system.

**Please pack and upload your Docker file and model files through this [[https://www.dropbox.com/request/vIadDRsH1LJkBDWGgCbE|link]]**. Please prefix your files with a meaningful institution name.

=== Results ===

The results of the shared task are now available! You can now [[https://dl.fbaipublicfiles.com/simultaneous_translation/results.tsv|download a tsv file]] with all systems, configs and metrics for tst-COMMON and for the blind set. You can also [[https://dl.fbaipublicfiles.com/simultaneous_translation/logs.tgz|download the submitted system logs]] to verify any result and for further analysis.
  
=== Cloud Credits Application ===