Non-Native Speech Translation
Speech recognition and translation have achieved huge improvements over the last few years, as reported in numerous scientific papers. Yet taking current models out of the box and applying them in an arbitrary practical situation often quickly leads to disillusionment. The models perform great in laboratory conditions, with studio-quality recordings. Run them on the speech of high school students at a fair and you will see error rates well above 40% or complete failures.
The goal of the Non-Native Speech Translation Task is to examine the quality of English-to-Czech and English-to-German SLT in a realistic setting of non-native spontaneous speech, in somewhat noisy conditions. The task seeks submissions that proceed along the standard two-stage pipeline (ASR+MT) as well as end-to-end solutions, ideally recovering from disfluencies of all kinds: pronunciation, vocabulary choice as well as grammar.
The automatic evaluation of the task will be carried out in multiple tracks. The two primary criteria are:
- Raw ASR quality in terms of WER against the reference transcript
- Raw translation quality, comparing the final candidate translation with one or more references.
In connection with the simultaneous speech translation challenge, we will also be able to assess:
- SLT delay, based on timestamps of words appearing after MT and automatic word alignment with the reference transcript, and
- flicker, reflecting the effort wasted in reading intermediate and later edited outputs.
Depending on the number of submissions, we may also be able to add a manual evaluation of translation quality, i.e. a human-judged counterpart to the second primary criterion (raw translation quality).
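As a rough illustration of the first criterion, WER is commonly computed as the word-level edit distance between the candidate and the reference transcript, divided by the number of reference words. The sketch below assumes plain whitespace tokenization and no normalization of case or punctuation; the task's actual scoring scripts may differ, and the function name `wer` is only illustrative.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains many more words than the reference.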
Participants can provide complete or partial solutions, starting from the non-segmented audio, the timestamped gold transcript, or our baseline ASR output. The ideal submission will include SLT output with updates, timestamped at the point when the output was emitted by MT.
As inputs, we will release unsegmented audio files. Their durations will vary from 1 minute to a couple of dozen minutes. For each such file, we expect plaintext outputs of ASR and/or MT, in the formats described below. Some parts of these outputs are mandatory; others are optional and only help to provide a more fine-grained analysis.
You can take part in ASR only, ASR followed by MT (we are interested in your ASR outputs) or joint SLT (spoken language translation, where the ASR outputs are not available at all). MT-only submissions are also possible; please contact Ondřej Bojar directly to obtain baseline ASR outputs.
You can make as many submissions as you like, but you must indicate one of the submissions as PRIMARY.
Allowed Training Data
The non-native task distinguishes between constrained and non-constrained submissions.
Constrained submissions can use only the following datasets (resources listed several times contain data relevant for multiple stages of processing):
- English ASR
  - Mozilla Common Voice; for English use version en_1488h_2019-12-10
- English→Czech translation
  - MuST-C corpus (release 1.1 contains English-Czech pair)
- English→German translation
- Multi-lingual training is welcome and counts as constrained, if limited to the above datasets.
- Important note for participants of the Offline Speech Translation Task: There is an overlap between CzEng and the Offline Speech Translation Task test set. Creating a multi-lingual system and including CzEng in training for English-German translation is thus not permitted for the Offline Speech Translation Task.
Non-constrained submissions are very welcome and can use any additional data.
A small development set will be released by the end of January 2020.
The development set will illustrate the intended domains but it may be too small for reliable measurements.
File Format of ASR Candidates
For ASR-only submissions or for ASR+MT submissions, we expect an ASR candidate file in the following format. The format is sentence-oriented, based on your custom segmentation, case-sensitive and punctuated, i.e. you should provide correct casing and typesetting of your output. (If you cannot provide segmentation, it is better to submit just one huge lowercased sentence than to give up altogether.)
Each line of the ASR file shows the output of your ASR system, which gradually grows in subsequent lines until a sentence is completed. (We use the term sentence for whatever unit most closely resembles a sentence. Usually, a sentence ends with a punctuation mark, but this is not a formal requirement.) A completed sentence needs to come as a separate line, again followed by growing partial outputs, for instance:
P 600 0 500 Good
P 800 0 650 Good mor
P 1130 0 1020 Good morning
P 1300 0 1190 Good morning how
P 1480 0 1400 Good morning. How are
P 2010 0 1950 Good morning. How are you?
C 2010 0 1020 Good morning.
P 2200 1020 2180 How are you? I
C 2200 1020 1950 How are you?
P 2450 1950 2390 I am ...
The partial (“P”) candidates allow your system to extend or revise its outputs, lowering latency at the cost of precision and higher flicker. The P segments are not considered in the evaluation of accuracy. Only the complete (“C”) segments are required. For SLT-style submissions (end-to-end speech recognition and translation), this file is not required, but please provide it if you can, because it allows for a more fine-grained evaluation.
There are three numbers (time stamps) in each line: display time, start time and end time. All times are measured in milliseconds from the start of the sound file.
Display time is the time at which the given line/sentence was produced by the ASR system. If your system is not “on-line” in any sense, you can report 0 on all lines. The start and end times indicate the span in which the respective words were uttered in the recording. If your system does not provide timestamps, again report zeros.
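The line layout described above (a P/C mark, three millisecond timestamps, then the text) can be read with a few lines of Python. This is a minimal sketch, assuming the fields are separated by whitespace; the names `Candidate` and `parse_candidate_line` are hypothetical and not part of any official tooling.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    kind: str     # "P" (partial) or "C" (complete)
    display: int  # display time in ms
    start: int    # start time in ms
    end: int      # end time in ms
    text: str     # recognized (or translated) text


def parse_candidate_line(line: str) -> Candidate:
    """Parse one line such as 'P 1300 0 1190 Good morning how'."""
    # The first four whitespace-separated fields are fixed; the rest is text.
    fields = line.rstrip("\n").split(None, 4)
    if len(fields) < 4:
        raise ValueError(f"expected a mark and three timestamps: {line!r}")
    kind, display, start, end = fields[:4]
    if kind not in ("P", "C"):
        raise ValueError(f"unknown mark {kind!r}; expected 'P' or 'C'")
    text = fields[4] if len(fields) == 5 else ""
    return Candidate(kind, int(display), int(start), int(end), text)
```

With this representation, `end - start` gives the duration of a segment and `display - end` the processing time of the system for that line.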
The minimal acceptable submission would thus contain only full sentences, each line preceded by “C 0 0 0 ”.
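For a system without on-line processing or timing information, such a minimal file can be produced directly from a list of output sentences. A small sketch, assuming the sentences are already segmented, cased and punctuated; the helper name `minimal_submission` is hypothetical.

```python
def minimal_submission(sentences):
    """Prefix every non-empty sentence with 'C 0 0 0 ' (complete, no timing)."""
    lines = []
    for sentence in sentences:
        sentence = sentence.strip()
        if sentence:  # skip empty segments
            lines.append(f"C 0 0 0 {sentence}")
    return lines
```

Writing the result to a file is then a matter of joining the returned lines with newline characters.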
The time stamps obey these rules:
- end time >= start time; the difference is the duration of the segment
- display time >= end time; the difference is the processing time of the ASR
- considering only “C” lines, the end time of the previous one generally matches the start time of the next one
- a row of “P” segments usually has the same start time, until a “C” segment with (the same) start time comes.
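Before submitting, the rules above can be checked mechanically. The sketch below treats the first two rules as errors and the last two as warnings, since the text says they hold only "generally" or "usually"; this split, the function name `check_timestamps` and the message wording are assumptions made for illustration.

```python
def check_timestamps(lines):
    """Return (line_number, severity, message) findings for a candidate file."""
    findings = []
    prev_c_end = None  # end time of the previous "C" line
    row_start = None   # start time shared by the current row of "P" lines
    for n, line in enumerate(lines, start=1):
        kind, display, start, end = line.split(None, 4)[:4]
        display, start, end = int(display), int(start), int(end)
        if end < start:
            findings.append((n, "error", "end time precedes start time"))
        if display < end:
            findings.append((n, "error", "display time precedes end time"))
        if row_start is not None and start != row_start:
            findings.append((n, "warning", "start time differs within a row"))
        if kind == "P":
            if row_start is None:
                row_start = start  # first partial line of a new row
        else:
            if prev_c_end is not None and start != prev_c_end:
                findings.append(
                    (n, "warning", "start time differs from previous C end time")
                )
            prev_c_end = end
            row_start = None  # a complete line closes the row
    return findings
```

A file with all timestamps set to zero passes these checks trivially, which is consistent with the minimal submission format.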
File Format of Machine Translation Candidates
The format of the MT output file is formally identical to that of the ASR output file:
P 600 0 500 Gut
P 800 0 650 Guten Morgen!
P 1130 0 1020 Guten Morgen!
P 1300 0 1190 Guten wie morgen
P 1480 0 1400 Guten Morgen! Wie geht es?
P 2010 0 1950 Guten Morgen! Wie geht es dir?
C 2010 0 1020 Guten Morgen!
P 2200 1020 2180 Wie geht es dir? Ich
C 2200 1020 1950 Wie geht es dir?
P 2450 1950 2390 Ich bin ...
Again, partial (“P”) candidates allow you to revise your output so far, reducing latency, and are fully optional. The complete (“C”) candidates are required, and the concatenation of all the “C” candidates must correspond exactly to the translation of the whole test document.
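The relationship between the “C” candidates and the final document translation can be sketched as follows. This assumes whitespace-separated fields and joins the segments with a single space; the joining convention and the function name `final_translation` are illustrative assumptions, not a description of the official evaluation.

```python
def final_translation(lines):
    """Concatenate the text of all complete ('C') candidates, ignoring 'P' lines."""
    parts = []
    for line in lines:
        fields = line.rstrip("\n").split(None, 4)
        # A usable line has a mark, three timestamps and some text.
        if len(fields) == 5 and fields[0] == "C":
            parts.append(fields[4].strip())
    return " ".join(parts)
```

Running this over your own MT file is a quick way to confirm that no sentence of the document is left only in partial form.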
Timestamps have the same roles: display time, start time, end time. Display time is the time when the translation was produced by the MT system. Start time and end time indicate the span in the source language speech when the source of this segment was uttered.
If your translation system were truly instant, you could keep the P/C marks and timestamps exactly as in the ASR output file. Because it is not instant, the display times will be higher in the MT output than in the ASR output, but the start and end times will very likely be identical.
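When both files are available, the extra delay introduced by MT can be estimated by pairing “C” segments that share the same start and end times and subtracting the display times. This is only a sketch for inspecting your own outputs under that pairing assumption; it is not the delay measure used in the evaluation, and the name `translation_lag` is hypothetical.

```python
def translation_lag(asr_lines, mt_lines):
    """Map (start, end) of each shared C segment to MT minus ASR display time."""

    def complete_segments(lines):
        segments = {}
        for line in lines:
            kind, display, start, end = line.split(None, 4)[:4]
            if kind == "C":
                segments[(int(start), int(end))] = int(display)
        return segments

    asr = complete_segments(asr_lines)
    mt = complete_segments(mt_lines)
    # Only spans present in both files can be compared.
    return {span: mt[span] - asr[span] for span in asr if span in mt}
```

Segments whose spans differ between the two files are silently skipped, so an empty result suggests that the MT step changed the segmentation.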
If your system does not support any live processing, you can set the timestamps to zero. Again, “C 0 0 0 ” is the minimal acceptable prefix for every line, indicating that you do not provide any partial outputs and timing information.
Chair: Ondřej Bojar (Charles University, Czech Republic)
Ebrahim Ansari (Charles University, Czech Republic)