The isometric translation task is about generating translations that are similar in length to the source. Despite fast-paced progress in translation quality, generating isometric translations is a relatively new problem in machine translation (MT) research and application. Isometric translation applies to a wide range of real-world settings, such as automatic dubbing (to achieve synchrony between source and target speech), subtitling (to fit the video frame), simultaneous speech translation (to control the reading or listening effort), and layout-constrained translation (e.g. a document table or database field). Hence, building MT models that generate increasingly isometric translations while maintaining translation quality can have a far-reaching impact across diverse MT use cases.

Example translations from an English source into German, produced by a Baseline and an Isometric MT model, show how the latter better fits the source length template while preserving meaning.

Source: It is actually the true integration of the man and the machine.  
Baseline MT: Es ist tatsächlich die wahre Integration von Mensch und Maschine.  
Isometric MT: Es ist die wirkliche Integration von Mensch und Maschine.  

Source: But still it was a real footrace against the other volunteers to get to the captain in charge to find out what our assignments would be.  
Baseline MT: Aber es war trotzdem ein echtes Rennen gegen die anderen Freiwilligen, um zum verantwortlichen Kapitän zu kommen, um herauszufinden, was unsere Aufgaben sein würden.  
Isometric MT: Aber es war ein Wettlauf gegen die anderen Freiwilligen, um zum verantwortlichen Kapitän zu kommen, um unsere Aufgaben herauszufinden.  

Language Pairs

For this first isometric translation task, we consider a text-to-text translation track. We focus on three language directions:

  1. English-French (En-Fr)
  2. English-German (En-De)
  3. English-Spanish (En-Es)

These language pairs exhibit different target-to-source length ratios in character count. For instance, in the MuST-C training data the target-to-source length ratio is 1.11 for En-Fr, 1.12 for En-De and 1.04 for En-Es. This spread makes the three pairs well suited to assessing how well proposed isometric MT approaches generalize. Participants are encouraged to evaluate their approaches on all language pairs.
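The target-to-source length ratio mentioned above can be estimated with a few lines of Python. This is only a sketch: the function name `length_ratio` is hypothetical, and it assumes a corpus-level ratio (total target characters over total source characters). The task description does not say whether the reported figures are corpus-level or averaged per sentence.

```python
from typing import Sequence


def length_ratio(sources: Sequence[str], targets: Sequence[str]) -> float:
    """Corpus-level target-to-source length ratio in characters.

    Assumption: characters are counted as Unicode code points (Python len()),
    and the ratio is total target length divided by total source length.
    """
    if len(sources) != len(targets):
        raise ValueError("sources and targets must be parallel")
    src_chars = sum(len(s) for s in sources)
    tgt_chars = sum(len(t) for t in targets)
    if src_chars == 0:
        raise ValueError("source side is empty")
    return tgt_chars / src_chars
```

A ratio above 1.0 means the target side is longer than the source on average, which is the case for all three language pairs listed above.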


Given the requirement of isometric translation, submitted systems are evaluated along two dimensions: translation quality and length compliance.

Translation Quality (TQ)

As the goal in isometric MT is to control the translation length, an ideal translation quality metric should be robust to length variations in the translations. We considered the n-gram-based metric BLEU and the embedding-based metrics COMET and BERTScore. Our analysis showed that BERTScore is more robust to length variations in the hypotheses than BLEU and COMET, both of which tend to penalize short translations even when the semantics are preserved. Thus, we will use BERTScore to measure translation quality.

Length Compliance (LC)

We define length compliance (LC) as the percentage of translations in a given test set whose length falls within ±10% of the number of characters in the source sentence. That is, if the source length is 50 characters, a length-compliant translation is between 45 and 55 characters long. We count how many translations fall in this bracket and report the percentage over the test set. This threshold is motivated by a recent finding that source and target speech are easier to synchronize, for use cases like automatic dubbing, when the translation length stays within a ±10% range. In this evaluation, LC is applied only to translations longer than 10 characters.
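The LC definition above can be sketched in Python as follows. This is an illustration under stated assumptions, not the official evaluation script: the function names are hypothetical, the ±10% bounds are treated as inclusive (so 45 and 55 both count for a 50-character source), characters are counted as Unicode code points, and the 10-character filter is applied to the translation length as the text states (it could also be read as applying to the source).

```python
from typing import Sequence


def is_length_compliant(source: str, translation: str, tolerance_pct: int = 10) -> bool:
    """True if the translation length is within +/- tolerance_pct of the source length.

    Integer arithmetic avoids floating-point edge cases at the boundaries.
    """
    src_len = len(source)
    return abs(len(translation) - src_len) * 100 <= tolerance_pct * src_len


def length_compliance(
    sources: Sequence[str],
    translations: Sequence[str],
    min_chars: int = 10,
    tolerance_pct: int = 10,
) -> float:
    """Percentage of scored translations that are length compliant.

    Assumption: only translations longer than min_chars characters are scored.
    """
    if len(sources) != len(translations):
        raise ValueError("sources and translations must be parallel")
    scored = [(s, t) for s, t in zip(sources, translations) if len(t) > min_chars]
    if not scored:
        return 0.0
    hits = sum(is_length_compliant(s, t, tolerance_pct) for s, t in scored)
    return 100.0 * hits / len(scored)
```

For a 50-character source, translations of 45 and 55 characters count as compliant under these assumptions, while 44 and 56 do not.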

System Ranking

All submissions will be ranked based on a combination of BERTScore and Length Compliance.

Systems submitted for the above language pairs will be evaluated on MuST-C (tst-COMMON) and blind test sets (see the dataset section for more details). In addition to their primary system, participants are encouraged to submit multiple contrastive runs.


Training Sets

Participants may use the text-to-text training data available in the MuST-C v1.2 offline speech translation corpus (please refer to the offline speech translation task for more detail); the dataset is also available here. In addition, participants can use the latest parallel data from WMT for each of the language pairs for model training. The submission information should state the type and amount of data used for model training. Depending on the data used, submissions are divided into two training data regimes:

Constrained task

  • Can only use the textual MuST-C v1.2 data.

Unconstrained task

  • Can use the textual MuST-C v1.2 data, WMT data and pre-trained translation models.

Test Sets

We will evaluate the submitted systems on a test set and a blind set:

  • The MuST-C test set (tst-COMMON), which is a public dataset.
  • A blind set curated by the organizers for En → Fr, De, Es. Participants will get access to the English source sentences when the evaluation starts, and the references will be released after the shared task is completed.

System Submission

Participants are asked to submit the output of their system(s) for one or more of the evaluation language pairs. In addition to the system outputs, participants are required to report the performance of their system in terms of both the BERTScore and LC metrics. The organizers will use these figures to compare their own assessment of the submitted systems with that of the participants. Details of the evaluation script will be made available when the evaluation data is released.

The submission file should be packaged and named isometric_mt_participant-name.tar.gz.
The package should be organized per source-to-target language pair (such as En-Fr) and include:

  • tst-COMMON.SRC: the source file used for translation.
  • isometric-mt-id.TGT: the output of the participant's isometric MT system; id distinguishes the runs when several approaches/system runs are submitted.
  • should include
    • A brief description of each isometric-mt-id system submitted
    • System performance as computed by the participant
    • Training data condition
    • Institution/contact person

NB: both the source and the MT output should be submitted in a detokenized format.
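One possible way to build the archive described above is sketched below using Python's standard `tarfile` module. The helper name `package_submission` and the on-disk layout (one sub-directory per language pair, e.g. `En-Fr/`, inside a local submission folder) are assumptions made for illustration; only the archive name pattern and the per-pair organization come from the instructions above.

```python
import tarfile
from pathlib import Path


def package_submission(submission_dir, participant_name, out_dir="."):
    """Pack each language-pair sub-directory into isometric_mt_<name>.tar.gz.

    Assumption: submission_dir contains one folder per language pair
    (e.g. En-Fr/) holding the source file, the system outputs and the
    accompanying description.
    """
    submission_dir = Path(submission_dir)
    archive = Path(out_dir) / f"isometric_mt_{participant_name}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Add each language-pair folder at the top level of the archive.
        for pair_dir in sorted(p for p in submission_dir.iterdir() if p.is_dir()):
            tar.add(pair_dir, arcname=pair_dir.name)
    return archive
```

Calling `package_submission("submission", "participant-name")` would then write `isometric_mt_participant-name.tar.gz` to the current directory, with paths such as `En-Fr/tst-COMMON.SRC` inside the archive.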


  • Surafel M. Lakew (Amazon AI)
  • Prashant Mathur (Amazon AI)
  • Natawut Monaikul (Amazon AI)
  • Marcello Federico (Amazon AI)


Question and Discussion on the task: