Description

At ACL 2022, an ambitious 60-60 D&I Initiative was announced, targeting text and speech translation of the ACL Anthology and past recorded talks into 60 languages for the ACL’s 60th anniversary. Results of this ongoing effort will be shared with the community at ACL 2023, where IWSLT 2023 will be co-located. This track is a multilingual speech translation shared task evaluated on a subset of this data, intended to involve the IWSLT community and the broader community in this effort and to spur conversations about related methodology and progress.

Data

This task is about speech translation by and for our field. Specifically, this track targets translation of oral presentations from past ACL events into several languages. Talks cover a variety of technical content by speakers from around the world.

  • Evaluation data (development and test sets) consists of oral presentations from past ACL talks from the Anthology, with human post-edited transcripts and translations.
  • Training data includes publicly available corpora and pretrained models.
    • The source language and a subset of the target languages are shared with the other talk translation tracks.
    • The allowed training data is a superset of the data for all talk translation tracks: we include the same pretrained models and training corpora, with additional target languages.
    • We encourage joint submissions across tracks to enable additional analysis and conference discussion!

Training data

A constrained setting is proposed as the primary task condition, in which the allowed training data is limited to a medium-sized selection of corpora in order to keep training time and resource requirements manageable. To allow participants to leverage large multilingual models with medium-sized resources, particularly for this task where not all language pairs have similar amounts of public data, we also propose a “constrained with pretrained models” condition, in which a specific set of pretrained models may be used to extend the constrained setting. Finally, to encourage the participation of teams equipped with high computational power and additional resources to maximize performance on the task, an “unconstrained” setup without data restrictions is also proposed.

  • Constrained: Under this condition, only the training data listed below is allowed.

    Constrained training data

    Data type               | src lang | tgt lang | Training corpus (URL) | Version        | Comment
    speech                  | en       | –        | LibriSpeech           | v12            |
    speech                  | en       | –        | How2                  |                |
    speech                  | en       | –        | Mozilla Common Voice  | v11.0          |
    speech                  | en       | –        | TED LIUM              | v2/v3          |
    speech                  | en       | –        | Vox Populi            |                |
    speech-to-text-parallel | en       | all      | MuST-C                | v1.2/v2.0/v3.0 | (10) ar, zh, nl, fr, de, ja, fa, pt, ru, tr
    speech-to-text-parallel | en       | all      | CoVoST                | v2             | (10) ar, zh, nl, fr, de, ja, fa, pt, ru, tr
    speech-to-text-parallel | en       | all      | Europarl-ST           | v1.1           | (4) fr, de, pt, tr
    text-parallel           | en       | all      | Europarl              | v10            | (2) fr, de
    text-parallel           | en       | all      | Europarl              | v7             | (4) nl, fr, de, pt
    text-parallel           | en       | all      | NewsCommentary        | v16            | (8) ar, zh, nl, fr, de, ja, pt, ru
    text-parallel           | en       | all      | OpenSubtitles         | v2018          | (10) ar, zh, nl, fr, de, ja, fa, pt, ru, tr
    text-parallel           | en       | de       | TED2020               | v1             | (1) de
    text-parallel           | en       | all      | Tatoeba               | v2022-03-03    | (10) ar, zh, nl, fr, de, ja, fa, pt, ru, tr
    text-parallel           | en       | de       | ELRC-CORDIS_News      | v1             | (1) de

  • Constrained with pretrained models: Under this condition, all of the constrained resources plus a restricted selection of pretrained models are allowed. The pretrained models listed below are considered part of the training data and are freely usable to build submission systems.

    Constrained pretrained models

  • Unconstrained: Any resource (including additional datasets or pretrained language models) can be used, with the exception of the evaluation sets.

Development data

Development data will be released the week of February 6.

Languages

This task covers ten language pairs, with English as the source language and ten of the 60-60 languages as target languages. Given this number of target languages, participants are encouraged to pursue multilingual modeling and to submit results for all pairs (as opposed to training individual models for each language pair), though models of any type are allowed.

  • Source language: English
  • Target languages: Arabic, Chinese, Dutch, French, German, Japanese, Farsi, Portuguese, Russian, Turkish
    • Public corpora (e.g., MuST-C) are available for these language pairs for training
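
For reference, these target languages correspond to the two-letter codes used in the training-data table above; the mapping below assumes the same codes are expected in submission file names (e.g., en-de).

```python
# Target-language codes as they appear in the training-data table above.
# Assumption: these same codes are used in submission file names.
TARGET_LANGS = {
    "ar": "Arabic",   "zh": "Chinese",    "nl": "Dutch",
    "fr": "French",   "de": "German",     "ja": "Japanese",
    "fa": "Farsi",    "pt": "Portuguese", "ru": "Russian",
    "tr": "Turkish",
}
```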

Submission

Submissions should be compressed into a single .tar.gz file and emailed to the task organizers at iwslt.multilingual@gmail.com.
Translation into all 10 target languages is expected for the official ranking, though we also welcome submissions to a subset of language pairs, and we strongly encourage all participants to also submit English ASR output for analysis.
Submissions should consist of one plaintext file per language pair, with one sentence per line, pre-formatted for scoring (detokenized). Multiple submissions are allowed! If multiple outputs are submitted, one system must be explicitly marked as primary; otherwise the submission with the latest timestamp will be treated as primary.

File names should follow this structure:
<participant>.<constrained/unconstrained>.<primary/contrastive>.<src>-<tgt>.txt
e.g., jhu.constrained.primary.en-de.txt
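
As a sanity check before emailing, a minimal Python sketch (not an official tool; the output directory and the participant-name pattern are assumptions for illustration) that validates file names against this structure and packs them into a single .tar.gz:

```python
import re
import tarfile
from pathlib import Path

# Pattern from the task description:
# <participant>.<constrained/unconstrained>.<primary/contrastive>.<src>-<tgt>.txt
# The participant-name character set is an assumption.
NAME_RE = re.compile(
    r"^[a-z0-9_-]+\.(constrained|unconstrained)\.(primary|contrastive)\.en-[a-z]{2}\.txt$"
)

def package_submission(output_dir: str, archive: str = "submission.tar.gz") -> None:
    """Check file names in output_dir and bundle them into one .tar.gz."""
    files = sorted(Path(output_dir).glob("*.txt"))
    bad = [f.name for f in files if not NAME_RE.match(f.name)]
    if bad:
        raise ValueError(f"File names not matching the required pattern: {bad}")
    with tarfile.open(archive, "w:gz") as tar:
        for f in files:
            tar.add(f, arcname=f.name)  # flat archive, one file per system/pair

# e.g. package_submission("outputs/")  # outputs/jhu.constrained.primary.en-de.txt, ...
```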

Participants should specify in the submission email, for analysis, whether their submission uses multilingual models and whether it uses end-to-end or cascaded models. The training data and any pretrained models used should also be listed in the submission email; if data or pretrained models beyond the allowed list are used, the system should be marked unconstrained and will be ranked separately.

Evaluation

Output will be evaluated using multiple metrics for analysis: translation output with chrF, BLEU, and recent neural metrics, and ASR output with WER. Translation metrics will be calculated with case and punctuation. Official chrF and BLEU scores will be calculated after automatic resegmentation of the hypotheses against the reference translations with mwerSegmenter, though we will also compute segment-based scores for analysis. WER will be computed on lowercased text with punctuation and hesitations removed (handled by the scoring script, to be linked here after the data is released).
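
To preview scores locally, a minimal sketch using the sacrebleu package (the chrF/BLEU implementation named in the Ranking section) and the third-party jiwer package for WER (an assumption; the official script may differ) might look like the following. It assumes hypotheses have already been resegmented to match the references, e.g. with mwerSegmenter:

```python
import string

import sacrebleu  # chrF/BLEU implementation named in the Ranking section
import jiwer      # common third-party WER implementation (assumption)

def translation_scores(hyps: list[str], refs: list[str]) -> dict[str, float]:
    """Corpus chrF and BLEU, keeping case and punctuation as the task specifies."""
    return {
        "chrF": sacrebleu.corpus_chrf(hyps, [refs]).score,
        "BLEU": sacrebleu.corpus_bleu(hyps, [refs]).score,
    }

def asr_wer(hyps: list[str], refs: list[str]) -> float:
    """WER on lowercased text with punctuation stripped.
    Hesitation removal is left to the official scoring script."""
    def norm(s: str) -> str:
        return s.lower().translate(str.maketrans("", "", string.punctuation))
    return jiwer.wer([norm(r) for r in refs], [norm(h) for h in hyps])
```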

Ranking

The official task ranking will be based on the average chrF across the 10 translation language pairs, calculated with SacreBLEU. If a submission does not include a language pair, it will receive a chrF of 0 for that pair. ASR will be evaluated separately, though submitting ASR output as well is strongly encouraged. We will provide human evaluation for language pairs where possible; if all 10 language pairs can be covered, the average human system ranking will be the official task ranking.
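
For concreteness, the official automatic score could be computed as in this sketch, where chrf_by_pair is a hypothetical mapping from target-language code to corpus chrF:

```python
TARGET_CODES = ["ar", "zh", "nl", "fr", "de", "ja", "fa", "pt", "ru", "tr"]

def official_score(chrf_by_pair: dict[str, float]) -> float:
    """Average chrF over all 10 pairs; a missing pair contributes 0."""
    return sum(chrf_by_pair.get(tgt, 0.0) for tgt in TARGET_CODES) / len(TARGET_CODES)

# e.g. official_score({"de": 52.0, "fr": 55.3})  -> (52.0 + 55.3 + 8 * 0) / 10
```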

Organizers

  • Elizabeth Salesky (JHU)
  • Jan Niehues (KIT)
  • Mona Diab (Meta)

Contact

Chairs: iwslt.multilingual@gmail.com
Discussion: iwslt-evaluation-campaign@googlegroups.com