Multilingual track: ACL 60-60 initiative
Description
At ACL 2022, an ambitious 60-60 D&I Initiative was announced, targeting text and speech translation of the ACL Anthology and past recorded talks into 60 languages for the ACL’s 60th anniversary. Results of this ongoing effort will be shared with the community at ACL 2023, where IWSLT 2023 will be co-located. This track is a multilingual speech translation shared task evaluated on a subset of this data, intended to involve the IWSLT community and the broader community in this effort and to spur conversations about related methodology and progress.
Data
This task is about speech translation by and for our field. Specifically, this track targets translation of oral presentations from past ACL events into several languages. Talks cover a variety of technical content by speakers from around the world.
- Evaluation data (development and test sets) consists of oral presentations from past ACL talks in the Anthology, with human post-edited transcripts and translations.
- Training data includes publicly available corpora and pretrained models.
- The source language and a subset of the target languages are shared with the other talk translation tracks.
- The allowed training data is a superset of the data for all talk translation tracks: we include the same pretrained models and training corpora, with additional target languages.
- We encourage joint submissions across tracks to enable additional analysis and conference discussion!
Training data
A constrained setting is proposed as the primary task condition, in which the allowed training data is limited to a medium-sized set of corpora in order to keep the training time and resource requirements manageable. To allow participants to leverage large multilingual models with medium-sized resources, particularly for this task where not all language pairs have similar amounts of public data, we also propose a “constrained with pretrained models” condition, in which a specific set of pretrained models may be used to extend capabilities. Finally, to encourage the participation of teams equipped with high computational power and additional resources to maximize performance on the task, an “unconstrained” setup without data restrictions is also proposed.
- Constrained: Under this condition, only the training corpora listed under “Constrained training data” below are allowed.
- Constrained with pretrained models: Under this condition, all the constrained resources plus a restricted selection of pretrained models are allowed; these pretrained models are considered part of the training data and are freely usable to build submission systems.
- Unconstrained: Any resource (additional datasets or pretrained language models included) can be used, with the exception of evaluation sets.
Constrained training data
| Data type | src lang | tgt lang | Training corpus | Version | Comment |
|---|---|---|---|---|---|
| speech | en | -- | LibriSpeech | v12 | |
| speech | en | -- | How2 | | |
| speech | en | -- | Mozilla Common Voice | v11.0 | |
| speech | en | -- | TED LIUM | V2/V3 | |
| speech | en | -- | Vox Populi | | |
| speech-to-text-parallel | en | all | MuST-C | v1.2/v2.0/v3.0 | (10) ar, zh, nl, fr, de, ja, fa, pt, ru, tr |
| speech-to-text-parallel | en | all | CoVoST | v2 | (10) ar, zh, nl, fr, de, ja, fa, pt, ru, tr |
| speech-to-text-parallel | en | all | Europarl-ST | v1.1 | (4) fr, de, pt, tr |
| text-parallel | en | all | Europarl | v10 | (2) fr, de |
| text-parallel | en | all | Europarl | v7 | (4) nl, fr, de, pt |
| text-parallel | en | all | NewsCommentary | v16 | (8) ar, zh, nl, fr, de, ja, pt, ru |
| text-parallel | en | all | OpenSubtitles | v2018 | (10) ar, zh, nl, fr, de, ja, fa, pt, ru, tr |
| text-parallel | en | de | TED2020 | v1 | (1) de |
| text-parallel | en | all | Tatoeba | v2022-03-03 | (10) ar, zh, nl, fr, de, ja, fa, pt, ru, tr |
| text-parallel | en | de | ELRC-CORDIS_News | v1 | (1) de |
Development data
Development data will be released the week of February 6.
Languages
This task covers ten language pairs, with English as the source language and ten of the 60-60 languages as target languages. Given this number of target languages, participants are encouraged to pursue multilingual modeling and to submit results for all pairs (as opposed to training individual models for each language pair), though models of any type are allowed.
- Source language: English
- Target languages: Arabic, Chinese, Dutch, French, German, Japanese, Farsi, Portuguese, Russian, Turkish
- Publicly available training corpora (e.g. MuST-C) cover all of these language pairs.
Submission
Submissions should be compressed into a single .tar.gz file and emailed to the track organizers (iwslt.multilingual@gmail.com; see Contact below).
Translation into all 10 target languages is expected for official ranking, though we also encourage submissions to a subset of language pairs, and strongly encourage all participants to also submit English ASR for analysis.
Submissions should consist of plaintext files for each language pair with one sentence per line, pre-formatted for scoring (detokenized).
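For example, if a system produces tokenized output, one way to detokenize it before submission is the Moses detokenizer from the sacremoses package; this is an illustrative sketch, not an official requirement:

```python
from sacremoses import MosesDetokenizer

# One detokenizer per target language (example: German).
detok = MosesDetokenizer(lang="de")

tokens = ["Das", "ist", "ein", "Beispiel", "."]
line = detok.detokenize(tokens)
print(line)  # "Das ist ein Beispiel."
```

Note that target languages written without whitespace between words (e.g. Chinese, Japanese) need language-appropriate handling instead.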
Multiple submissions are allowed! If multiple outputs are submitted, one system must be explicitly marked as primary; otherwise, the submission with the latest timestamp will be treated as primary.
File names should follow this structure:
<participant>.<constrained/unconstrained>.<primary/contrastive>.<src>-<tgt>.txt
e.g., jhu.constrained.primary.en-de.txt
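As a minimal sketch of the expected layout (the team name, condition, and output lists below are hypothetical placeholders), the following Python writes one file per language pair under this naming scheme and packs everything into a single .tar.gz:

```python
import tarfile
from pathlib import Path

participant = "myteam"      # hypothetical team name
condition = "constrained"   # or "unconstrained"
system = "primary"          # or "contrastive"
targets = ["ar", "zh", "nl", "fr", "de", "ja", "fa", "pt", "ru", "tr"]

# Placeholder: fill with your detokenized hypotheses,
# one string per evaluation segment, in test-set order.
translations = {tgt: [] for tgt in targets}

out_dir = Path("submission")
out_dir.mkdir(exist_ok=True)

for tgt in targets:
    name = f"{participant}.{condition}.{system}.en-{tgt}.txt"
    with open(out_dir / name, "w", encoding="utf-8") as f:
        for line in translations[tgt]:
            f.write(line.strip() + "\n")  # one sentence per line

# Single archive for the submission email.
with tarfile.open(f"{participant}.tar.gz", "w:gz") as tar:
    for path in sorted(out_dir.iterdir()):
        tar.add(path, arcname=path.name)
```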
Participants should specify in the submission email whether their submission uses multilingual models, and whether it uses end-to-end or cascaded models, for analysis. Training data and any pretrained models used should also be specified in the submission email; if data or pretrained models beyond the allowed list are used, the system should be marked unconstrained and will be ranked separately.
Evaluation
System output will be evaluated using multiple metrics for analysis: translation output using chrF, BLEU, and recent neural metrics, and ASR output using WER. Translation metrics will be calculated with case and punctuation. Official chrF and BLEU scores will be calculated using automatic resegmentation of the hypothesis based on the reference translation by mwerSegmenter, though we will also compute segment-based scores for analysis. WER will be computed on lowercased text with punctuation and hesitations removed (handled by the scoring script, to be linked here after data is released).
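Until the official scoring script is released, scores can be approximated as in the sketch below; it skips mwerSegmenter resegmentation and uses an assumed (not official) hesitation list:

```python
import re
import sacrebleu

def translation_scores(hypotheses, references):
    """Corpus chrF and BLEU, keeping case and punctuation."""
    chrf = sacrebleu.corpus_chrf(hypotheses, [references])
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    return chrf.score, bleu.score

HESITATIONS = {"uh", "um", "eh", "hm", "mm"}  # assumed list, not the official one

def normalize_asr(text):
    """Lowercase, strip punctuation, drop hesitations (approximation)."""
    words = re.sub(r"[^\w\s']", " ", text.lower()).split()
    return [w for w in words if w not in HESITATIONS]

def wer(hypotheses, references):
    """Corpus WER via word-level edit distance on normalized text."""
    errors = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = normalize_asr(hyp), normalize_asr(ref)
        prev = list(range(len(h) + 1))
        for i, rw in enumerate(r, 1):
            cur = [i]
            for j, hw in enumerate(h, 1):
                cur.append(min(prev[j] + 1,                # deletion
                               cur[j - 1] + 1,             # insertion
                               prev[j - 1] + (rw != hw)))  # substitution
            prev = cur
        errors += prev[-1]
        ref_len += len(r)
    return errors / ref_len if ref_len else 0.0
```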
Ranking
The official task ranking will be based on the average chrF across the 10 translation language pairs, calculated with SacreBLEU. If a submission does not include a language pair, it will receive a score of 0 for that pair. ASR will be evaluated separately, though participants are strongly encouraged to submit ASR output as well. We will provide human evaluation for language pairs where available; if all 10 languages can be covered, average human system ranking will be the official task ranking.
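For concreteness, the ranking rule reduces to the following (the pair scores shown are hypothetical placeholders):

```python
TARGETS = ["ar", "zh", "nl", "fr", "de", "ja", "fa", "pt", "ru", "tr"]

def official_score(chrf_by_pair):
    """Average chrF across all 10 pairs; missing pairs count as 0."""
    return sum(chrf_by_pair.get(f"en-{tgt}", 0.0) for tgt in TARGETS) / len(TARGETS)

# A submission covering only two pairs is still averaged over all ten:
print(official_score({"en-de": 58.3, "en-fr": 61.0}))  # -> 11.93
```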
Organizers
- Elizabeth Salesky (JHU)
- Jan Niehues (KIT)
- Mona Diab (Meta)
Contact
Chairs: iwslt.multilingual@gmail.com
Discussion: iwslt-evaluation-campaign@googlegroups.com