Description

At ACL 2022, an ambitious 60-60 D&I Initiative was announced, targeting text and speech translation of the ACL Anthology and past recorded talks into 60 languages for the ACL’s 60th anniversary. Results of this ongoing effort will be shared with the community at ACL 2023, where IWSLT 2023 will be co-located. This track is a multilingual speech translation shared task evaluated on a subset of this data, intended to involve the IWSLT community and the broader community in this effort and to spur conversations about related methodology and progress.

Data

This task is about speech translation by and for our field. Specifically, this track targets translation of oral presentations from past ACL events into several languages. Talks cover a variety of technical content by speakers from around the world.

  • Evaluation data (development and test sets) consists of oral presentations from past ACL talks from the Anthology, with human post-edited transcripts and translations.
  • Training data includes publicly available corpora and pretrained models.
    • The source language and a subset of the target languages are shared with other talk translation tracks
    • Allowed training data is a superset of the data for all talk translation tracks - we include the same pretrained models and training corpora, with additional target languages
    • We encourage joint submissions across tracks to enable additional analysis and conference discussion!

Training data

Two training conditions are proposed. The first is a constrained setting with pretrained models, in which the allowed training data is limited to a medium-sized set to keep training time and resource requirements manageable, extended with a restricted selection of pretrained models so that participants can leverage existing multilingual models; this is particularly relevant for this task, where not all language pairs have similar amounts of public data. We also encourage the participation of teams equipped with high computational power and additional resources to maximize performance on the task, so an “unconstrained” setting without data restrictions is also proposed.

  • Constrained with pretrained models: Under this condition, all the constrained resources plus a restricted selection of pretrained models are allowed. The following pretrained models are considered part of the training data and freely usable to build submission systems:
    Constrained training data:

    Data type               | src lang | tgt lang | Training corpus      | Version        | Comment
    speech                  | en       | --       | LibriSpeech          | v12            |
    speech                  | en       | --       | How2                 |                |
    speech                  | en       | --       | Mozilla Common Voice | v11.0          |
    speech                  | en       | --       | TED LIUM             | V2/V3          |
    speech                  | en       | --       | Vox Populi           |                |
    speech-to-text-parallel | en       | all      | MuST-C               | v1.2/v2.0/v3.0 | (10) ar, zh, nl, fr, de, ja, fa, pt, ru, tr
    speech-to-text-parallel | en       | all      | CoVoST               | v2             | (10) ar, zh, nl, fr, de, ja, fa, pt, ru, tr
    speech-to-text-parallel | en       | all      | Europarl-ST          | v1.1           | (4) fr, de, pt, tr
    text-parallel           | en       | all      | Europarl             | v10            | (2) fr, de
    text-parallel           | en       | all      | Europarl             | v7             | (4) nl, fr, de, pt
    text-parallel           | en       | all      | NewsCommentary       | v16            | (8) ar, zh, nl, fr, de, ja, pt, ru
    text-parallel           | en       | all      | OpenSubtitles        | v2018          | (10) ar, zh, nl, fr, de, ja, fa, pt, ru, tr
    text-parallel           | en       | de       | TED2020              | v1             | (1) de
    text-parallel           | en       | ja       | JParaCrawl           |                | (1) ja
    text-parallel           | en       | all      | Tatoeba              | v2022-03-03    | (10) ar, zh, nl, fr, de, ja, fa, pt, ru, tr
    text-parallel           | en       | de       | ELRC-CORDIS_News     | v1             | (1) de
    Constrained pretrained models:
  • Unconstrained: Any resource (additional datasets or pretrained language models included) can be used, with the important exception of evaluation sets and any data from ACL 2022 not provided on this page.

Development data

To mimic realistic test conditions, where talk audio is provided as a single file rather than gold-segmented, we provide the full wav files together with automatically generated segments produced by SHAS as a baseline segmentation; the full wav files also enable research into alternative segmentation methods. To evaluate translation quality for system output with any input segmentation, we provide gold sentence-segmented transcripts and translations, against which system output can be scored after resegmentation, following the steps below.
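
If you experiment with your own segmentation, the full wav files can be cut into segment-level audio with standard tools. The sketch below is only an illustration: it assumes a hypothetical tab-separated segment list with start offset and duration in seconds, and should be adapted to the actual format of the provided SHAS segmentation.

# sketch: cut a full talk wav into segment-level wavs with sox
# assumes a hypothetical segments.tsv with two tab-separated columns per line:
#   <offset_seconds><TAB><duration_seconds>
wav=talk.wav
outdir=segments
mkdir -p ${outdir}

i=0
while IFS=$'\t' read -r offset duration; do
    i=$((i+1))
    # trim ${duration} seconds of audio starting at ${offset}
    sox ${wav} ${outdir}/$(printf "%04d" ${i}).wav trim ${offset} ${duration}
done < segments.tsv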

The development data is released here.

Evaluation data

The blind evaluation data follows the same format as above. References will be released after the eval period.

The evaluation data is released here.

Full Dataset with References

The full ACL 60-60 dataset with references is hosted on the ACL Anthology here.
If you use this data in your work, we ask that you please cite the dataset paper as below:

@inproceedings{salesky-etal-2023-evaluating,
    title = "Evaluating Multilingual Speech Translation under Realistic Conditions with Resegmentation and Terminology",
    author = "Salesky, Elizabeth  and
      Darwish, Kareem  and
      Al-Badrashiny, Mohamed  and
      Diab, Mona  and
      Niehues, Jan",
    booktitle = "Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada (in-person and online)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.iwslt-1.2",
    pages = "62--78",
    abstract = "We present the ACL 60/60 evaluation sets for multilingual translation of ACL 2022 technical presentations into 10 target languages. This dataset enables further research into multilingual speech translation under realistic recording conditions with unsegmented audio and domain-specific terminology, applying NLP tools to text and speech in the technical domain, and evaluating and improving model robustness to diverse speaker demographics.",
}

Languages

This task covers ten language pairs, with English as the source language and ten 60-60 languages as target languages. With this number of target languages, participants are encouraged to pursue multilingual modeling and submit results for all pairs (as opposed to individual models for each language pair), though models of any type are allowed.

  • Source language: English
  • Target languages: Arabic, Chinese, Dutch, French, German, Japanese, Farsi, Portuguese, Russian, Turkish
    • Publicly available corpora exist for these language pairs for training (e.g., MuST-C)

Submission

Submissions should be compressed into a single .tar.gz file and emailed here.
Translation into all 10 target languages is expected for the official ranking, though submissions covering only a subset of language pairs are also encouraged, and all participants are strongly encouraged to also submit English ASR output for analysis.
Submissions should consist of plaintext files for each language pair with one sentence per line, pre-formatted for scoring (detokenized!). Multiple submissions are allowed! If multiple outputs are submitted, one system must be explicitly marked as primary, or the submission with the latest timestamp will be treated as primary.

File names should follow the following structure:
<participant>.<constrained/unconstrained>.<primary/contrastive>.<src>-<tgt>.txt
e.g., jhu.unconstrained.primary.en-de.txt

Participants should specify in the submission email whether their submission uses multilingual models and whether it uses end-to-end or cascaded models, for analysis. Training data and any pretrained models used should also be specified in the submission email; if data or pretrained models beyond the allowed list are used, the system should be marked unconstrained and will be ranked separately.
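
As an illustration, a submission covering all ten pairs might be packaged as follows. This is a minimal sketch: only the required file-naming pattern is prescribed, and the participant name, output directory, and archive name are placeholders.

# minimal packaging sketch; only the
# <participant>.<constrained/unconstrained>.<primary/contrastive>.<src>-<tgt>.txt
# naming pattern is required, everything else here is a placeholder
team=myteam
mkdir -p submission
for tgt in ar zh nl fr de ja fa pt ru tr; do
    cp outputs/${team}.constrained.primary.en-${tgt}.txt submission/
done
tar -czvf ${team}.submission.tar.gz -C submission .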

Evaluation

System output will be evaluated using multiple metrics for analysis: translation output with chrF, BLEU, and recent neural metrics, and ASR output with WER. Translation metrics will be calculated with case and punctuation. WER will be computed on lowercased text with punctuation removed. Official metric scores will be calculated using automatic resegmentation of the hypothesis based on the reference transcripts (ASR) or translations (MT) with mwerSegmenter.
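
As a concrete illustration of the WER normalization described above (not the official scoring script; file names are placeholders), lowercasing and punctuation removal could look like this:

# sketch: normalize ASR hypothesis and reference for WER
# (lowercase, remove punctuation); file names are placeholders
for f in asr_hyp.txt asr_ref.txt; do
    tr '[:upper:]' '[:lower:]' < ${f} | tr -d '[:punct:]' > ${f%.txt}.norm.txt
done
# compute WER on the *.norm.txt files with your preferred WER tool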

Ranking

The official task ranking will be based on the average chrF across the 10 translation language pairs, calculated with SacreBLEU. If a submission does not include a language pair, it will receive a score of 0 for that pair. ASR will be evaluated separately, though participants are strongly encouraged to submit ASR output as well. We will provide human evaluation for language pairs where available; if we are able to provide human evaluation for all 10 languages, the average human system ranking will be the official task ranking.
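
To illustrate how the ranking score is formed, the sketch below averages per-pair chrF over the 10 language pairs, counting a missing pair as 0. The per-pair score files are hypothetical placeholders; see the Metrics section below for how each per-language score is produced.

# sketch: average chrF over the 10 pairs, treating a missing pair as 0
# assumes hypothetical files chrf.en-<tgt>.txt each holding one chrF score
total=0
for tgt in ar zh nl fr de ja fa pt ru tr; do
    if [ -f chrf.en-${tgt}.txt ]; then
        score=$(cat chrf.en-${tgt}.txt)
    else
        score=0
    fi
    total=$(echo "${total} + ${score}" | bc -l)
done
echo "average chrF: $(echo "${total} / 10" | bc -l)"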

Metrics

To compute official metrics, first download and install mwerSegmenter following the instructions in mwerSegmenter/README. Then, install SacreBLEU.

wget https://www-i6.informatik.rwth-aachen.de/web/Software/mwerSegmenter.tar.gz
tar -zxvf mwerSegmenter.tar.gz
# set up following mwerSegmenter/README

pip install sacrebleu

Then, given raw text translation output, run mwerSegmenter to segment it to match the reference, and evaluate with SacreBLEU:

# example: en-de
tgt=de
src=IWSLT.ACLdev2023/text/IWSLT.ACL.ACLdev2023.en-xx.en.xml
ref=IWSLT.ACLdev2023/text/IWSLT.ACL.ACLdev2023.en-xx.${tgt}.xml
out=outs/IWSLT.ACLdev2023.en-${tgt}.hyp
sys=baseline

grep "<seg id" ${ref} | sed -e "s/<[^>]*>//g" > ${ref%.xml}.txt

mwerSegmenter/segmentBasedOnMWER.sh ${src} ${ref} ${out} ${sys} ${tgt} ${out}.sgm no_normalize 1
sed -e '/^<\/\?seg\|^<\/\?doc\|^<\/\?tstset/d' ${out}.sgm > ${out}.final

conda activate py3
sacrebleu ${ref%.xml}.txt -i ${out}.final -m chrf

Note: unfortunately, mwerSegmenter requires Python 2 while sacrebleu requires Python 3, so you may need to switch environments between steps as shown.

Notes on Metric Tokenizers

We use chrF as the primary metric, which enables use of the same metric for all target languages. For some languages, in particular those that do not mark word boundaries with whitespace (Chinese, Japanese, Korean), it is recommended to use language-specific tokenization to calculate BLEU. Similarly, mwerSegmenter uses whitespace and segment boundaries for resegmentation, which for these languages may require character-level or language-specific tokenization. We will use the language-specific tokenizers recommended in sacrebleu (zh, ja-mecab, ko-mecab) for Chinese, Japanese, and Korean; note, though, that BLEU will be an unofficial metric. For all other languages we will use default metric tokenization (13a in sacrebleu, XLM-R tokenization for COMET). For chrF, language-specific tokenization should not change the score. It is important that you submit detokenized ASR and MT outputs so that metric tokenization can be applied appropriately.
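
For example, with sacrebleu the official chrF is computed the same way for every language, while unofficial BLEU for Chinese or Japanese uses the language-specific tokenizers mentioned above (file names are placeholders):

# chrF (official metric): identical invocation for every target language
sacrebleu ref.en-zh.txt -i hyp.en-zh.txt -m chrf

# unofficial BLEU with language-specific tokenization
sacrebleu ref.en-zh.txt -i hyp.en-zh.txt -m bleu --tokenize zh
sacrebleu ref.en-ja.txt -i hyp.en-ja.txt -m bleu --tokenize ja-mecab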

Organizers

  • Elizabeth Salesky (JHU)
  • Jan Niehues (KIT)
  • Mona Diab (Meta)

Contact

Chairs: iwslt.multilingual@gmail.com
Discussion: iwslt-evaluation-campaign@googlegroups.com