Open Domain Translation

Task Description

The goal of this shared task is to promote:

  1. research on translation between Asian languages,
  2. exploitation of noisy, parallel web corpora for MT, and
  3. smart processing of data and provenance.

We provide a large, noisy set of Japanese-Chinese segment pairs built from web data. We evaluate system translations on a (secret) mixed-genre test set, curated to contain high-quality segment pairs. After receiving the test data, participants have one week to submit translations. After all submissions are received, we will post a populated leaderboard that will continue to accept post-evaluation submissions.

Evaluation Conditions

We offer two tasks:

  1. Japanese-to-Chinese MT
  2. Chinese-to-Japanese MT

We encourage participants to take part in both tasks.

The evaluation metric for the shared task is 4-gram character-level BLEU. The script to be used for BLEU computation is <link>. Instructions to run the script: <fill-in>
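
To illustrate what 4-gram character-level BLEU measures, here is a minimal Python sketch. It is not the official scoring script referenced above; the function names are hypothetical, and it assumes one reference per segment, no smoothing, and that whitespace is ignored when splitting text into characters. The official script may differ on any of these points.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring whitespace (an assumption)."""
    chars = [c for c in text if not c.isspace()]
    return Counter(tuple(chars[i:i + n]) for i in range(len(chars) - n + 1))


def char_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU over characters, with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += sum(1 for c in hyp if not c.isspace())
        ref_len += sum(1 for c in ref if not c.isspace())
        for n in range(1, max_n + 1):
            hyp_counts = char_ngrams(hyp, n)
            ref_counts = char_ngrams(ref, n)
            # Clipped matches: an n-gram counts at most as often as it
            # occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any n-gram order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the 1- to max_n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty for hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    print(char_bleu(["今日は晴れです"], ["今日は天気です"]))
```

Working at the character level avoids the need for word segmentation, which is not standardized for either Chinese or Japanese.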

Allowed Training data

We encourage participants to use only the provided parallel training data. Use of other data is allowed, if thoroughly documented. Participants must be willing to write a system description paper (in English), to promote knowledge sharing and rapid advancement of the field.

In addition to the web data, we also provide existing Japanese-Chinese parallel data from various public sources (parallel data released as part of previous Japanese-Chinese MT efforts or available as downloadable resources).

Format of the files being released:

  1. web_crawled_parallel_filtered.tar.gz [LINK] : 3 files (zh, ja, domains) of the sentences that we obtained by crawling the Web, aligning and filtering.
  2. existing_parallel.tar.gz [LINK] : 3 files (zh, ja, domains) of the sentences that we obtained by curating existing Japanese-Chinese parallel datasets.
  3. web_crawled_parallel_unfiltered.tar.gz [LINK] : 4 files (zh, ja, domains, hunalign_scores) of the pre-filtered sentences.
  4. web_crawled_unaligned.tar.gz [LINK] : 2 files (zh, ja) of the scraped text with the document boundaries.
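
Participants who want to apply their own filtering to the unfiltered release could start from something like the Python sketch below. It rests on assumptions not stated above: that the zh, ja and hunalign_scores files are line-aligned (line i of each file describes the same segment pair), that each score line holds a single number, and that higher scores indicate better alignments. The function name and the threshold value are hypothetical.

```python
def filter_by_alignment_score(zh_lines, ja_lines, score_lines, min_score):
    """Yield (zh, ja) pairs whose alignment score is at least min_score.

    Assumes the three inputs are line-aligned iterables of strings.
    """
    for zh, ja, score in zip(zh_lines, ja_lines, score_lines):
        if float(score) >= min_score:
            yield zh.rstrip("\n"), ja.rstrip("\n")


if __name__ == "__main__":
    # Tiny in-memory stand-in for the three files.
    zh = ["你好\n", "谢谢\n"]
    ja = ["こんにちは\n", "さようなら\n"]
    scores = ["0.9\n", "-0.3\n"]
    for pair in filter_by_alignment_score(zh, ja, scores, min_score=0.5):
        print(pair)
```

With the real data, the three arguments would be file objects opened with encoding="utf-8"; the actual file names inside the archive should be checked after extraction. Because the function is a generator, it streams through the files without loading tens of millions of segments into memory.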

Statistics on the training data

Source                      # parallel segments   # chars (Chinese side)
Web Crawled                 60,103,053            2,405,122,355
Existing parallel sources   1,963,238             33,522,339
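
As a quick sanity check on these figures, the short Python sketch below derives the average segment length on the Chinese side for each source. The variable and function names are illustrative only; the counts are copied from the statistics above.

```python
# (number of parallel segments, number of Chinese characters) per source
stats = {
    "Web Crawled": (60_103_053, 2_405_122_355),
    "Existing parallel sources": (1_963_238, 33_522_339),
}


def mean_chars_per_segment(segments, chars):
    """Average number of Chinese characters per segment."""
    return chars / segments


averages = {name: mean_chars_per_segment(*counts) for name, counts in stats.items()}

if __name__ == "__main__":
    for name, avg in averages.items():
        print(f"{name}: {avg:.1f} chars per segment")
```

The web-crawled segments come out at roughly 40 Chinese characters each, against roughly 17 for the existing parallel sources, so the two sources differ noticeably in segment length.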

Development and evaluation dataset

The development and evaluation datasets will consist of data from a diverse set of domains.

The development data will be representative of the data in the secret test sets on which participating systems will be evaluated.

Submission is through CodaLab.

Link to CodaLab: https://competitions.codalab.org/competitions/21430

Registration is required to download the data and participate in the competition.

UPDATE: We released the training and development datasets on Jan 17, 2020 (and approved all the participants who registered). Please register to access the datasets and participate in the competition. In the coming week we will also release a baseline model, instructions to train the model, and a leaderboard with the baseline results, so stay tuned! Please feel free to reach out to Ajay Nagesh (ajaynagesh@didiglobal.com) if you have any questions or concerns. Looking forward to your participation!

Contacts

Chair: Ajay Nagesh (DiDi Labs, USA)
Discussion: iwslt-evaluation-campaign@googlegroups.com

Organizers

Amittai Axelrod (DiDi Labs, USA)
Arkady Arkhangorodsky (DiDi Labs, USA)
Boliang Zhang (DiDi Labs, USA)
Xing Shi (DiDi Labs, USA)