Open Domain Translation
The goal of this shared task is to promote:
- research on translation between Asian languages,
- exploitation of noisy, parallel web corpora for MT, and
- smart processing of data and provenance.
We provide a large, noisy set of Japanese-Chinese segment pairs built from web data. We evaluate system translations on a (secret) mixed-genre test set curated to contain high-quality segment pairs. After receiving the test data, participants have one week to submit translations. Once all submissions are received, we will post a populated leaderboard that will continue to accept post-evaluation submissions.
We offer two tasks:
- Japanese-to-Chinese MT
- Chinese-to-Japanese MT
We encourage participants to take part in both tasks.
The evaluation metric for the shared task is 4-gram character BLEU. The script to be used for BLEU computation is <link>. Instructions to run the script: <fill-in>
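To make the metric concrete, here is a minimal sketch of corpus-level 4-gram character BLEU with a single reference per segment. It is not the official scoring script. The function names `char_ngrams` and `char_bleu` are hypothetical, and the choices made here (whitespace is dropped, no smoothing, standard brevity penalty) are assumptions that may differ from the official implementation.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams in text, ignoring whitespace (an assumption)."""
    chars = [c for c in text if not c.isspace()]
    return Counter(tuple(chars[i:i + n]) for i in range(len(chars) - n + 1))


def char_bleu(hypotheses, references, max_n=4):
    """Corpus-level character BLEU on a 0-100 scale, one reference per segment."""
    matches = [0] * max_n  # clipped n-gram matches, per order
    totals = [0] * max_n   # hypothesis n-gram counts, per order
    hyp_len = 0
    ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += sum(1 for c in hyp if not c.isspace())
        ref_len += sum(1 for c in ref if not c.isspace())
        for n in range(1, max_n + 1):
            hyp_counts = char_ngrams(hyp, n)
            ref_counts = char_ngrams(ref, n)
            # Counter intersection keeps the minimum count, i.e. clipping.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)
```

Scoring on characters rather than words avoids any dependence on a Chinese or Japanese word segmenter, which is presumably why a character-level metric suits this language pair. Reported results should come from the official script.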
Allowed Training data
We encourage participants to use only the provided parallel training data. Use of other data is allowed, if thoroughly documented. Participants must be willing to write a system description paper (in English), to promote knowledge sharing and rapid advancement of the field.
In addition to the web data, we also provide existing Japanese-Chinese parallel data from various public sources (parallel data released as part of previous Japanese-Chinese MT efforts or available as downloadable resources).
Format of the files being released:
- web_crawled_parallel_filtered.tar.gz [LINK]: 3 files (zh, ja, domains) of the sentences that we obtained by crawling the web, aligning, and filtering.
- existing_parallel.tar.gz [LINK]: 3 files (zh, ja, domains) of the sentences that we obtained by curating existing Japanese-Chinese parallel datasets.
- web_crawled_parallel_unfiltered.tar.gz [LINK]: 4 files (zh, ja, domains, hunalign_scores) of the web-crawled sentences before filtering.
- web_crawled_unaligned.tar.gz [LINK]: 2 files (zh, ja) of the scraped text with the document boundaries.
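As an illustration of how the unfiltered release might be used, the sketch below keeps only segment pairs whose alignment score reaches a chosen threshold. It assumes that the zh, ja, and hunalign_scores files are UTF-8 text, line-aligned, with one segment (or one numeric score) per line. The release notes do not confirm this layout here, and the helper names, file paths, and threshold value are all hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def read_lines(path: str) -> Iterator[str]:
    """Yield lines from a UTF-8 text file without the trailing newline."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


def filter_by_score(
    zh_lines: Iterable[str],
    ja_lines: Iterable[str],
    score_lines: Iterable[str],
    threshold: float,
) -> Iterator[Tuple[str, str]]:
    """Yield (zh, ja) pairs whose score is at least `threshold`.

    Pairs with an unparsable score or an empty side are skipped.
    """
    for zh, ja, score in zip(zh_lines, ja_lines, score_lines):
        try:
            value = float(score)
        except ValueError:
            continue
        if zh.strip() and ja.strip() and value >= threshold:
            yield zh, ja


# Hypothetical usage; the actual file names inside the archive may differ:
# pairs = filter_by_score(
#     read_lines("unfiltered.zh"),
#     read_lines("unfiltered.ja"),
#     read_lines("unfiltered.hunalign_scores"),
#     threshold=0.5,  # illustrative value, not a recommendation
# )
```

Because the function streams its inputs, it can process files that do not fit in memory. A suitable threshold would need to be tuned on the development data.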
Statistics on the training data
|Source|# parallel segments|# chars (Chinese side)|
|Existing parallel sources|1,963,238|33,522,339|
Development and evaluation dataset
The development and evaluation datasets will consist of data from a diverse set of domains.
The development data will be representative of the data in the secret test sets on which participating systems will be evaluated.
Link to training datasets and submission instructions
Submission is through CodaLab.
Link to CodaLab: https://competitions.codalab.org/competitions/21430
Registration is required to download the data and participate in the competition.
UPDATE: We released the training and development datasets on Jan 17, 2020 (and approved all the participants who registered). Please register to access the datasets and participate in the competition. We will also be releasing a baseline model, instructions to train the model, and a leaderboard with the baseline results in the coming week, so stay tuned! Please feel free to reach out to Ajay Nagesh (email@example.com) if you have any questions or concerns. We look forward to your participation!
Chair: Ajay Nagesh (DiDi Labs, USA)
Amittai Axelrod (DiDi Labs, USA)
Arkady Arkhangorodsky (DiDi Labs, USA)
Boliang Zhang (DiDi Labs, USA)
Xing Shi (DiDi Labs, USA)