Open Domain Translation
The goal of this shared task is to promote:
- research on translation between Asian languages,
- exploitation of noisy, parallel web corpora for MT, and
- smart processing of data and provenance.
We provide a large, noisy set of Japanese-Chinese segment pairs built from web data. We evaluate system translations on a (secret) mixed-genre test set, curated to contain high-quality segment pairs. After receiving the test data, participants have one week to submit translations. Once all submissions are received, we will post a populated leaderboard that will continue to accept post-evaluation submissions.
To participate (and obtain access to the data), please register for the shared task on the competition page on CodaLab (link). You need to sign up on the CodaLab platform before registering for the competition.
We offer two tasks:
- Japanese-to-Chinese MT
- Chinese-to-Japanese MT
We encourage participants to take part in both tracks.
The evaluation metric for the shared task is 4-gram character BLEU. The script to be used for BLEU computation is here (almost identical to the one in Moses, with a few minor differences). Instructions for running the script are in the baseline code that we released for the shared task. (link)
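To make the metric concrete, here is a minimal sketch of corpus-level character BLEU with n-grams up to order 4. It is not the official scoring script: the function names (`char_ngrams`, `char_bleu`), the choice to ignore whitespace, the single-reference setup, and the absence of smoothing are all assumptions made for illustration, so its numbers may differ from the official ones.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n (assumption: whitespace is ignored)."""
    chars = [c for c in text if not c.isspace()]
    return Counter(tuple(chars[i:i + n]) for i in range(len(chars) - n + 1))


def char_bleu(hypotheses, references, max_n=4):
    """Corpus-level character BLEU, one reference per hypothesis, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += sum(1 for c in hyp if not c.isspace())
        ref_len += sum(1 for c in ref if not c.isspace())
        for n in range(1, max_n + 1):
            hyp_counts = char_ngrams(hyp, n)
            ref_counts = char_ngrams(ref, n)
            # Clipped matches: each hypothesis n-gram is credited at most
            # as many times as it occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)
```

For example, `char_bleu(["今天天气很好"], ["今天天气很好"])` scores a perfect match, while a hypothesis sharing no characters with its reference scores zero. Official results should always be computed with the script referenced above.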
Allowed Training data
We encourage participants to use only the provided parallel training data. Use of other data is allowed, if thoroughly documented (and in principle, publicly available). Participants must be willing to write a system description paper (in English), to promote knowledge sharing and rapid advancement of the field.
In addition to the web data, we also provide existing Japanese-Chinese parallel data from various public sources (parallel data released as part of previous Japanese-Chinese MT efforts or available as downloadable resources).
Format of the files being released:
- web_crawled_parallel_filtered.tar.gz : 3 files (zh, ja, domains) of the sentences that we obtained from crawling the Web, aligning and filtering.
- existing_parallel.tar.gz : 3 files (zh, ja, domains) of the sentences that we obtained from curating existing Japanese-Chinese parallel datasets.
- web_crawled_parallel_unfiltered.tar.gz : 3 files (zh, ja, domains) of the crawled and aligned sentences before filtering.
- web_crawled_unaligned.tar.gz : 2 files (zh, ja) of the scraped text, with document boundaries.
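As an illustration of how the three-file parallel archives might be consumed, the sketch below reads the zh, ja, and domains files in lockstep. It assumes that the files are UTF-8 and line-aligned (line i of each file describes the same segment pair) and that the domains file holds one entry per line; neither the exact layout nor the file names inside the archives are stated here, and `read_parallel` is a hypothetical helper, not part of the released baseline.

```python
from itertools import zip_longest


def read_parallel(zh_path, ja_path, domains_path):
    """Yield (zh, ja, domain) triples from three line-aligned UTF-8 files.

    Assumption: line i of each file belongs to the same segment pair.
    """
    with open(zh_path, encoding="utf-8") as zh_f, \
         open(ja_path, encoding="utf-8") as ja_f, \
         open(domains_path, encoding="utf-8") as dom_f:
        for line_no, triple in enumerate(zip_longest(zh_f, ja_f, dom_f), start=1):
            # zip_longest pads the shorter files with None, which exposes
            # files that are not the same length.
            if any(part is None for part in triple):
                raise ValueError(f"files are not line-aligned at line {line_no}")
            zh, ja, domain = (part.rstrip("\n") for part in triple)
            yield zh, ja, domain
```

A caller could then iterate over `read_parallel(...)` with the paths of the extracted files and, for instance, group or filter segment pairs by their domain entry.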
Please note: Access to the data requires registration for the shared task on the CodaLab platform (link below).
Statistics on the training data
|Source||# parallel segments||# chars (Chinese side)|
|Existing parallel sources||1,963,238||33,522,339|
Development and evaluation dataset
The development and evaluation datasets will consist of data from a diverse set of domains.
Link to datasets and participation
Data download and submission of system runs are both handled through CodaLab.
Links to CodaLab: https://competitions.codalab.org/competitions/21430 (JA → ZH), https://competitions.codalab.org/competitions/23892 (ZH → JA)
Requires registration (link) to download the data and participate in the competition.
UPDATE: We released the training and development datasets on Jan 17, 2020. Please register to access the datasets and participate in the competition. We also released the baseline model code with instructions to train a neural MT system. (link) Please feel free to reach out to Ajay Nagesh (email@example.com) in case you have any questions or concerns. Looking forward to your participation!
Chair: Ajay Nagesh (DiDi Labs, USA)
Amittai Axelrod (DiDi Labs, USA)
Arkady Arkhangorodsky (DiDi Labs, USA)
Boliang Zhang (DiDi Labs, USA)
Xing Shi (DiDi Labs, USA)
Yiqi Huang (DiDi Labs, USA)