open_domain_translation [2020/04/23 00:14] (current)
anagesh
  
We provide a large, noisy set of Japanese-Chinese segment pairs built from web data. We evaluate system translations on a (secret) mixed-genre test set, curated for high quality segment pairs. After receiving test data, participants have one week to submit translations. After all submissions are received, we will post a populated leaderboard that will continue to receive post-evaluation submissions.

**To participate (and obtain access to the data):** Please register for the shared task competition page on **Codalab ([[https://competitions.codalab.org/competitions/21430#participate|ja->zh]], [[https://competitions.codalab.org/competitions/23892#participate|zh->ja]])**. (You need to sign up on the CodaLab platform before registering for the competition.)
  
=== Evaluation Conditions ===
We encourage participants to take part in both tracks.
  
The evaluation metric for the shared task is 4-gram character BLEU. The script for BLEU computation is [[https://github.com/didi/iwslt2020_open_domain_translation/blob/master/scripts/multi-bleu-detok.perl|here]] (almost identical to the one in [[https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu-detok.perl|Moses]], with a few minor differences). Instructions for running the script are in the baseline code released for the shared task ([[https://github.com/didi/iwslt2020_open_domain_translation|link]]).
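For illustration only, character-level 4-gram BLEU can be sketched in Python as below. This is a simplified single-reference version with an assumed floor value for zero n-gram matches; the official score is computed by the Perl script linked above, which differs in detail.

```python
import math
from collections import Counter

def char_ngrams(text, n):
    """Return a Counter of character n-grams of a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def char_bleu(hypothesis, reference, max_n=4):
    """Simplified character-level BLEU with brevity penalty.

    Single reference, geometric mean of n-gram precisions up to max_n.
    The 1e-9 floor for zero matches is an assumption of this sketch,
    not the behavior of the official multi-bleu-detok.perl script.
    """
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        # Clipped counts: each hypothesis n-gram matches at most its
        # count in the reference.
        overlap = sum(min(count, ref[ng]) for ng, count in hyp.items())
        total = max(sum(hyp.values()), 1)
        log_precisions.append(math.log(overlap / total) if overlap else math.log(1e-9))
    # Brevity penalty for hypotheses shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / max(len(hypothesis), 1)))
    return bp * math.exp(sum(log_precisions) / max_n)

print(char_bleu("今天天气很好", "今天天气很好"))  # identical strings score 1.0
```

Character-level matching sidesteps word segmentation, which is why it is a common choice for Chinese and Japanese evaluation.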
  
=== Allowed Training data ===
  
We encourage participants to use only the provided parallel training data. Use of other data is allowed if thoroughly documented (and, in principle, publicly available). Participants must be willing to write a system description paper (in English) to promote knowledge sharing and rapid advancement of the field.
  
In addition to the web data, we also provide existing Japanese-Chinese parallel data from various public sources (parallel data released as part of previous Japanese-Chinese MT efforts or available as downloadable resources).

Format of the files being released:
  - web_crawled_parallel_filtered.tar.gz : 3 files (zh, ja, domains) of the sentences that we obtained from crawling the web, aligning, and filtering.
  - existing_parallel.tar.gz : 3 files (zh, ja, domains) of the sentences that we obtained from curating existing Japanese-Chinese parallel datasets.
  - web_crawled_parallel_unfiltered.tar.gz : 3 files (zh, ja, domains) of the pre-filtered sentences.
  - web_crawled_unaligned.tar.gz : 2 files (zh, ja) of the scraped text with the document boundaries.

**Please note**: Getting access to the data requires registration for the shared task on the CodaLab platform (links below).
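The zh, ja, and domains files in each archive are line-aligned, so the i-th line of each file belongs to the same segment pair. A minimal reading sketch (the file paths are placeholders; the exact member names inside the archives are an assumption):

```python
def read_parallel(zh_path, ja_path, domains_path):
    """Yield (zh, ja, domain) triples from three line-aligned files.

    Paths are placeholders for the extracted archive members; adjust
    them to whatever names the downloaded archives actually contain.
    """
    with open(zh_path, encoding="utf-8") as zh_f, \
         open(ja_path, encoding="utf-8") as ja_f, \
         open(domains_path, encoding="utf-8") as dom_f:
        # zip stops at the shortest file, so a truncated file will
        # silently drop trailing pairs; check line counts up front.
        for zh, ja, dom in zip(zh_f, ja_f, dom_f):
            yield zh.strip(), ja.strip(), dom.strip()
```

Keeping the three streams zipped together preserves the pairing even if you later filter or shuffle the segments.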
  
Statistics on the training data

The development and evaluation datasets will consist of data from a diverse set of domains.
  
=== Link to datasets and participation ===
  
Links to the data and submission of participating system runs are handled through CodaLab.
  
Links to CodaLab: https://competitions.codalab.org/competitions/21430 (JA --> ZH), https://competitions.codalab.org/competitions/23892 (ZH --> JA)
  
Requires registration ([[https://competitions.codalab.org/competitions/21430#participate|link]]) to download the data and participate in the competition.
  
**__UPDATE__**: We released the training and development datasets on Jan 17, 2020. Please register to access the datasets and participate in the competition.
  
We also released the baseline model code, with instructions to train a neural MT system ([[https://github.com/didi/iwslt2020_open_domain_translation|link]]).
  
Please feel free to reach out to Ajay Nagesh (ajaynagesh@didiglobal.com) in case you have any questions or concerns. Looking forward to your participation!
Boliang Zhang (DiDi Labs, USA)\\
Xing Shi (DiDi Labs, USA)\\
Yiqi Huang (DiDi Labs, USA)\\