This challenge focuses on one such question: **what is the best way to produce fluent translations from disfluent speech?** Further information about disfluent conversational speech can be found below.
  
This task uses a smaller dataset than other tasks, which we hope some groups may find more approachable.
To enable wide participation, we offer multiple options:
  * We ask for submissions which translate from **speech**, or **text-only** using provided ASR output
  * We encourage systems with both **constrained** (Fisher only) and **unconstrained** (open) data conditions
Submitted systems will be ranked in terms of multiple automatic metrics, including BLEU and METEOR.
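Participants can sanity-check their outputs against these metrics before submitting. Below is a simplified, self-contained sketch of corpus-level BLEU (clipped n-gram precision with a brevity penalty); for official-style scoring, use an established tool such as sacrebleu. The example sentences are illustrative only.

```python
# Simplified corpus BLEU: clipped n-gram precisions (n=1..4), geometric
# mean, brevity penalty. Educational sketch, not the official scorer.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """BLEU over whitespace-tokenized hypothesis/reference string pairs."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            # Counter intersection implements clipping against the reference.
            matches[n - 1] += sum((ngrams(h, n) & ngrams(r, n)).values())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(matches) == 0:  # any zero precision collapses the geometric mean
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)

score = corpus_bleu(["yes i think that is a good idea"],
                    ["yes i think that is a great idea"])
print(f"BLEU = {score:.2f}")
```

Note this sketch assumes a single reference per utterance; real scorers also support multiple references and smoothing for short segments.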
  
==== Data ====
This task uses the LDC Fisher Spanish speech (**disfluent**) with new target translations (**fluent**).
This dataset has **160 hours** of speech (138k utterances): this is a smaller dataset than other tasks, which we hope some groups may find more approachable.
  
We provide multi-way parallel data for experimentation:
We have arranged an evaluation license agreement with the LDC where all participants may receive this data without cost for the purposes of this task. __license agreement__: {{ :iwslt_2020_ldc_evaluation_agreement.pdf |}}
  
Participants should sign the license agreement and follow the directions in the PDF to return a signed copy to LDC (by [[ldc@ldc.upenn.edu,elizabeth.salesky+iwslt2020@gmail.com|email]] or fax). Once received, LDC will provide a download link for the data package within 1-2 days. Participants who do not already have an LDC account will need to create one to download the data; the LDC membership office will assist with any questions. Test data will be automatically distributed to the LDC accounts of participants who have registered for the training data.
  
To enable immediate participation, we provide preprocessed speech features, with mapped (parallel) speech and text (transcript and translation) utterances in the IWSLT data package.
We note that the original speech and translations require a mapping step to be made parallel, and we provide [[https://github.com/esalesky/fisher-mapping|code]] to do so within the data package (further details in the data package README). This is only necessary if you wish to extract your own features.
  
We strongly encourage participants who wish to use additional data beyond what is provided (**unconstrained**) to also submit systems which use only the Fisher data provided (**constrained**); constrained and unconstrained systems will be scored separately. We will also note which systems did not use the fluent references for training.
  
**__DATA RESTRICTION NOTE__**: Data from the Fisher dev, dev2, and test splits and the Spanish Callhome dataset are //__**not**__// permitted for model training.
  
==== Important Dates ====
All IWSLT 2020 tasks will follow the same dates. Deadlines have been extended due to COVID-19 -- work from home, stay healthy and safe!
  
^Evaluation Campaign       ^
|January 2020: release of train and dev data    |
|March 17, 2020: release of test data   |
|<del>March 31, 2020</del> April 20, 2020: submissions due   |
|<del>April 6th, 2020</del> April 24, 2020: system description paper due  |
|<del>May 4th, 2020</del> May 11, 2020: review feedback  |
|May 18th, 2020: camera-ready paper due  |
  
==== Submission ====
We will provide test and challenge test input. We would like to see outputs for all test sets.
We expect submissions in plain text with one utterance per line, pre-formatted for scoring (//lowercased, detokenized output with all punctuation except apostrophes removed//).
  * Participants must specify if their systems translate from **speech**, or **text-only**
  * Participants must specify if their submission is **unconstrained** (uses additional data beyond what is provided) or **constrained** (uses only the Fisher data provided); constrained and unconstrained systems will be scored separately.
  * Participants should also note if they did **not** use the fluent references to train.
Submissions should be compressed in a single .tar.gz file and sent to [[elizabeth.salesky+iwslt2020@gmail.com|elizabeth.salesky+iwslt2020@gmail.com]].
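The expected scoring format (lowercased, with all punctuation except apostrophes removed) can be applied with a small script. This is a minimal sketch of one way to do it, not the official preprocessing; check the data package README for the exact procedure.

```python
# Sketch: normalize output lines toward the expected scoring format.
# Assumes already-detokenized text; official preprocessing may differ.
import re

def normalize(line: str) -> str:
    line = line.lower()
    # Drop every punctuation character except the apostrophe
    # (\w keeps letters/digits/underscore, \s keeps whitespace).
    line = re.sub(r"[^\w\s']", "", line)
    return " ".join(line.split())  # collapse runs of whitespace

print(normalize("Yes, I think that's a GOOD idea!"))
# -> yes i think that's a good idea
```

Since `\w` matches Unicode letters in Python 3, accented Spanish characters are preserved.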