New paradigms for simultaneous speech translation leveraging paralinguistic and linguistic knowledge

Now that spoken language translation (SLT) systems are maturing thanks to recent technological advances, the SLT community has an opportunity to focus on more challenging scenarios and problems beyond core quality. Current approaches to simultaneous and offline speech translation often rely blindly on large quantities of heterogeneous training data to learn high-quality models that support users at inference time. This “the larger, the better” mindset obscures the need to incorporate specific, targeted knowledge to address particular aspects of the translation process.

This special session aims to draw the SLT community's attention to two types of information that are fundamental to boosting performance in speech translation applications:

  • Paralinguistic information: Non-verbal and non-linguistic aspects of speech are important facets of communication. For instance, human beings naturally communicate their underlying emotional states without explicitly describing them. The capability to leverage paralinguistic information (e.g. tone, emotion) in the source language speech has been lost in most current SLT approaches.
  • Linguistic information: Current speech translation models do not take advantage of specific source/target language knowledge and resources such as syntactic parsers, morphological analyzers, monolingual and bilingual glossaries, ontologies, knowledge bases, etc. This is particularly evident in the models' inability to correctly translate parts of the input that are rarely represented in the training data (e.g. named entities, terms) or are specific to some languages (e.g. idioms).

This special session will cover simultaneous and incremental ASR, MT, and TTS models, with particular emphasis on their requirements and uses in real-time application scenarios. By combining these themes, the session will create a positive environment that brings the wider speech and translation communities together to discuss innovative ideas, challenges, and opportunities for utilizing paralinguistic and linguistic knowledge within the scope of speech translation.

Call for papers

To enrich community discussion on all aspects of simultaneous speech translation, the special session will solicit contributions from sister communities that we may not otherwise hear from in a venue for speech and audio processing. We also welcome papers whose motivations, contributions, or implications highlight issues in this space that are not commonly addressed at Interspeech, for instance the use and creation of linguistic resources such as incremental parsers, interpretation corpora, paralinguistic annotations, and knowledge bases. Moreover, the special session encourages submissions that go beyond their technical or empirical contributions and also elaborate on how the work relates to the big picture of simultaneous speech translation and its application in production settings (e.g. efficiency, carbon footprint).

Papers for the special session should follow the same format, and be submitted by the same deadline, as regular papers: Wednesday, March 1, 2023, 23:59, Anywhere on Earth. More information about paper submission (including the submission link) is available here.

Organizers

  • Satoshi Nakamura, NAIST
  • Marco Turchi, Zoom Video Communications
  • Juan Pino, Meta
  • Marcello Federico, AWS AI Labs
  • Colin Cherry, Google
  • Alex Waibel, CMU/KIT
  • Elizabeth Salesky, Johns Hopkins University