Speech Translation Metrics track
Description
Speech translation has been a core focus of IWSLT for years, yet its evaluation remains underexplored. Most existing evaluations assume gold segmentation, an unrealistic scenario for real-world systems. When switching to automatically segmented speech, conventional text-to-text metrics become less reliable. Despite this, current evaluation practices still rely heavily on these metrics, highlighting the need for more robust and realistic assessment approaches.
This shared task focuses on Quality Estimation for speech translation, a reference-free evaluation of speech translation quality. Participants will assess the quality of translations produced in other IWSLT shared tasks, and system outputs will be evaluated based on their correlation with human judgments.
We consider two angles for speech translation quality estimation:
- Speech sample + system translation → score
- ASR transcript + system translation → score
To evaluate the submissions, we compute correlations with human judgments of quality. We encourage participation in both scenarios, as well as exploring other approaches. The translation segments occur within documents, which can provide additional context for quality estimation.
We look forward to your submissions!
Data
As an example input, consider the following audio:
and the corresponding testset entry:
{
  "src_wav": "sample.wav",
  "src_asr": "Plans are well underway for races to Mars and the Moon in 1992, by solar sail. The race to Mars is to commemorate Columbus's journey to the New World 500 years ago, and the one to the Moon is to promote the use of solar sails in space exploration.",
  "tgt": "Pläne sind gut im Wege für Rennen nach Mars und der Mond in 1992, mit Sonnensegel. Das Rennen zum Mars ist zu Kolumbus' Reise in die Neue Welt vor 500 Jahren zu gedenken, und der eine zum Mond ist den Gebrauch von Sonnensegeln in Weltraum Exploration zu fördern.",
  "score_human": 71.5
}
The goal is to predict score_human given src_wav, src_asr, and tgt.
The human score is not 100 because the style of the automatic translation is awkward in places.
Train and development data will be released on January 1.
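To make the input/output contract concrete, here is a minimal sketch of a scoring loop over such entries. The file name test.jsonl and the trivial length-ratio heuristic in predict_score are illustrative assumptions only, not part of the released data format or a required interface.

import json

def predict_score(src_wav: str, src_asr: str, tgt: str) -> float:
    # Trivial placeholder: penalise length mismatch between the ASR transcript
    # and the translation. A real submission would instead run a speech-based
    # or ASR-based QE model on the audio and/or transcript.
    ratio = min(len(src_asr), len(tgt)) / max(len(src_asr), len(tgt), 1)
    return 100.0 * ratio

with open("test.jsonl") as f:  # hypothetical file name and packaging
    for line in f:
        entry = json.loads(line)
        prediction = predict_score(entry["src_wav"], entry["src_asr"], entry["tgt"])
        print(entry["src_wav"], prediction)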
Baselines
We consider the following quality estimation baselines:
- ASR-based COMETKiwi-22 (see the sketch after this list)
- ASR-based COMET-partial
- SpeechQE
- More baselines to be announced
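As a concrete starting point for the ASR-based COMETKiwi-22 baseline, the sketch below scores the example above with the public Unbabel/wmt22-cometkiwi-da checkpoint via the unbabel-comet package. This shows standard usage of that model and may differ from the exact baseline configuration we will report.

# pip install unbabel-comet
# Note: the COMETKiwi-22 checkpoint is gated on the Hugging Face Hub and
# requires accepting its license and logging in before download.
from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))

# Reference-free QE: each item needs only the (ASR) source and the translation.
data = [{
    "src": "Plans are well underway for races to Mars and the Moon in 1992, by solar sail.",
    "mt": "Pläne sind gut im Wege für Rennen nach Mars und der Mond in 1992, mit Sonnensegel.",
}]

output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)        # per-segment scores, roughly in [0, 1]
print(output.system_score)  # corpus-level average

Note that a metric's raw scale need not match score_human: the evaluation is correlation-based, so what matters is the relative ordering of segments and systems.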
Submission
More details on the submission process and timeline will be released soon.
As part of the submission, we require a system description paper to be submitted to IWSLT for review.
Evaluation
Quality Estimation models will be evaluated by measuring their correlation with human judgments, similar to the WMT Metrics Shared Task. For each language pair, we compute:
- Kendall’s tau-b: a segment-level measure, akin to a Pearson correlation grouped by item. This measures the ability of metrics to select the best translation for a given source (see the sketch at the end of this section).
- Soft Pairwise Accuracy: system-level measure. This reveals how good the metric is at ranking the participating systems.
We will provide evaluation scripts to verify dev data performance on January 1.
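For intuition, here is a minimal sketch of the segment-level computation, assuming metric and human scores are grouped by source item and Kendall's tau-b is averaged across groups; the official scripts may differ in detail, and Soft Pairwise Accuracy (which additionally takes the statistical significance of system differences into account) is not shown.

from collections import defaultdict
from scipy.stats import kendalltau

def segment_kendall_tau_b(rows):
    # rows: (item_id, metric_score, human_score) triples, where item_id groups
    # the candidate translations of one source segment.
    groups = defaultdict(list)
    for item_id, metric_score, human_score in rows:
        groups[item_id].append((metric_score, human_score))

    taus = []
    for pairs in groups.values():
        if len(pairs) < 2:
            continue  # correlation is undefined for a single candidate
        metric_scores, human_scores = zip(*pairs)
        tau, _ = kendalltau(metric_scores, human_scores, variant="b")
        if tau == tau:  # skip NaN groups (e.g. all candidates tied)
            taus.append(tau)
    return sum(taus) / len(taus)

# Toy example: two source segments, three candidate translations each.
rows = [
    ("doc1:seg1", 0.71, 71.5), ("doc1:seg1", 0.83, 88.0), ("doc1:seg1", 0.40, 35.0),
    ("doc2:seg4", 0.55, 60.0), ("doc2:seg4", 0.90, 95.0), ("doc2:seg4", 0.62, 52.0),
]
print(segment_kendall_tau_b(rows))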
Organizers
- Maike Züfle, Karlsruhe Institute of Technology, maike.zuefle@kit.edu
- Vilém Zouhar, ETH Zurich, vzouhar@ethz.ch
- Brian Thompson
- Dominik Macháček, Charles University
- Matthias Sperber, Apple
- Marine Carpuat, University of Maryland
- HyoJung Han, University of Maryland
- Marco Turchi, Zoom
- Matteo Negri, FBK
Contact
Chair(s): Maike Züfle maike.zuefle@kit.edu; Vilém Zouhar vzouhar@ethz.ch
Discussion: iwslt-evaluation-campaign@googlegroups.com