Instruction Following track
📢 Announcement: Submissions are now open!
Description
Motivation
Large language models (LLMs) have demonstrated the capability to perform several NLP tasks without the need for dedicated models, offering a single solution for many applications. While initially limited to processing text, LLMs are now being enhanced with other modalities, such as vision and audio. In this scenario, the emerging paradigm of building a single architecture from speech foundation models (SFMs) and LLMs is gaining traction, as it combines the best of both worlds: the ability to process spoken language inputs and the ever-evolving language knowledge of LLMs. For this reason, and given the success of the first edition, we propose the second edition of the Instruction Following (IF) task, which is aimed at testing general-purpose models for the speech modality and better reflects current trends in the research community.
Tasks Description
Participants are asked to build a model capable of performing, depending on the track, the following tasks:
- SHORT TRACK (input: automatically segmented audio):
- Automatic Speech Recognition (ASR): the speech is transcribed into the same language;
- Speech-to-text Translation (S2TT): the speech is translated into the target language;
- Spoken Question Answering (SQA): textual questions have to be answered based on the spoken content, both in the same language as the speech and in a different language (questions and answers are always in the same language);
- [NEW THIS YEAR!] Surprisal: a task that is unknown at submission time but solvable through the in-context learning abilities of SpeechLLMs.
- LONG TRACK (input: long-form audio):
- All the short-form tasks, including the Surprisal;
- Speech-to-text Summarization (S2TSUM): a summary of the spoken content has to be produced, both in the same language as the speech and in a different language;
- [NEW THIS YEAR!] Audio Chaptering (ACHAP): the spoken content has to be segmented into coherent sections, each labeled with a concise title summarizing its topic.
All tasks listed for each track are mandatory.
Languages
- English for ASR, monolingual SQA, ACHAP, and S2TSUM;
- English -> German, Italian, Chinese for S2TT, multilingual SQA, ACHAP, and S2TSUM;
- English -> German, Chinese for the Surprisal task.
IMPORTANT! The results can be submitted for some or all language directions.
Data Conditions
We adopt two conditions. The first is constrained, where a pre-defined training setting is adopted, and only a specified pre-trained SFM and LLM architecture can be used. The second is unconstrained, with no limitation on pre-trained models and training data.
Constrained
Participants are allowed to use the SFM and LLM provided below and to train their systems on the following data.
- Pre-trained Models:
- Training Data:
- Validation Data:
- ASR/S2TT/SQA/S2TSUM: MCIF (including the IWSLT25 Instruction Following test set)
- ACHAP: YTSeg
We do not provide any training data for SQA, ACHAP, and S2TSUM in languages other than that of the source speech.
IMPORTANT! The use of the pre-trained SFM and/or LLM is NOT mandatory, and we also accept submissions with models trained from scratch on the allowed data, as well as solutions using only one of the two pre-trained models (either the SFM or the LLM).
Unconstrained
Any model, any data.
Evaluation
We release the video, the source audio, and the instructions, and participants submit their outputs. The instructions can be modified by participants to match their system’s prompts. For cross-lingual tasks, the output language should be the one used in the prompt. For instance, in SQA, questions are provided both in the same language as the speech (English) and in different languages (German, Italian, Chinese), but they always have to be answered in the language of the question (e.g., an Italian question should be answered in Italian). Questions can also be non-answerable; in this case, only the answer “Not answerable.” (and the corresponding Italian “Non è possibile rispondere.”, German “Nicht zu beantworten.”, and Chinese “无法回答。” translations) will be considered correct.
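Since only the exact strings above are accepted for non-answerable questions, it can be useful to normalize or check model answers against them before submitting. A minimal sketch using the strings from the task description (the helper function and language codes are our own convention, not part of the official scoring):

```python
# Canonical "not answerable" replies per question language, as specified by the task.
NOT_ANSWERABLE = {
    "en": "Not answerable.",
    "it": "Non è possibile rispondere.",
    "de": "Nicht zu beantworten.",
    "zh": "无法回答。",
}

def is_not_answerable(answer: str, lang: str) -> bool:
    """True if the answer is exactly the canonical 'not answerable' reply for lang."""
    return answer.strip() == NOT_ANSWERABLE[lang]
```

This only checks literal string equality, mirroring the requirement that only the exact translations are considered correct.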
The Long Track will process audio files in WAV format that are, on average, 6 minutes long. The Short Track will handle the same audio files, but they will be automatically segmented into 15–20 second audio segments, on average, using SHAS.
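To sanity-check that inputs fall in the expected range (roughly 6 minutes for the Long Track, 15–20 seconds per segment for the Short Track), a WAV file's duration can be read with Python's standard `wave` module (the helper name and any file paths are our own, for illustration):

```python
import wave

def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file in seconds (number of frames / sample rate)."""
    with wave.open(path, "rb") as f:
        return f.getnframes() / f.getframerate()
```

For example, `wav_duration_seconds("segment_0001.wav")` could be compared against the 15–20 second range to spot mis-segmented inputs before running inference.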
An example of the input format for the Long Track is downloadable here. Participants are also allowed to use it as a 1-shot example for their model.
The expected output format is the same as MCIF’s; please see the GitHub repository. An example of the output format (including ACHAP) for the Long Track is downloadable here.
We also provide useful scripts for parsing inputs and outputs, downloadable here.
Evaluation is conducted using the MCIF GitHub repository for all tasks.
Submission
The submission will be performed using the Meetween SPEECHM Evaluation Server.
General Guidelines
- Create a single instance for each model, report the required information, including the training condition (constrained/unconstrained), and upload the outputs for one or more target languages (English -> English, German, Italian, Chinese). Participants must provide the information required in the “Description” field; if no information is provided, the most permissive condition will be assumed (e.g., unconstrained over constrained, or the use of all available training data over using CoVoST2 and/or GigaST only).
- Multiple submissions are allowed, but participants must explicitly indicate one PRIMARY run for each track. All other submitted runs are treated as CONTRASTIVE runs. If none of the runs is marked as PRIMARY, the latest submission (according to the file timestamp) for the respective track will be used as the PRIMARY run.
- Outputs should follow the XML format specified in the Evaluation section.
- If any issues are identified, the submitted outputs can be deleted or replaced with newer ones.
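Because a malformed output file cannot be scored, it is worth checking each XML file for well-formedness before uploading. A minimal check with Python's standard library (this only catches syntax errors; the actual schema is defined by MCIF, so a well-formed file may still be structurally wrong):

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Check that a string parses as well-formed XML (does NOT validate the MCIF schema)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

Running this over every output file before submission is a cheap way to avoid having a run rejected for a stray unescaped character or unclosed tag.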
Submission Steps
Available after the Evaluation period start date.
Once logged in to SPEECHM Evaluation Server, proceed through the following steps.
STEP 1: Download Test Data
- Click on `Test sets` (at the top of the page) and select either `IFLONG26` or `IFSHORT26` depending on the track (long and short, respectively). Alternatively, directly access the IF task page.
- Download the test set (containing audios and XMLs with instructions) of the selected target language.
STEP 2: Create a New Model
- Click on `My submissions` (at the top of the page) and on `New model` (button at the top right).
- Create a new model:
  - Insert the `Name` using the standardized format:
    `${TEAM}_IWSLT26_IF_${TRACK}_${CONDITION}_${SUBMISSION_TYPE}`
    Where:
    - `${TEAM}` → Short name of your team (e.g., KIT)
    - `${TRACK}` → Choose from [SHORT, LONG]
    - `${CONDITION}` → Choose from [constrained, unconstrained]
    - `${SUBMISSION_TYPE}` → Choose from [primary, contrastive]
    Example Model Names:
    - `KIT_IWSLT26_IF_SHORT_constrained_primary`
    - `KIT_IWSLT26_IF_SHORT_constrained_contrastive1`
    - `KIT_IWSLT26_IF_SHORT_constrained_contrastive2`
  - Insert the `Description` by including:
    ```
    - Data conditions: constrained/unconstrained
    - [if constrained] Training data: with CoVoST2/with GigaST/with CoVoST2 and GigaST/other combination (specify)
    - Model architecture: cascade/direct
    - Any other relevant features characterizing your approach:
    ```
  - Select `Task ids` by checking `Instruction Following (IF)`
  - Check `Consent` (optional) to freely release your submitted system output data, including for human evaluation purposes
  - Click on `Create Model` (button at the bottom right)
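The naming convention above can also be enforced programmatically. A small helper that assembles a model name under the format given in the guidelines (the function itself is our own sketch, not part of the evaluation server):

```python
TRACKS = {"SHORT", "LONG"}
CONDITIONS = {"constrained", "unconstrained"}

def model_name(team: str, track: str, condition: str, submission_type: str) -> str:
    """Build a name in the ${TEAM}_IWSLT26_IF_${TRACK}_${CONDITION}_${SUBMISSION_TYPE} format."""
    if track not in TRACKS:
        raise ValueError(f"track must be one of {sorted(TRACKS)}")
    if condition not in CONDITIONS:
        raise ValueError(f"condition must be one of {sorted(CONDITIONS)}")
    # Contrastive runs may be numbered, e.g. contrastive1, contrastive2.
    if not (submission_type == "primary" or submission_type.startswith("contrastive")):
        raise ValueError("submission_type must be 'primary' or 'contrastiveN'")
    return f"{team}_IWSLT26_IF_{track}_{condition}_{submission_type}"
```

For example, `model_name("KIT", "SHORT", "constrained", "primary")` yields the first example name listed above.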
STEP 3: Submit the Outputs
- Click on `My submissions`
- Click on the model created in STEP 2 (e.g., `KIT_IWSLT26_IF_SHORT_constrained_primary`)
- Click on `IF Hypotheses`, next to `Model info`
- Upload your XML file by clicking on the `Upload hypothesis` button corresponding to the language pair(s) you want to participate in
Manage Your Model
Download or Delete the Hypothesis
- Click on `My submissions`
- Click on the model created in STEP 2 (e.g., `KIT_IWSLT26_IF_SHORT_constrained_primary`)
- Click on `IF Hypotheses`, next to `Model info`
- Use the three-dot menu on the right to either `Download` or `Delete` the submitted hypothesis
Rename or Delete a Model
You can, at any time, change the name and description of your model by clicking on its name under the My submissions panel. If you want to delete a model (i.e., not replacing or modifying it, but completely removing it from participating models), please contact the task’s organizers.
Organizers
- Sara Papi, Fondazione Bruno Kessler
- Luisa Bentivogli, Fondazione Bruno Kessler
- Marco Gaido, Fondazione Bruno Kessler
- Danni Liu, Karlsruhe Institute of Technology
- Fabian Retkowski, Karlsruhe Institute of Technology
- Beatrice Savoldi, Fondazione Bruno Kessler
- Maike Züfle, Karlsruhe Institute of Technology
Contact
Chair(s): Sara Papi sara95papi@gmail.com;
Discussion: iwslt-evaluation-campaign@googlegroups.com