This paper describes the development of a multilingual and multigenre manually annotated speech dataset, freely available to the research community as ground truth for the evaluation of automatic transcription systems and spoken language translation systems. The dataset includes two video genres (television broadcast news and talk shows) and covers Flemish, English, German, and Italian, for a total of about 35 h of television speech. Besides segmentation and orthographic transcription, we added rich annotation of the audio signal, both at the linguistic level (e.g. filled pauses, pronunciation errors, disfluencies, speech in a foreign language) and at the acoustic level (e.g. background noise and different types of non-speech events). Furthermore, a subset of the transcriptions is translated in four directions, namely Flemish to English, German to English, German to Italian, and English to Italian. The development of this dataset was organized in several phases, relying on expert transcribers as well as involving non-expert contributors through crowdsourcing. We first conducted a feasibility study to test and compare two methods for crowdsourcing speech transcription on broadcast news data. These methods are based on different transcription processes (i.e. parallel vs. iterative) and incorporate two different quality control mechanisms. With both methods, we achieved near-expert transcription quality, measured in terms of word error rate, for English, German, and Italian data. For Flemish data, however, we were unable to get a sufficient response from the crowd to complete the offered transcription tasks. These results show that the viability of methods for crowdsourcing speech transcription depends significantly on the target language.
This paper provides a detailed comparison of the results obtained with the two crowdsourcing methods tested, describes the main characteristics of the final ground-truth resource, and presents the methodology adopted and the guidelines prepared for its development.
Keywords: Multilingual and multigenre corpus, Crowdsourcing, Speech transcription, Automatic speech recognition