PhD defence, Salima Mdhaffar
Location: Université d’Avignon, videoconference
Title: Speech Recognition in the context of lectures: Evaluation, Progress and Enrichment
– Prof. Georges Linarès (Professeur, Université d’Avignon)
– Dr. Irina Illina (Maître de conférences HDR, Université de Nancy)
– Prof. Sylvain Meignier (Professeur, Le Mans Université)
– Dr. Olivier Galibert (Ingénieur de recherche, Laboratoire National de Métrologie et d’Essais)
– Dr. Camille Guinaudeau (Maître de conférences, Université de Paris Saclay)
– Prof. Yannick Estève (Professeur, Université d’Avignon)
– Dr. Antoine Laurent (Maître de conférences, Le Mans Université)
– Dr. Nicolas Hernandez (Maître de conférences, Université de Nantes)
– Dr. Solen Quiniou (Maître de conférences, Université de Nantes)
This thesis is part of a study exploring the potential of automatic transcription for the instrumentation of educational situations. Our contribution covers several axes. First, we describe the enrichment and annotation of the COCo dataset, which we produced as part of the ANR PASTEL project. This corpus consists of videos of different lectures, each related to a particular field (natural language, graphs, functions …). In this multi-thematic framework, we are interested in the problem of the linguistic adaptation of automatic speech recognition (ASR) systems. The proposed language model adaptation relies both on the presentation materials provided by the teacher and on in-domain data collected automatically from the web. We then focus on the problem of ASR evaluation. Existing metrics do not allow a precise evaluation of transcription quality, so we propose two evaluation protocols. The first is an intrinsic evaluation that estimates performance only on the domain words of each lecture (IWER_Average). The second is an extrinsic evaluation that estimates performance on two tasks that exploit the transcription: information retrieval and indexability.
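To illustrate why an evaluation restricted to domain words can differ from a global score, the sketch below compares the global word error rate with an error rate computed only on the reference occurrences of domain words. This is not the definition of IWER_Average, which this abstract does not give; the function names, the way errors are counted (substitutions and deletions of domain words) and the toy sentences are all assumptions made for illustration.

```python
def align_reference(ref, hyp):
    """Levenshtein-align hyp to ref; return one label per reference word
    ("ok", "sub" or "del") and the total edit distance."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost,  # match or substitution
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    # Backtrace to label every reference word.
    labels = [None] * n
    i, j = n, m
    while i > 0:
        cost = 0 if j > 0 and ref[i - 1] == hyp[j - 1] else 1
        if j > 0 and d[i][j] == d[i - 1][j - 1] + cost:
            labels[i - 1] = "ok" if cost == 0 else "sub"
            i, j = i - 1, j - 1
        elif d[i][j] == d[i - 1][j] + 1:
            labels[i - 1] = "del"
            i -= 1
        else:
            j -= 1  # inserted hypothesis word, no reference word to label
    return labels, d[n][m]


def global_wer(ref, hyp):
    """Standard WER: edit distance divided by reference length."""
    _, distance = align_reference(ref, hyp)
    return distance / len(ref)


def domain_word_error_rate(ref, hyp, domain_words):
    """Hypothetical measure: share of reference domain-word occurrences
    that are substituted or deleted in the hypothesis."""
    labels, _ = align_reference(ref, hyp)
    domain_labels = [lab for w, lab in zip(ref, labels) if w in domain_words]
    if not domain_labels:
        return 0.0
    return sum(lab != "ok" for lab in domain_labels) / len(domain_labels)


# Toy example (invented): one error, and it falls on a domain word.
ref = "the graph has a node and an edge".split()
hyp = "the graph has a note and an edge".split()
domain = {"graph", "node", "edge"}
wer = global_wer(ref, hyp)
dwer = domain_word_error_rate(ref, hyp, domain)
```

In this toy case a single misrecognised word weighs little in the global score but much more once only domain words are counted, which is the kind of effect a domain-focused measure is meant to expose.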
Our experimental results show that the global word error rate (WER) masks the gain provided by language model (LM) adaptation. To better evaluate this gain, it therefore seems particularly relevant to use specific measures such as those presented in this thesis. As LM adaptation relies on data collected from the web, we study the reproducibility of its results by comparing the performance obtained over a long period of time. Over a collection period of one year, we show that, although the data on the web changed in part from one month to the next, the performance of the adapted transcription systems remained constant (i.e. no significant performance changes), whatever the period considered. Finally, we are interested in the thematic segmentation of ASR output and in the alignment of slides with oral lectures. For thematic segmentation, integrating slide-change information into the TextTiling algorithm provides a significant gain in F-measure. For the alignment of slides with oral lectures, we compute the cosine similarity between the TF-IDF representations of the transcription segments and of the slide text, and we impose a constraint that respects the sequential order of the slides and transcription segments. We also consider a confidence measure to discuss the reliability of the proposed approach.
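The slide alignment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the TF-IDF weighting (smoothed IDF on raw counts), the use of dynamic programming to enforce the sequential-order constraint, and all function names and toy data are choices made here. The confidence measure is not modelled.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse vector (dict) per doc,
    using raw term counts and a smoothed IDF (an assumed weighting)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return [
        {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
         for t, c in Counter(doc).items()}
        for doc in docs
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def align_segments_to_slides(segments, slides):
    """Assign one slide index to each transcription segment so that the
    indices never decrease and the summed cosine similarity is maximal."""
    if not segments or not slides:
        return []
    vecs = tfidf_vectors(segments + slides)
    seg_v, slide_v = vecs[:len(segments)], vecs[len(segments):]
    sim = [[cosine(s, k) for k in slide_v] for s in seg_v]
    n, m = len(segments), len(slides)
    # best[i][j]: best total score for segments 0..i with segment i on slide j
    best = [[0.0] * m for _ in range(n)]
    back = [[0] * m for _ in range(n)]
    best[0] = list(sim[0])
    for i in range(1, n):
        arg = 0  # index of the best previous slide among 0..j
        for j in range(m):
            if best[i - 1][j] > best[i - 1][arg]:
                arg = j
            best[i][j] = best[i - 1][arg] + sim[i][j]
            back[i][j] = arg
    # Backtrace from the best final slide.
    j = max(range(m), key=lambda k: best[n - 1][k])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    return path[::-1]


# Toy data (invented): three slides and four transcription segments.
slides = [["graph", "node", "edge"],
          ["function", "domain", "range"],
          ["language", "grammar", "syntax"]]
segments = [["a", "graph", "has", "node"],
            ["edge", "connects", "node"],
            ["function", "maps", "domain"],
            ["grammar", "of", "language"]]
alignment = align_segments_to_slides(segments, slides)

# Segments that revisit earlier vocabulary still get a non-decreasing path.
shuffled = [["function", "domain"], ["graph", "node"], ["function", "range"]]
ordered = align_segments_to_slides(shuffled, slides)
```

The order constraint means a segment that briefly returns to an earlier topic cannot pull the alignment backwards; whether that is desirable depends on how strictly the lecturer follows the slides.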
Keywords: language model, transcription, evaluation, adaptation, automatic structuration