Sahar Ghannay – Laboratoire d'Informatique de l'Université du Mans

This thesis presents a study of continuous word representations applied to the automatic detection of speech recognition errors. Recent advances in speech processing have led to significant improvements in automatic speech recognition (ASR) performance. However, recognition errors are still unavoidable. This reflects the sensitivity of ASR systems to variability, e.g. in acoustic conditions, speaker, and language style. Our study focuses on a neural approach to improving ASR error detection using word embeddings. These representations have proven to be a great asset in various natural language processing (NLP) tasks.

The exploitation of continuous word representations is motivated by the fact that ASR error detection consists in locating possible linguistic or acoustic incongruities in automatic transcriptions. The aim is therefore to find an appropriate word representation, one that captures the information needed to detect these anomalies. This thesis makes several contributions. First, we start with a preliminary study in which we propose a neural architecture able to integrate different types of features, including word embeddings.

Second, we propose an in-depth study of continuous word representations. On the one hand, this study focuses on the evaluation of different types of linguistic word embeddings and on their combination, in order to take advantage of their complementarity. On the other hand, it focuses on acoustic embeddings. The proposed approach relies on a convolutional neural network to build acoustic signal embeddings and a deep neural network to build acoustic word embeddings. In addition, we propose two approaches to evaluate the performance of acoustic word embeddings. We also propose to enrich the word representation at the input of the ASR error detection system with prosodic features, in addition to linguistic and acoustic embeddings. Integrating this information into our neural architecture provides a significant reduction in classification error rate compared to a state-of-the-art approach based on conditional random fields (CRF).
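As a rough illustration of the two ideas above, the sketch below concatenates per-word linguistic, acoustic and prosodic feature vectors into a single input vector and computes a classification error rate over word-level labels. It is a minimal sketch and not the architecture used in the thesis: the function names, the vector sizes, the label strings and the use of plain concatenation are all assumptions made for the example.

```python
from typing import List, Sequence


def build_word_input(
    linguistic: Sequence[float],
    acoustic: Sequence[float],
    prosodic: Sequence[float],
) -> List[float]:
    """Concatenate the per-word feature vectors into one input vector.

    Plain concatenation is an assumption made for this sketch.
    """
    return list(linguistic) + list(acoustic) + list(prosodic)


def classification_error_rate(
    predicted: Sequence[str], reference: Sequence[str]
) -> float:
    """Fraction of words whose predicted label differs from the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one label is required")
    errors = sum(p != r for p, r in zip(predicted, reference))
    return errors / len(reference)


# Toy values, hypothetical and far smaller than real embeddings.
x = build_word_input([0.1, -0.3], [0.7], [0.05, 0.2])

# Hypothetical word-level labels: was each recognized word correct or an error?
predicted = ["correct", "error", "correct", "correct"]
reference = ["correct", "error", "error", "correct"]
cer = classification_error_rate(predicted, reference)
```

Here one of the four words is mislabelled, so the error rate is 1/4. A relative reduction between two systems would then be computed from their respective error rates on the same data.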

Then, we present an analysis of classification errors, with the aim of identifying the errors that are difficult to detect. We also propose perspectives for improving the performance of our system by modelling errors at the sentence level. Finally, we exploit the linguistic and acoustic embeddings, as well as the information provided by our ASR error detection system, in several downstream applications.

Event Understanding through Multimodal Social Stream Interpretation