Automatic structuring of audio documents

Starting: 01/10/2012
PhD Student: Abdessalam Bouchekif
Advisor(s): Yannick Estève
Co-advisor(s): Nathalie Camelin
Funding: Contrat Orange Labs

The topic structuring is an area that has attracted much attention in the Natural Language Processing community. Indeed, topic structuring is considered as the starting point of several applications such as information retrieval, summarization and topic modeling.

In this thesis, we proposed a generic topic structuring system i.e. that has the ability to deal with any TV Broadcast News.

Our system contains two steps: topic segmentation and title assignment. Topic segmentation consists in splitting the document into thematically homogeneous fragments. The latter are generally identified by anonymous labels and the last step has to assign a title to each segment.

Several original contributions are proposed like the use of a joint exploitation of the distribution of speakers and words (speech cohesion) and also the use of diachronic semantic relations. After the topic segmentation step, the generated segments are assigned a title corresponding to an article collected from Google News during the same day. Finally, we proposed the evaluation of two new metrics, the first is dedicated to the topic segmentation and the second to title assignment.

The experiments are carried out on three corpora. They consisted of 168 TV Broadcast News from 10 French channels automatically transcribed. Our corpus is characterized by his richness and diversity.