Corpus: Topic Segmentation (FrNewsLink)

URL: https://hal.archives-ouvertes.fr/hal-01741177

The FrNewsLink package addresses several application tasks in the domain of topic segmentation and titling. It is composed of a set of resources drawn from a varied corpus of French Broadcast News (BN) shows and press articles. Due to broadcasting rights, the package does not contain video or audio files.

The corpus offers both automatic transcriptions of BN shows and press articles from the web, all collected during the same period so that the broadcasts and the articles deal with the same news topics.

FrNewsLink’s resources are based on 86 BN shows recorded during the 7th week of 2014 (February 10-16) and 26 BN shows recorded on 26 and 27 January 2015. The first period is referred to as W07_14 and the second as W05_15.

These shows come from 8 different channels, with a total of 14 different programs since some channels offer several news shows during the day.

Thus, depending on the day, the following shows will be available:

– Arte: News show
– D8: Le JT
– Euronews
– France 2: News shows at 7h, 8h, 13h, 20h
– France 3: Le 12-13, Le 19-20
– M6: Le 12-45, Le 19-45
– NT1: Les infos
– TF1: News shows at 13h, 20h

During the same period, news articles appearing on the main page of Google News were saved to our database every hour. In total, 28 709 entries were recorded, with an average of 2.7k items per day. Several articles remain on the website for more than an hour and are therefore recorded several times in our database. Once duplicate items are discarded, 22 141 items remain.
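
The duplicate removal described above can be pictured with a minimal sketch. It assumes each hourly entry carries a URL that identifies the article; the `Entry` fields and the `deduplicate` function are hypothetical names chosen for illustration and do not reflect the actual format or tooling of the FrNewsLink package.

```python
from __future__ import annotations

from typing import Iterable, NamedTuple


class Entry(NamedTuple):
    """One hourly crawl record (hypothetical layout, not the package format)."""
    url: str         # assumed to identify an article uniquely
    title: str
    crawled_at: str  # timestamp of the hourly crawl


def deduplicate(entries: Iterable[Entry]) -> list[Entry]:
    """Keep only the first crawl of each article, preserving crawl order."""
    seen: set[str] = set()
    unique: list[Entry] = []
    for entry in entries:
        if entry.url not in seen:
            seen.add(entry.url)
            unique.append(entry)
    return unique
```

An article that stays on the main page for two consecutive crawls then contributes a single item, the one from its first appearance.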

Google News presents its articles in the form of “thematic clusters” with a main article highlighted for each of the clusters. Only these main articles, about 590 per day, were considered for topic titling.


FrNewsLink resources: [Download here]

README_EN.txt