==============================================================================
FrNewsLink Package: French resources for evaluating topic segmentation
and titling tasks
==============================================================================

This file gives an overview of the FrNewsLink package, which makes it possible
to address several application tasks in the domain of topic segmentation and
titling. It is composed of a set of resources from a varied corpus of French
Broadcast News (BN) and press articles. Due to broadcasting rights, this
package does not contain video or audio files.

The corpus offers both automatic transcriptions of BN shows and press articles
from the web, all collected during the same period of time so that BN shows
and articles deal with the same news topics.

FrNewsLink's resources are based on 86 BN shows recorded during the 7th week
of 2014 (February 10-16) and 26 BN shows recorded on 26 and 27 January 2015.
The first period is referred to as W07_14 and the second as W05_15.

These shows come from 8 different channels, with a total of 14 different
programs, since some channels offer several news shows during the day.
Depending on the day, the following shows are available:
- Arte: News show
- D8: Le JT
- Euronews
- France 2, News show from: 7h, 8h, 13h, 20h
- France 3: Le 12-13, Le 19-20
- M6: Le 12-45, Le 19-45
- NT1: Les infos
- TF1, News show from: 13h, 20h

During the same periods, news web articles appearing on the main page of
Google News were downloaded into our database every hour. In total, 28 709
entries were registered, with an average of 2.7k items per day. Several
articles stay on the website for more than an hour and are therefore
registered several times in our database. Once duplicate items are discarded,
22 141 items remain.
Google News presents its articles in the form of "thematic clusters", with a
main article highlighted for each of the clusters.
Only these main articles, about 590 per day, were considered for topic
titling.


0. References and citation
==========================

This corpus was published at the LREC 2018 conference. More information and a
set of statistics can be found in the article.
Please cite this paper if your work uses FrNewsLink:

@inproceedings{CamelinLREC18,
  TITLE = {{FrNewsLink : a corpus linking TV Broadcast News Segments and Press Articles}},
  AUTHOR = {Camelin, Nathalie and Damnati, Géraldine and Bouchekif, Abdessalam and Landeau, Anais and Charlet, Delphine and Est{\`e}ve, Yannick},
  BOOKTITLE = {{LREC 2018}},
  ADDRESS = {Miyazaki, Japan},
  YEAR = {2018},
  MONTH = May,
  KEYWORDS = {content linking, semantic textual similarity, topic segmentation}
}


1. Package contents and version
===============================

The package is structured as follows:

|----README_EN.TXT    English documentation
|----README_FR.TXT    French documentation
|----RELEASE.TXT      version
|---- 2 directories: W07_14 and W05_15, each containing the following data:
      |---- directory containing resources from BN shows
      |      |-----*.ctm    automatic transcription
      |      |-----*.stm    automatic transcription organized according to automatic diarization
      |---- directory containing resources from web press articles
      |      |-----bdd_20XX_XX_XX_articles.csv    database
      |      |----<20XX_XX_XX>    one directory per day
      |      |      |-----*.txt    text content of highlighted web articles
      |---- directory containing manual annotations
      |      |-----*.trs      manual topic segmentation
      |      |-----*.assgn    manual linking annotation between BN segments and web articles

== Version
This package is version 00, available on 1/11/2017 on the LIUM website:
https://lium.univ-lemans.fr/frnewslink/


2. File formats
===============

== ctm files, automatic transcriptions
Each line corresponds to the hypothesis of a word uttered in a broadcast news
show.
For each word hypothesis, a set of space-separated fields gives the following
information:
document_name channel (default value 1) start_time word_duration word confidence_measure

== stm files, automatic transcriptions according to automatic diarization
Words of the ctm files are grouped into breath groups.
For each breath group, a set of tab-separated fields gives the following
information:
document_name 1 speaker start_time end_time list_of_word_hypotheses
Note that speaker clustering was done independently for each file. Thus, the
S0 speaker of file 1 is not the same person as the S0 speaker of file 2.

== csv files, web articles database
For each day of BN recordings, a database is provided that contains
information on every Google News article collected, hour by hour, on that
same day.
Google News presents these articles by "thematic clusters" and highlights one
article per cluster. We call this article the main one.
For each article, the following information is available:
identifier; id_category; id_site; id_main_article; date_time; url; title
= identifier: locates the article text in the day directory, as
  identifier.txt. Note that text files are available for main articles only.
= id_category: identifier of the news category proposed by the Google News
  website
= id_site: identifier of the press website in our database
= id_main_article: equal to 0 if the article is the main one, otherwise equal
  to the identifier of the corresponding main article
= date_time: date (DAY/MONTH/YEAR HOUR:MINUTE) at which the article
  information was stored in our database
= url and title: url and title of the press article as shown on the Google
  News page.
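
The three line-oriented formats described above (ctm, stm, csv) can be read
with a few lines of Python. The following is only a minimal sketch, not part
of the package: all function and class names are hypothetical, and it assumes
that ctm fields are separated by whitespace, stm fields by tabs, and csv
fields by ";" in the order listed above, with no quoting. Whether the csv
files start with a header row, and which text encoding the files use, is not
stated here and should be checked on the actual data.

```python
from typing import Dict, List, NamedTuple

# Field order as listed in this README for the csv database.
CSV_FIELDS = ["identifier", "id_category", "id_site", "id_main_article",
              "date_time", "url", "title"]


class CtmWord(NamedTuple):
    document_name: str
    channel: str
    start_time: float
    duration: float
    word: str
    confidence: float


class StmGroup(NamedTuple):
    document_name: str
    channel: str
    speaker: str
    start_time: float
    end_time: float
    words: List[str]


def parse_ctm_line(line: str) -> CtmWord:
    """Parse one ctm line (assumed: exactly 6 whitespace-separated fields)."""
    doc, channel, start, dur, word, conf = line.split()
    return CtmWord(doc, channel, float(start), float(dur), word, float(conf))


def parse_stm_line(line: str) -> StmGroup:
    """Parse one stm line (assumed: 6 tab-separated fields, words last)."""
    doc, channel, speaker, start, end, text = line.rstrip("\n").split("\t", 5)
    return StmGroup(doc, channel, speaker, float(start), float(end),
                    text.split())


def parse_article_row(row: str) -> Dict[str, str]:
    """Parse one csv row; a ';' inside the title stays in the last field."""
    values = [v.strip()
              for v in row.rstrip("\n").split(";", len(CSV_FIELDS) - 1)]
    return dict(zip(CSV_FIELDS, values))


def is_main_article(article: Dict[str, str]) -> bool:
    """id_main_article equals 0 for the highlighted article of a cluster."""
    return article["id_main_article"] == "0"


if __name__ == "__main__":
    # Made-up sample lines, for illustration only (not taken from the corpus).
    print(parse_ctm_line("show_a 1 12.34 0.20 bonjour 0.97"))
    print(parse_stm_line("show_a\t1\tS0\t12.34\t15.00\tbonjour à tous"))
    row = "42; 3; 7; 0; 10/02/2014 08:00; http://example.org/a; Un titre"
    print(is_main_article(parse_article_row(row)))
```

Lines that do not match the assumed number of fields raise a ValueError in
the ctm and stm parsers, which makes format mismatches visible early.
Filtering the database with is_main_article gives the identifiers for which
an identifier.txt file should exist in the day directory.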

== txt files, the useful textual content of press articles
Each identifier.txt file contains the useful textual content extracted by
Boilerpipe, along with the following metadata:
TITRE: [...]     -- title
DATE: [...]
URL: [...]
PRINCIPAL: 0     -- main article
TEXT: [...]

== trs files, manual topic segmentation
These xml files can be opened with the Transcriber software.
= : boundary of a topic segment
= : the list of topics is given at the beginning of the file. Note that
  topics are defined per file: a topic in one file is distinct from a topic
  in another file, even if both carry the same name.
= : brief manual description of the segment content

== assgn files, linking annotation between BN segments and web articles
These files indicate the type of link between segments and web articles.
For each BN file, three sets of press article identifiers are associated with
each segment.
Each row contains the following information:
start_time:end_time; list of associated press article identifiers; list of possibly associated press article identifiers; list of out-of-topic press article identifiers
For more information on the manual linking protocol, see [CamelinLREC18].


Contact point:
==============
Nathalie Camelin
Assistant Professor, Le Mans University, LIUM
https://lium.univ-lemans.fr
mail: nathalie.camelin@univ-lemans.fr
tel: +33 1 43 13 33 33
fax: +33 1 43 13 33 30