=============================================================================

        FrNewsLink Package: French resources for evaluating topic
        segmentation and titling tasks

==============================================================================
  
This file gives an overview of the FrNewsLink package, which supports several tasks in the domain of topic segmentation and titling. It is composed of a set of resources built from a varied corpus of French Broadcast News (BN) shows and press articles. Due to broadcasting rights, this package does not contain video or audio files.

The corpus offers both automatic transcriptions of BNs and press articles from the web, all collected during the same period of time so that BN shows and articles deal with the same news topics.

FrNewsLink's resources are based on 86 BN shows recorded during the 7th week of 2014 (February 10-16) and 26 BN shows recorded on 26 and 27 January 2015. The first period is referred to as W07_14 and the second as W05_15.

These shows come from 8 different channels, with a total of 14 different programs since some channels offer several news shows during the day.

Thus, depending on the day, the following shows are available:
    - Arte: News show
    - D8: Le JT
    - Euronews
    - France 2: News shows at 7h, 8h, 13h, 20h
    - France 3: Le 12-13, Le 19-20
    - M6: Le 12-45, Le 19-45
    - NT1: Les infos
    - TF1: News shows at 13h, 20h

During the same periods, the web news articles appearing on the main page of Google News were downloaded into our database every hour. In total, 28 709 entries were recorded, with an average of 2.7k items per day. Several articles stay on the website for more than an hour and are therefore recorded several times in our database. Once duplicates are discarded, 22 141 items remain.
Google News presents its articles in the form of "thematic clusters" with a main article highlighted for each of the clusters. Only these main articles, about 590 per day, were considered for topic titling.
  
0. References and citation
==========================
This corpus was published at the LREC 2018 conference. More information and a set of statistics can be found in the article.

Please cite this paper if you use FrNewsLink in your work:

@inproceedings{CamelinLREC18,
  TITLE = {{FrNewsLink : a corpus linking TV Broadcast News Segments and Press Articles}},
  AUTHOR = {Camelin, Nathalie and Damnati, Géraldine and Bouchekif, Abdessalam and Landeau, Anais and Charlet, Delphine and Est{\`e}ve, Yannick},
  BOOKTITLE = {{LREC 2018}},
  ADDRESS = {Miyazaki, Japan},
  YEAR = {2018},
  MONTH = May,
  KEYWORDS = {content linking, semantic textual similarity, topic segmentation}
}


1. Package contents and version
===============================

The package is structured as follows:

<ROOT_DIR>
  |----README_EN.TXT            English documentation
  |----README_FR.TXT            French documentation
  |----RELEASE.TXT              version
  |----<W0X_XX>                 2 directories: W07_14 and W05_15, each containing the following data:
          |----<TV_W0X_XX>                     directory containing resources from BN shows
          |      |-----*.ctm                        automatic transcription 
          |      |-----*.stm                        automatic transcription organized according to automatic diarization
          |----<PRESS_W0X_XX>                       directory containing resources from web press articles
          |      |-----bdd_20XX_XX_XX_articles.csv      database 
          |      |----<20XX_XX_XX>                  one directory per day
          |      |      |-----*.txt                     text content of highlighted web articles
          |----<ANNOTATIONS>     directory containing manual annotations
          |      |-----*.trs        manual topic segmentation             
          |      |-----*.assgn      manual linking annotation between BN segments and web articles             


== Version
This package is version 00, available on 1/11/2017 on the LIUM website: https://lium.univ-lemans.fr/frnewslink/


2. File Formats
===============

 == ctm files, automatic transcriptions
Each line corresponds to one word hypothesis produced for a broadcast news show. Each word hypothesis comes with the following fields, separated by spaces:
document_name  channel (default value 1)  start_time  word_duration  word  confidence measure
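
As an illustration, a ctm line could be read in Python along the following lines. This is only a sketch: the names CtmWord, parse_ctm_line and read_ctm are hypothetical, and the UTF-8 encoding and the use of seconds for times are assumptions that this documentation does not state.

```python
from typing import Iterator, NamedTuple


class CtmWord(NamedTuple):
    document: str
    channel: str
    start: float       # start_time (assumed to be in seconds)
    duration: float    # word_duration (assumed to be in seconds)
    word: str
    confidence: float

    @property
    def end(self) -> float:
        """End time of the word, derived from start and duration."""
        return self.start + self.duration


def parse_ctm_line(line: str) -> CtmWord:
    """Parse one line made of the six space-separated ctm fields."""
    document, channel, start, duration, word, confidence = line.split()
    return CtmWord(document, channel, float(start), float(duration),
                   word, float(confidence))


def read_ctm(path: str, encoding: str = "utf-8") -> Iterator[CtmWord]:
    """Yield the word hypotheses of a ctm file, skipping blank lines.

    The default encoding is an assumption; adjust it if decoding fails.
    """
    with open(path, encoding=encoding) as handle:
        for line in handle:
            if line.strip():
                yield parse_ctm_line(line)
```

A line that does not contain exactly six fields raises a ValueError, which makes format surprises visible early.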
 
 == stm files, automatic transcriptions according to automatic diarization
Words of the ctm files are grouped into breath groups. Each breath group comes with the following fields, separated by tabs:
document_name 1 speaker start_time end_time <speaker_characteristics> list of word hypotheses
Note that speaker clustering was performed independently for each file. Thus, the S0 speaker of one file does not necessarily refer to the same person as the S0 speaker of another file.
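
A possible way to read an stm line in Python is sketched below. The names StmSegment and parse_stm_line are hypothetical. The sketch splits on any whitespace, so it accepts tabs as well as spaces, and it assumes that the speaker characteristics field contains no whitespace. Since speaker labels are local to a file, speakers are keyed by the (document, speaker) pair.

```python
from typing import List, NamedTuple, Tuple


class StmSegment(NamedTuple):
    document: str
    channel: str
    speaker: str
    start: float
    end: float
    characteristics: str
    words: List[str]

    @property
    def speaker_key(self) -> Tuple[str, str]:
        """Speaker labels are only meaningful within one document."""
        return (self.document, self.speaker)


def parse_stm_line(line: str) -> StmSegment:
    """Parse one breath group: six header fields, then the words."""
    # maxsplit=6 keeps the whole word list together in the last item.
    parts = line.strip().split(None, 6)
    if len(parts) < 6:
        raise ValueError("expected at least 6 fields: %r" % line)
    document, channel, speaker, start, end, characteristics = parts[:6]
    words = parts[6].split() if len(parts) == 7 else []
    return StmSegment(document, channel, speaker, float(start),
                      float(end), characteristics, words)
```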

 == csv files, web articles database
For each day of recorded broadcast news, a database is provided with information on every Google News article collected each hour of that day. Google News presents these articles in "thematic clusters" and highlights one article per cluster. We call this article the main one.
For each article, the following information is available:
identifier; id_category; id_site; id_main_article; date_time; url; title

 = identifier : used to locate the news article in the day directory, as identifier.txt
 Note that text files are provided for main articles only

 = id_category : identifier of the news category assigned by the Google News website

 = id_site : identifier of the press website in our database

 = id_main_article : equal to 0 if the article is the main one, otherwise equal to the identifier of the corresponding main article
 
 = date_time : date and time (DAY/MONTH/YEAR HOUR:MINUTE) at which the article information was stored in our database
 
 = url and title : url and title of the press article as shown on the Google News page.
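
The following Python sketch shows one way to read the rows of the database and to keep only main articles. The names Article and read_articles are hypothetical. The sketch assumes that neither the first six fields nor the url contain a semicolon (extra semicolons are kept as part of the title) and that the year is written with four digits; neither point is stated in this documentation.

```python
from datetime import datetime
from typing import Iterable, Iterator, NamedTuple


class Article(NamedTuple):
    identifier: str
    id_category: str
    id_site: str
    id_main_article: str
    date_time: str
    url: str
    title: str

    @property
    def is_main(self) -> bool:
        """A main article has id_main_article equal to 0."""
        return self.id_main_article == "0"

    @property
    def text_filename(self) -> str:
        """Name of the text file in the day directory."""
        return self.identifier + ".txt"

    @property
    def when(self) -> datetime:
        """Parse DAY/MONTH/YEAR HOUR:MINUTE (4-digit year assumed)."""
        return datetime.strptime(self.date_time, "%d/%m/%Y %H:%M")


def read_articles(lines: Iterable[str]) -> Iterator[Article]:
    """Yield one Article per non-blank line of the database."""
    for line in lines:
        if not line.strip():
            continue
        # maxsplit=6 keeps any ';' inside the title untouched.
        fields = [f.strip() for f in line.rstrip("\n").split(";", 6)]
        if len(fields) != 7:
            raise ValueError("expected 7 fields: %r" % line)
        yield Article(*fields)
```

With an open file handle, the main articles are then given by `[a for a in read_articles(handle) if a.is_main]`.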


 == txt files, the useful textual content of press articles
 In each identifier.txt file, the useful textual content extracted by Boilerpipe is provided along with metadata:
    TITRE: [...]    -- title
    DATE: [...]     
    URL: [...]
    PRINCIPAL: 0    -- main article 
    TEXT:
    [...]
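
The layout above can be split into metadata and body with a few lines of Python. This is a sketch only: parse_article_file is a hypothetical name, and it assumes that the TEXT: marker stands alone on its line and that everything after it is the article body.

```python
from typing import Dict


def parse_article_file(content: str) -> Dict[str, str]:
    """Return the header fields plus the body stored under 'TEXT'."""
    meta: Dict[str, str] = {}
    lines = content.splitlines()
    for index, line in enumerate(lines):
        stripped = line.strip()
        if stripped == "TEXT:":
            # Everything after the marker is the article body.
            meta["TEXT"] = "\n".join(lines[index + 1:]).strip()
            return meta
        # Split on the first ':' only, so urls and times stay intact.
        key, separator, value = stripped.partition(":")
        if separator:
            meta[key.strip()] = value.strip()
    meta.setdefault("TEXT", "")
    return meta
```

Header values are kept as strings, so fields such as DATE or PRINCIPAL can be interpreted by the caller as needed.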


 == trs files, manual topic segmentation
 These XML files can be opened with the Transcriber software.

 = <Section> : boundary of topic segment 

 = <Topics> : the list of topics is given at the beginning of the file. Note that topics are defined independently in each file: two topics from different files are distinct even if they have the same name.
 
 = <Turn> : brief manual description of the segment content
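
The topic segments can also be read programmatically. The sketch below uses Python's standard xml.etree.ElementTree module. The attribute names (startTime, endTime and topic on <Section>, id and desc on <Topic>) follow the usual Transcriber file format and are assumptions, since they are not listed in this documentation; topic_segments is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

Segment = Tuple[float, float, Optional[str], str]


def topic_segments(root: ET.Element) -> List[Segment]:
    """Return (start, end, topic name, description) for each <Section>."""
    # Topic ids are local to one file, so this map is rebuilt per file.
    topics = {t.get("id"): t.get("desc") for t in root.iter("Topic")}
    segments = []
    for section in root.iter("Section"):
        # Gather the text of the enclosed turns, collapsing whitespace.
        description = " ".join(" ".join(section.itertext()).split())
        segments.append((float(section.get("startTime")),
                         float(section.get("endTime")),
                         topics.get(section.get("topic")),
                         description))
    return segments
```

For a file on disk, the root element is obtained with `ET.parse(path).getroot()`.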


 == assgn files, linking annotation between BN segments and web articles
These files indicate the type of link between segments and web articles. For each BN file, three sets of press article identifiers are associated with each segment.
Each row contains the following information:
start_time:end_time; list of associated press article identifiers; list of possibly associated press article identifiers; list of off-topic press article identifiers.

For more information on the manual linking protocol, see [CamelinLREC18].
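
A row of an assgn file could be parsed as sketched below. parse_assgn_line is a hypothetical name. The sketch assumes that times are plain numbers (no additional ":" inside them) and that the identifiers inside each list are separated by whitespace or commas; the documentation does not specify the separator, so check it against the actual files.

```python
import re
from typing import Dict, List, Union


def _identifiers(field: str) -> List[str]:
    """Split a list field on whitespace or commas (assumed separators)."""
    return [token for token in re.split(r"[,\s]+", field) if token]


def parse_assgn_line(line: str) -> Dict[str, Union[float, List[str]]]:
    """Parse 'start:end; linked; possibly linked; off-topic'."""
    fields = [field.strip() for field in line.strip().split(";")]
    if len(fields) != 4:
        raise ValueError("expected 4 ';'-separated fields: %r" % line)
    start, end = fields[0].split(":")
    return {
        "start": float(start),
        "end": float(end),
        "linked": _identifiers(fields[1]),
        "possibly_linked": _identifiers(fields[2]),
        "off_topic": _identifiers(fields[3]),
    }
```

An empty list field simply yields an empty Python list.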


Contact point:
==============
Nathalie Camelin
Assistant Professor, Le Mans University, LIUM
https://lium.univ-lemans.fr
mail:nathalie.camelin@univ-lemans.fr
tel: +33 1 43 13 33 33
fax: +33 1 43 13 33 30