Host laboratory : LIUM, LST team
Industrial partner : SNCF Innovation Recherche
Supervisor : Nicolas Dugué (LIUM)
Co-supervisors : Nathalie Camelin (LIUM), Luce Lefeuvre (SNCF)
Duration : one-year contract. Starting date : as soon as possible
LIUM is currently completing a collaborative project with SNCF's Innovation and Research Department that aims to structure a corpus of documents into themes. Working with the lexical resources provided by SNCF, LIUM became aware of how rich and specific the business vocabulary used within companies such as SNCF can be. This vocabulary is sometimes rare in corpora, yet according to domain experts it is very important for characterizing documents. In addition, it contains acronyms, about 40% of which do not always abbreviate the same group of words.
The corpus of this project has allowed us to identify three major scientific challenges for the efficient automatic processing of this type of document using lexical embeddings:
How can good-quality embeddings be learned for domain-specific vocabulary that is sometimes rare?
How can embeddings be learned for acronyms that are both domain-specific and polysemous?
How should the learned embeddings be evaluated?
The candidate must be highly proficient in French, since the documents and vocabulary are in French and the results will have to be analyzed in that context.