Polysemic embeddings

Host laboratory: LIUM, LST team
Industry partner: SNCF Innovation Recherche
Tutor: Nicolas Dugué (LIUM)
Co-tutors: Nathalie Camelin (LIUM), Luce Lefeuvre (SNCF)
Duration: one-year contract
Starting date: as soon as possible
French is mandatory.
LIUM is currently completing a collaboration project with SNCF's Innovation and Research Department to structure a corpus of documents into themes. Through the lexical resources provided by SNCF, LIUM became aware of the richness and specificity of the business vocabulary used within companies such as SNCF. This vocabulary is sometimes rare in corpora, but according to experts it is very important for characterizing documents. In addition, it contains acronyms, about 40% of which are polysemous: the same acronym abbreviates different groups of words depending on context. The corpus of this project has allowed us to highlight three major scientific challenges for the efficient automatic processing of this type of document using lexical embeddings: (1) How can good-quality embeddings be learned for specific vocabulary that is sometimes rare? (2) How can embeddings be learned for acronyms that are both specific and polysemous? (3) How can the learned embeddings be evaluated? For more details, please refer to the French version.