Polysemic Embeddings for Industry (PolysEmY)

Date: 01/2020 - 07/2021
Funding: RFI Atlanstic 2020
Partners: SNCF (France)
URL: https://lium.univ-lemans.fr/polysemy

The lexical resources in SNCF’s technical documentation reflect the richness and specificity of the business vocabulary used within companies such as SNCF. This vocabulary is often rare in corpora, but according to experts it is of major importance for characterizing documents. Moreover, in the case of SNCF, this vocabulary contains acronyms, about 40% of which are polysemic: they do not always abbreviate the same group of words.

Through the study of this corpus, we have identified three major scientific challenges for the efficient automatic processing of this type of document using embeddings:

  1. How to learn good-quality embeddings for specific vocabulary that is sometimes uncommon?
  2. How to learn embeddings for specific AND polysemic acronyms?
  3. How to evaluate the embeddings learned?