Seminar by Loïc Grobol, Assistant Professor at Université Paris Nanterre

 

Date: 29/04/2022
Time: 11:00
Location: IC2 Boardroom and online
Speaker: Loïc Grobol
 

Trees, knights and puppets: transfer learning for the processing of historical languages

 

In recent years, natural language processing (NLP) has evolved extremely rapidly, with NLP systems reaching record performance on many tasks and domains. These advances are largely due to deep learning techniques, the most recent and influential of which rely on semi-supervised pre-training on large amounts of unannotated data, followed by targeted fine-tuning on the tasks of interest (Peters et al., 2018; Howard and Ruder, 2018; Devlin et al., 2019). The main strength of these techniques is that they can exploit the massive data produced by the now ubiquitous digitization of language in all its forms. For many applications, however, the existence of such data is far from guaranteed, whether for low-resource languages (Hedderich et al., 2021) or for under-documented domains (Ramponi and Plank, 2020).
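The papers cited above share the same two-step recipe: start from a model pre-trained on raw text, then adapt it to a specific task with a small amount of labeled data. As a minimal, purely illustrative sketch of that recipe (using the Hugging Face transformers library and a placeholder classification task, neither of which is claimed to be the tooling used in the work presented here):

```python
# Minimal sketch of the pre-train / fine-tune recipe: model name, task and data
# below are illustrative placeholders, not the setup used in the talk.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# 1. Start from a model pre-trained on large amounts of unannotated text.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2
)

# 2. Fine-tune it on a (small) annotated dataset for the target task.
train_texts = ["an example sentence", "another example sentence"]
train_labels = torch.tensor([0, 1])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few passes over the data are typically enough for fine-tuning
    batch = tokenizer(train_texts, padding=True, return_tensors="pt")
    loss = model(**batch, labels=train_labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```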

Historical languages, and in particular those that represent earlier stages of still-living, well-documented languages, are a particularly interesting instance of this problem. The available data are often scarce, highly heterogeneous, and necessarily finite, but their closeness to much better-resourced languages makes it tempting to apply so-called transfer learning techniques: reuse resources (data and systems) developed for their well-resourced descendants, and use the data available for the earlier stage of the language to adapt those resources to it.

In this talk, I will present completed and ongoing work carried out within the PROFITEROLE project (PRocessing Old French Instrumented TExts for the Representation Of Language Evolution), which focuses on the use of heterogeneous resources for the syntactic analysis of medieval French. Our experiments show that resources for contemporary French (in particular contextual word representations) can be exploited, through transfer learning techniques, to significantly improve the processing of earlier stages of French.
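To make the transfer idea concrete, the following hypothetical sketch extracts contextual representations of an Old French sentence from a model trained on contemporary French. The choice of camembert-base is an assumption for illustration only, not necessarily the model used in the project; in practice a parsing head fine-tuned on an Old French treebank would be trained on top of such representations.

```python
# Hypothetical sketch: reuse an encoder pre-trained on contemporary French
# ("camembert-base" is an illustrative choice, not the project's actual model)
# to obtain contextual representations of Old French text.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("camembert-base")
encoder = AutoModel.from_pretrained("camembert-base")

# An Old French sentence (the opening line of the Chanson de Roland),
# tokenized with the contemporary-French subword vocabulary.
sentence = "Carles li reis, nostre emperere magnes"
inputs = tokenizer(sentence, return_tensors="pt")

# One contextual vector per subword token; a dependency-parsing head
# fine-tuned on medieval French data would consume these vectors.
hidden_states = encoder(**inputs).last_hidden_state
print(hidden_states.shape)  # (1, number_of_subword_tokens, 768)
```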

 
References

  • Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4171–4186. Association for Computational Linguistics, 2019. https://doi.org/10.18653/v1/N19-1423.
  • Hedderich, Michael A., Lukas Lange, Heike Adel, Jannik Strötgen, and Dietrich Klakow. "A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios." In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2545–2568. Association for Computational Linguistics, 2021. https://doi.org/10.18653/v1/2021.naacl-main.201.
  • Howard, Jeremy, and Sebastian Ruder. "Universal Language Model Fine-tuning for Text Classification." In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 328–339. Association for Computational Linguistics, 2018. https://doi.org/10.18653/v1/P18-1031.
  • Peters, Matthew, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. "Deep Contextualized Word Representations." In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2227–2237. Association for Computational Linguistics, 2018. https://doi.org/10.18653/v1/N18-1202.
  • Ramponi, Alan, and Barbara Plank. "Neural Unsupervised Domain Adaptation in NLP—A Survey." In Proceedings of the 28th International Conference on Computational Linguistics, 6838–6855. International Committee on Computational Linguistics, 2020. https://doi.org/10.18653/v1/2020.coling-main.603.