Construction conjointe d'un corpus et d'un classifieur pour les registres de langue en français

English title : Joint building of a corpus and a classifier for language registers in French

Abstract : Language registers are an observable stylistic trait of texts and speech. However, they remain poorly studied in natural language processing. In this paper, we present a semi-supervised approach that jointly builds a corpus of texts labeled by register and an associated classifier. The approach starts from a small initial set of expert data. Using a massive, automatically retrieved collection of web pages, it proceeds iteratively, alternating between training an intermediate classifier and annotating new texts to augment the labeled corpus. We apply this approach to the formal, neutral, and informal registers. At the end of the process, the labeled corpus contains 800,000 texts, and the classifier, a neural network, reaches an accuracy of 87%.
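
The alternation described in the abstract (train an intermediate classifier, annotate new texts, grow the corpus, repeat) can be illustrated with a minimal self-training sketch. Everything below is an assumption made for illustration and is not taken from the paper: texts are reduced to a single hypothetical numeric feature, the intermediate classifier is a toy nearest-centroid model rather than a neural network, and new texts are accepted only when a hypothetical confidence margin `min_margin` is reached. The abstract does not say how newly annotated texts are selected.

```python
from collections import defaultdict


def train_centroids(labeled):
    """Toy stand-in for the intermediate classifier (assumption):
    one centroid per register over a single numeric feature."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, label in labeled:
        sums[label] += x
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, x):
    """Return (label, margin): the nearest centroid and how much closer
    it is than the runner-up. The margin acts as a confidence score."""
    ranked = sorted(centroids, key=lambda label: abs(x - centroids[label]))
    best = ranked[0]
    if len(ranked) == 1:
        return best, float("inf")
    margin = abs(x - centroids[ranked[1]]) - abs(x - centroids[best])
    return best, margin


def self_train(seed, pool, min_margin=0.5, max_rounds=10):
    """Alternate between training on the labeled set and annotating
    confidently classified items from the unlabeled pool.
    min_margin and max_rounds are hypothetical parameters."""
    labeled, unlabeled = list(seed), list(pool)
    for _ in range(max_rounds):
        centroids = train_centroids(labeled)
        confident, remaining = [], []
        for x in unlabeled:
            label, margin = predict(centroids, x)
            if margin >= min_margin:
                confident.append((x, label))
            else:
                remaining.append(x)
        if not confident:
            break  # nothing new to add: the corpus has stopped growing
        labeled.extend(confident)
        unlabeled = remaining
    return labeled, train_centroids(labeled)


# Small expert-labeled seed and a larger unlabeled pool (made-up values).
seed = [(-1.0, "informal"), (0.0, "neutral"), (1.0, "formal")]
pool = [-1.2, -0.9, 0.1, 0.95, 1.3, 0.5]
corpus, model = self_train(seed, pool)
```

In this toy run, the ambiguous item `0.5` sits between the neutral and formal centroids and is never added to the corpus, which shows why a selection criterion of some kind matters when the classifier labels its own training data.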

Cited literature : 31 references

https://hal.inria.fr/hal-02002601
Contributor : Gwénolé Lecorvé
Submitted on : Thursday, January 31, 2019 - 6:00:45 PM
Last modification on : Thursday, February 7, 2019 - 2:57:08 PM
Long-term archiving on : Wednesday, May 1, 2019 - 6:18:11 PM

File

registres_de_langue.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-02002601, version 1

Citation

Gwénolé Lecorvé, Hugo Ayats, Benoît Fournier, Jade Mekki, Jonathan Chevelu, et al.. Construction conjointe d'un corpus et d'un classifieur pour les registres de langue en français. Traitement automatique du langage naturel (TALN), May 2018, Rennes, France. ⟨hal-02002601⟩

Metrics

  • Record views : 60
  • File downloads : 65