Coupling an annotated corpus and a lexicon for state-of-the-art POS tagging

Pascal Denis 1 Benoît Sagot 1
1 ALPAGE - Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing
Inria Paris-Rocquencourt, UPD7 - Université Paris Diderot - Paris 7
Abstract: This paper investigates how best to couple hand-annotated data with information extracted from an external lexical resource to improve POS tagging performance. Focusing on French tagging, we introduce a maximum entropy conditional sequence tagging system that is enriched with information extracted from a morphological resource. This system achieves 97.7% accuracy on the French Treebank, an error reduction of 23% (28% on unknown words) over the same tagger without lexical information. We also conduct experiments on datasets and lexicons of varying sizes in order to assess the best trade-off between annotating data and developing a lexicon. We find that the use of a lexicon improves the quality of the tagger at any stage of development of either resource, and that for fixed performance levels the availability of the full lexicon consistently reduces the need for supervised data by at least one half.
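To make the error-reduction figure in the abstract concrete, here is a minimal sketch of the arithmetic, assuming the 23% is a relative reduction in token-level error rate. The function names are illustrative and not taken from the paper, and because the abstract's figures are rounded, the implied baseline is only approximate.

```python
def relative_error_reduction(baseline_acc: float, improved_acc: float) -> float:
    """Fraction of the baseline error rate removed by the improved system."""
    baseline_err = 1.0 - baseline_acc
    improved_err = 1.0 - improved_acc
    return (baseline_err - improved_err) / baseline_err


def implied_baseline_accuracy(improved_acc: float, reduction: float) -> float:
    """Invert the formula above: recover the baseline accuracy from the
    improved accuracy and the reported relative error reduction."""
    improved_err = 1.0 - improved_acc
    return 1.0 - improved_err / (1.0 - reduction)


# Figures from the abstract: 97.7% accuracy, 23% error reduction.
baseline = implied_baseline_accuracy(0.977, 0.23)
print(f"implied baseline accuracy: {baseline:.3f}")
```

Under these assumptions the tagger without lexical information would sit at roughly 97.0% accuracy, so the lexicon removes about 0.7 points of a roughly 3-point error rate.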
Document type:
Journal article
Language Resources and Evaluation, Springer Verlag, 2012, 46 (4), pp.721-736. 〈10.1007/s10579-012-9193-0〉

Cited literature [28 references]

https://hal.inria.fr/inria-00614819
Contributor: Pascal Denis
Submitted on: Friday, 20 July 2012 - 10:40:25
Last modified on: Thursday, 15 November 2018 - 20:27:26
Document(s) archived on: Sunday, 21 October 2012 - 02:20:17

File

lre12-denis-sagot.pdf
Files produced by the author(s)

Citation

Pascal Denis, Benoît Sagot. Coupling an annotated corpus and a lexicon for state-of-the-art POS tagging. Language Resources and Evaluation, Springer Verlag, 2012, 46 (4), pp.721-736. 〈10.1007/s10579-012-9193-0〉. 〈inria-00614819〉
