
Coupling an annotated corpus and a morphosyntactic lexicon for state-of-the-art POS tagging with less human effort

Pascal Denis 1 Benoît Sagot 1
1 ALPAGE - Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing
UPD7 - Université Paris Diderot - Paris 7, Inria Paris-Rocquencourt
Abstract : This paper investigates how best to couple hand-annotated data with information extracted from an external lexical resource to improve POS tagging performance. Focusing on French tagging, we introduce a maximum entropy conditional sequence tagging system that is enriched with information extracted from a morphological resource. This system achieves 97.7% accuracy on the French Treebank, an error reduction of 23% (28% on unknown words) over the same tagger without lexical information. We also conduct experiments on datasets and lexicons of varying sizes in order to assess the best trade-off between annotating data and developing a lexicon. We find that the use of a lexicon improves the quality of the tagger at any stage of development of either resource, and that, for fixed performance levels, the availability of the full lexicon consistently reduces the need for supervised data by at least one half.
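As a reading aid for the figures in the abstract, the relation between accuracy and relative error reduction can be sketched as follows. This is a minimal illustration, not code from the paper: the function names are hypothetical, and because the abstract reports rounded values (97.7% and 23%), the baseline accuracy derived here is only approximate.

```python
def relative_error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Fraction of the baseline's errors that the new system removes."""
    baseline_err = 1.0 - baseline_acc
    new_err = 1.0 - new_acc
    return (baseline_err - new_err) / baseline_err


def implied_baseline_accuracy(new_acc: float, reduction: float) -> float:
    """Baseline accuracy implied by a new accuracy and a relative error reduction."""
    new_err = 1.0 - new_acc
    # new_err = (1 - reduction) * baseline_err, so solve for baseline_err
    baseline_err = new_err / (1.0 - reduction)
    return 1.0 - baseline_err


# Figures taken from the abstract (rounded there, so the result is approximate).
baseline = implied_baseline_accuracy(new_acc=0.977, reduction=0.23)
print(f"implied baseline accuracy: {baseline:.3f}")
```

Under these rounded inputs, the tagger without lexical information would sit at roughly 97.0% accuracy, which shows why error reduction is the more informative measure at this performance level: a gain of well under one accuracy point removes nearly a quarter of the remaining errors.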
Document type : Conference papers

https://hal.inria.fr/inria-00514366
Contributor : Pascal Denis
Submitted on : Thursday, September 2, 2010 - 9:03:19 AM
Last modification on : Thursday, February 11, 2021 - 2:38:02 PM

Identifiers

  • HAL Id : inria-00514366, version 1

Citation

Pascal Denis, Benoît Sagot. Coupling an annotated corpus and a morphosyntactic lexicon for state-of-the-art POS tagging with less human effort. Pacific Asia Conference on Language, Information and Computation, 2009, Hong Kong, China. ⟨inria-00514366⟩
