Journal article, Archives of Control Sciences, Year: 2005

From Raw Corpus to Word Lattices: Robust Pre-parsing Processing with SxPipe

Abstract

We present a robust, full-featured architecture for preprocessing text before parsing. This architecture, called SxPipe, converts raw noisy corpora into word lattices, one per sentence, that a parser can take as input. It applies, in sequence, named-entity recognition, tokenization and sentence boundary detection, lexicon-aware named-entity recognition, spelling correction, and non-deterministic multi-word processing, re-accentuation and un-/re-capitalization. Although our system currently handles French, almost all of its components are language-independent, and the others can be straightforwardly adapted to virtually any inflectional language. The output is a sequence of word lattices in which every word is present in the lexicon. SxPipe has been applied on a large scale during a French parsing evaluation campaign and in large-corpus parsing experiments, showing both good efficiency and very satisfactory precision and recall.
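To make the notion of a word lattice concrete, here is a minimal sketch in Python. It is not SxPipe's code: the toy lexicon, the function names (`tokenize`, `build_lattice`, `paths`) and the whitespace tokenizer are all assumptions made for illustration. It covers only one idea from the abstract, non-deterministic multi-word processing, and leaves out named-entity recognition, spelling correction, re-accentuation and capitalization handling.

```python
from typing import List, Set, Tuple

# Hypothetical toy lexicon; a real system would rely on a full French lexicon.
LEXICON: Set[str] = {"la", "pomme", "de", "terre", "pomme de terre", "cuit"}
MAX_COMPOUND_LEN = 3  # longest multi-word entry considered, in tokens

Edge = Tuple[int, int, str]  # (start position, end position, lexicon form)


def tokenize(sentence: str) -> List[str]:
    """Lowercase and split on whitespace (a stand-in for real tokenization)."""
    return sentence.lower().split()


def build_lattice(tokens: List[str]) -> List[Edge]:
    """Return one edge per known token and one per known multi-word unit."""
    edges: List[Edge] = []
    for start in range(len(tokens)):
        for length in range(1, MAX_COMPOUND_LEN + 1):
            end = start + length
            if end > len(tokens):
                break
            form = " ".join(tokens[start:end])
            if form in LEXICON:
                edges.append((start, end, form))
    return edges


def paths(edges: List[Edge], n: int, pos: int = 0) -> List[List[str]]:
    """Enumerate every reading of the lattice from position pos to n."""
    if pos == n:
        return [[]]
    result: List[List[str]] = []
    for start, end, form in edges:
        if start == pos:
            for rest in paths(edges, n, end):
                result.append([form] + rest)
    return result


tokens = tokenize("La pomme de terre cuit")
lattice = build_lattice(tokens)
readings = paths(lattice, len(tokens))
for reading in readings:
    print(reading)
```

In this sketch the lattice keeps both the word-by-word reading and the compound reading "pomme de terre", so the choice between them is deferred to the parser instead of being made during preprocessing.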
Main file
SagotBoullierACS05.pdf (72.51 KB)
Origin: Files produced by the author(s)

Dates and versions

inria-00521228, version 1 (26-09-2010)

Identifiers

  • HAL Id: inria-00521228, version 1

Cite

Benoît Sagot, Pierre Boullier. From Raw Corpus to Word Lattices: Robust Pre-parsing Processing with SxPipe. Archives of Control Sciences, 2005, Language and Technology. Human Language Technologies as a Challenge for Computer Science and Linguistics, 15 (4), pp.653-662. ⟨inria-00521228⟩

Collections

INRIA INRIA2
