From Raw Corpus to Word Lattices: Robust Pre-parsing Processing with SxPipe

Abstract : We present a robust, full-featured architecture for preprocessing text before parsing. This architecture, called SxPipe, converts raw noisy corpora into word lattices, one per sentence, that can be used as input by a parser. It applies, in sequence, named-entity recognition, tokenization and sentence boundary detection, lexicon-aware named-entity recognition, spelling correction, and non-deterministic multi-word processing, re-accentuation and un-/re-capitalization. Although our system currently deals with French, almost all components are in fact language-independent, and the others can be straightforwardly adapted to virtually any inflectional language. The output is a sequence of word lattices in which every word is present in the lexicon. It has been applied on a large scale during a French parsing evaluation campaign and in large-corpus parsing experiments, showing both good efficiency and very satisfactory precision and recall.
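To make the notion of a word lattice concrete, the following minimal sketch shows how a non-deterministic multi-word step might keep two readings of the same span. It is an illustration only, not SxPipe's implementation: the function names, the one-entry lexicon and the underscore-joined form are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical toy lexicon of multi-word units (an assumption for this
# sketch, not SxPipe's actual lexicon or output format).
MULTIWORDS = {("pomme", "de", "terre"): "pomme_de_terre"}


def build_lattice(tokens, multiwords=MULTIWORDS):
    """Build a DAG {start_state: [(end_state, form), ...]} over token positions."""
    lattice = defaultdict(list)
    # One edge per simple token: state i -> state i + 1.
    for i, tok in enumerate(tokens):
        lattice[i].append((i + 1, tok))
    # One extra edge per multi-word match, spanning several tokens.
    # The simple-token edges are kept, so both readings stay available.
    for i in range(len(tokens)):
        for parts, form in multiwords.items():
            j = i + len(parts)
            if tuple(tokens[i:j]) == parts:
                lattice[i].append((j, form))
    return dict(lattice)


def paths(lattice, state, final):
    """Enumerate every sequence of forms leading from `state` to `final`."""
    if state == final:
        yield []
        return
    for nxt, form in lattice.get(state, []):
        for rest in paths(lattice, nxt, final):
            yield [form] + rest


tokens = "la pomme de terre".split()
lattice = build_lattice(tokens)
readings = list(paths(lattice, 0, len(tokens)))
for reading in readings:
    print(" ".join(reading))
```

Each path through the lattice is one candidate segmentation of the sentence; a parser consuming the lattice can then choose between the multi-word reading and the word-by-word reading.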
Document type :
Journal articles

https://hal.inria.fr/inria-00521228
Contributor : Benoît Sagot
Submitted on : Sunday, September 26, 2010 - 9:50:33 PM
Last modification on : Thursday, August 29, 2019 - 2:24:09 PM
Long-term archiving on : Monday, December 27, 2010 - 2:42:51 AM

File

SagotBoullierACS05.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : inria-00521228, version 1

Citation

Benoît Sagot, Pierre Boullier. From Raw Corpus to Word Lattices: Robust Pre-parsing Processing with SxPipe. Archives of Control Sciences, Polish Academy of Sciences, 2005, Language and Technology. Human Language Technologies as a Challenge for Computer Science and Linguistics, 15 (4), pp.653-662. ⟨inria-00521228⟩
