Journal articles

From Raw Corpus to Word Lattices: Robust Pre-parsing Processing with SxPipe

Abstract: We present a robust, full-featured architecture for preprocessing text before parsing. This architecture, called SxPipe, converts raw, noisy corpora into word lattices, one per sentence, that a parser can take as input. It applies the following steps in sequence: named-entity recognition, tokenization and sentence boundary detection, lexicon-aware named-entity recognition, spelling correction, and non-deterministic multi-word processing, re-accentuation and un-/re-capitalization. Although the system currently handles French, almost all of its components are language-independent, and the others can be straightforwardly adapted to virtually any inflectional language. The output is a sequence of word lattices in which every word is present in the lexicon. SxPipe has been applied on a large scale during a French parsing evaluation campaign and in large-corpus parsing experiments, showing both good efficiency and very satisfactory precision and recall.
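The idea of turning raw text into one word lattice per sentence can be illustrated with a toy sketch. This is not SxPipe's implementation or interface: the lexicon, the correction table, the `_UNKNOWN` placeholder, and every function name below are hypothetical and chosen only for illustration. The sketch covers a small subset of the steps listed in the abstract (sentence splitting, tokenization, lowercasing, dictionary-based spelling correction, and non-deterministic multi-word handling) and uses English words for readability.

```python
from dataclasses import dataclass

# Hypothetical toy lexicon; a real system would rely on a full lexicon.
LEXICON = {"the", "cat", "sat", "it", "is", "a", "priori", "a_priori", "true"}
# Hypothetical correction table standing in for a real spelling corrector.
CORRECTIONS = {"teh": ["the"]}
# Hypothetical placeholder for words that stay unknown after correction.
UNKNOWN = "_UNKNOWN"


@dataclass(frozen=True)
class Edge:
    """One lattice transition: the word `form` spans token positions start..end."""
    start: int
    end: int
    form: str


def split_sentences(text):
    """Naive sentence splitter: cut at every full stop."""
    return [s.strip() for s in text.split(".") if s.strip()]


def tokenize(sentence):
    """Naive tokenizer: lowercase, then split on whitespace."""
    return sentence.lower().split()


def build_lattice(tokens):
    """Build a word lattice (a list of edges) whose forms come from the lexicon."""
    edges = []
    for i, tok in enumerate(tokens):
        if tok in LEXICON:
            edges.append(Edge(i, i + 1, tok))
        else:
            # Keep every candidate correction as an alternative path.
            candidates = CORRECTIONS.get(tok, []) or [UNKNOWN]
            for cand in candidates:
                edges.append(Edge(i, i + 1, cand))
        # Non-deterministic multi-word handling: add a longer edge for a
        # known compound while keeping the single-word edges as well.
        if i + 1 < len(tokens):
            compound = tok + "_" + tokens[i + 1]
            if compound in LEXICON:
                edges.append(Edge(i, i + 2, compound))
    return edges


def preprocess(text):
    """Return one lattice per sentence of the raw input text."""
    return [build_lattice(tokenize(s)) for s in split_sentences(text)]


if __name__ == "__main__":
    for lattice in preprocess("Teh cat sat. It is a priori true."):
        print(lattice)
```

In this sketch the second sentence yields both the path through "a" and "priori" and the single edge for "a_priori", leaving the choice between the two readings to the parser.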
Document type: Journal articles
Contributor: Benoît Sagot
Submitted on: Sunday, September 26, 2010 - 9:50:33 PM
Last modification on: Friday, January 21, 2022 - 3:17:49 AM
Long-term archiving on: Monday, December 27, 2010 - 2:42:51 AM




  • HAL Id: inria-00521228, version 1



Benoît Sagot, Pierre Boullier. From Raw Corpus to Word Lattices: Robust Pre-parsing Processing with SxPipe. Archives of Control Sciences, Polish Academy of Sciences, 2005, Language and Technology. Human Language Technologies as a Challenge for Computer Science and Linguistics, 15 (4), pp.653-662. ⟨inria-00521228⟩


