From Raw Corpus to Word Lattices: Robust Pre-parsing Processing with SxPipe

Benoît Sagot; Pierre Boullier

Journal article, Archives of Control Sciences, Year: 2005


Benoît Sagot

Role: Author
PersonId : 1461
IdHAL : bsagot
ORCID : 0000-0002-0107-8526
IdRef : 177454229

Software tools for natural language

Pierre Boullier

Role: Author

Software tools for natural language

Abstract

We present a robust, full-featured architecture for preprocessing text before parsing. This architecture, called SxPipe, converts raw, noisy corpora into word lattices, one per sentence, that a parser can take as input. It applies, in sequence, named-entity recognition, tokenization and sentence-boundary detection, lexicon-aware named-entity recognition, spelling correction, and non-deterministic multi-word processing, re-accentuation, and un-/re-capitalization. Although our system currently handles French, almost all components are in fact language-independent, and the others can be straightforwardly adapted to virtually any inflectional language. The output is a sequence of word lattices in which every word is present in the lexicon. SxPipe has been applied on a large scale during a French parsing evaluation campaign and in large-corpus parsing experiments, showing both good efficiency and very satisfying precision and recall.
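To make the notion of a word lattice concrete, here is a minimal Python sketch. It is not SxPipe's implementation: the function name `build_lattice`, the `_` joiner for multi-word forms, the `max_len` window, and the toy lexicon are all assumptions made for illustration. It covers only two of the steps listed above (non-deterministic multi-word processing and un-capitalization) and simply skips unknown forms, where the real system would attempt spelling correction.

```python
from typing import List, Set, Tuple

# An edge spans token positions [start, end) and carries a lexicon form.
Edge = Tuple[int, int, str]


def build_lattice(tokens: List[str], lexicon: Set[str], max_len: int = 3) -> List[Edge]:
    """Return lattice edges over token positions.

    Hypothetical helper: every edge label is guaranteed to be in `lexicon`.
    Multi-word candidates are built by joining up to `max_len` adjacent
    tokens with "_" (an assumed convention, not taken from the paper).
    """
    edges: List[Edge] = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
            form = "_".join(tokens[i:j])
            # Try the form as written, then its un-capitalized variant.
            for candidate in (form, form.lower()):
                if candidate in lexicon:
                    edges.append((i, j, candidate))
                    break
    return edges


# Toy lexicon (assumed): single words plus one multi-word unit.
lexicon = {"la", "pomme", "de", "terre", "pomme_de_terre"}
lattice = build_lattice(["La", "pomme", "de", "terre"], lexicon)
for edge in lattice:
    print(edge)
```

The lattice keeps both readings of "pomme de terre" (three separate words, or one multi-word unit), leaving the choice to the parser; this is the sense in which the multi-word step is non-deterministic.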

Domains

Computation and Language [cs.CL]

Main file

SagotBoullierACS05.pdf (72.51 KB)

Origin: Files produced by the author(s)


https://inria.hal.science/inria-00521228

Submitted on: Sunday, September 26, 2010, 21:50:33

Last modified on: Friday, February 4, 2022, 03:10:34

Long-term archiving on: Monday, December 27, 2010, 02:42:51

Dates and versions

inria-00521228, version 1 (26-09-2010)

Identifiers

HAL Id: inria-00521228, version 1

Cite

Benoît Sagot, Pierre Boullier. From Raw Corpus to Word Lattices: Robust Pre-parsing Processing with SxPipe. Archives of Control Sciences, 2005, Language and Technology. Human Language Technologies as a Challenge for Computer Science and Linguistics, 15 (4), pp.653-662. ⟨inria-00521228⟩


Collections

INRIA INRIA2

138 Views

125 Downloads
