Détection et correction automatique d'entités nommées dans des corpus OCRisés (Automatic detection and correction of named entities in OCRed corpora)

Benoît Sagot (1), Kata Gábor (1)
(1) ALPAGE - Analyse Linguistique Profonde à Grande Echelle (Large-scale deep linguistic processing), Inria Paris-Rocquencourt, UPD7 - Université Paris Diderot - Paris 7
Abstract : Correcting text obtained by optical character recognition (OCR) to editorial quality is expensive, as it still requires human intervention. Statistical models for automated error detection and correction are inherently limited in coverage to errors that involve general language. However, a large number of errors occur in domain-specific named entities, especially in data such as patent corpora or legal texts. In this paper, we propose a rule-based architecture for identifying and correcting a wide range of named entities (proper names not included). We show that our architecture achieves good recall and excellent correction accuracy on error types that are difficult to address with statistical approaches.
Document type : Conference paper

Contributor : Benoît Sagot
Submitted on : Thursday, July 10, 2014 - 12:27:25 PM
Last modification on : Thursday, August 29, 2019 - 2:24:02 PM
Long-term archiving on : Friday, October 10, 2014 - 11:36:57 AM


  • HAL Id : hal-01022378, version 1



Benoît Sagot, Kata Gábor. Détection et correction automatique d'entités nommées dans des corpus OCRisés. Traitement Automatique du Langage Naturel 2014, Jul 2014, Marseille, France. ⟨hal-01022378⟩


