Conference papers

Détection et correction automatique d'entités nommées dans des corpus OCRisés [Automatic detection and correction of named entities in OCRed corpora]

Benoît Sagot 1 Kata Gábor 1 
1 ALPAGE - Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing
Inria Paris-Rocquencourt, UPD7 - Université Paris Diderot - Paris 7
Abstract : Correcting textual data obtained by optical character recognition (OCR) to reach editorial quality is an expensive task, as it still requires human intervention. The coverage of statistical models for automated error detection and correction is inherently limited to errors that involve general language. However, a large number of errors occur in domain-specific named entities, especially in data such as patent corpora or legal texts. In this paper, we propose a rule-based architecture for the identification and correction of a wide range of named entities (proper names not included). We show that our architecture achieves good recall and excellent correction accuracy on error types that are difficult to address with statistical approaches.
Document type :
Conference papers

Cited literature : 12 references
Contributor : Benoît Sagot
Submitted on : Thursday, July 10, 2014 - 12:27:25 PM
Last modification on : Thursday, November 3, 2022 - 3:52:20 AM
Long-term archiving on : Friday, October 10, 2014 - 11:36:57 AM


Files produced by the author(s)


  • HAL Id : hal-01022378, version 1


Benoît Sagot, Kata Gábor. Détection et correction automatique d'entités nommées dans des corpus OCRisés. Traitement Automatique du Langage Naturel 2014, Jul 2014, Marseille, France. ⟨hal-01022378⟩


