Détection et correction automatique d'entités nommées dans des corpus OCRisés (Automatic detection and correction of named entities in OCRed corpora)

Benoît Sagot (1), Kata Gábor (1)
(1) ALPAGE - Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing
Inria Paris-Rocquencourt, UPD7 - Université Paris Diderot - Paris 7
Abstract : Correcting textual data obtained by optical character recognition (OCR) to editorial quality is expensive, as it still requires human intervention. The coverage of statistical models for automated error detection and correction is inherently limited to errors that pertain to general language. However, a large number of errors occur in domain-specific named entities, especially in data such as patent corpora or legal texts. In this paper, we propose a rule-based architecture for the identification and correction of a wide range of named entities (proper names not included). We show that our architecture achieves good recall and excellent correction accuracy on error types that are difficult to address with statistical approaches.
Document type : Conference paper

https://hal.inria.fr/hal-01022378
Contributor : Benoît Sagot
Submitted on : Thursday, July 10, 2014 - 12:27:25 PM
Last modification on : Thursday, August 29, 2019 - 2:24:02 PM
Long-term archiving on : Friday, October 10, 2014 - 11:36:57 AM

File

taln14pacte_short.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01022378, version 1

Citation

Benoît Sagot, Kata Gábor. Détection et correction automatique d'entités nommées dans des corpus OCRisés. Traitement Automatique du Langage Naturel 2014, Jul 2014, Marseille, France. ⟨hal-01022378⟩

Metrics

Record views : 414
File downloads : 2216