Metrics for Complete Evaluation of OCR Performance

Romain Karpinski (1), Devashish Lohani (1), Abdel Belaid (2, 1)
(1) READ - Recognition of writing and analysis of documents
LORIA - NLPKD - Department of Natural Language Processing & Knowledge Discovery
Abstract : In this paper, we study metrics for evaluating OCR performance, both in terms of physical segmentation and in terms of textual content recognition. These metrics depend on the input formats of the OCR output (the hypothesis) and of the reference (also called the ground truth). Two evaluation criteria are considered: the quality of segmentation and the character recognition rate. Three pairs of input formats are selected from two types of input: text only (text) and text with spatial information (xml). These reference-to-hypothesis pairs are: 1) text-to-text, 2) xml-to-xml, and 3) text-to-xml. For the text-to-text pair, we selected the RETAS method to perform experiments and show its limits. For text-to-xml, a new method based on unique word anchors is proposed to solve the problem of aligning texts that carry different information. We define the ZoneMapAltCnt metric for the xml-to-xml approach and show that it offers the most reliable and complete evaluation of the three. Open-source OCR engines such as Tesseract and OCRopus are selected to perform experiments. The datasets used are a collection of documents from the ISTEX document database and from the French newspaper "Le Nouvel Observateur", as well as invoices and administrative documents gathered from different collaborations.
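
As an illustration of the unique-word-anchor idea mentioned in the abstract, the following is a minimal sketch and not the authors' implementation, whose details are not given on this page. It assumes that an anchor is a whitespace-separated word occurring exactly once in both the reference and the hypothesis, and that anchors must keep the same relative order in both texts (enforced here with a longest increasing subsequence). The function name `unique_word_anchors` and these choices are hypothetical.

```python
from bisect import bisect_left
from collections import Counter


def unique_word_anchors(reference: str, hypothesis: str) -> list[tuple[int, int]]:
    """Return (ref_index, hyp_index) pairs for words that are unique in both
    texts, keeping only pairs that appear in the same relative order."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    ref_counts = Counter(ref_words)
    hyp_counts = Counter(hyp_words)

    # Position of every word that occurs exactly once in the hypothesis.
    hyp_pos = {w: i for i, w in enumerate(hyp_words) if hyp_counts[w] == 1}

    # Candidate anchors, listed in reference order.
    candidates = [
        (i, hyp_pos[w])
        for i, w in enumerate(ref_words)
        if ref_counts[w] == 1 and w in hyp_pos
    ]

    # Longest increasing subsequence over hypothesis positions, so that
    # anchors never cross each other (assumed ordering constraint).
    tails: list[int] = []      # smallest hyp index ending a run of each length
    tails_idx: list[int] = []  # index into candidates for each tail
    prev = [-1] * len(candidates)
    for n, (_, h) in enumerate(candidates):
        k = bisect_left(tails, h)
        if k == len(tails):
            tails.append(h)
            tails_idx.append(n)
        else:
            tails[k] = h
            tails_idx[k] = n
        prev[n] = tails_idx[k - 1] if k > 0 else -1

    # Walk back from the end of the longest run to recover the anchors.
    anchors = []
    n = tails_idx[-1] if tails_idx else -1
    while n != -1:
        anchors.append(candidates[n])
        n = prev[n]
    anchors.reverse()
    return anchors
```

Under these assumptions, the text between two consecutive anchors can then be compared locally, for example with a character-level edit distance, so that a misrecognized or repeated word does not shift the alignment of the whole document.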
Document type : Conference papers

https://hal.inria.fr/hal-01981731
Contributor : Abdel Belaid
Submitted on : Tuesday, January 15, 2019 - 11:24:14 AM
Last modification on : Friday, January 15, 2021 - 5:42:02 PM
Long-term archiving on : Tuesday, April 16, 2019 - 1:31:13 PM

Identifiers

  • HAL Id : hal-01981731, version 1

Citation

Romain Karpinski, Devashish Lohani, Abdel Belaid. Metrics for Complete Evaluation of OCR Performance. IPCV'18 - The 22nd Int'l Conf on Image Processing, Computer Vision, & Pattern Recognition, Jul 2018, Las Vegas, United States. ⟨hal-01981731⟩
