Conference papers

Metrics for Complete Evaluation of OCR Performance

Romain Karpinski (1), Devashish Lohani (1), Abdel Belaid (2, 1)
(1) READ - Recognition of writing and analysis of documents
LORIA - NLPKD - Department of Natural Language Processing & Knowledge Discovery
Abstract : In this paper, we study metrics for evaluating OCR performance in terms of both physical segmentation and textual content recognition. These metrics rely on the input format of the OCR output (hypothesis) and of the reference (also called ground truth). Two evaluation criteria are considered: the quality of segmentation and the character recognition rate. Three pairs of input formats are selected from two types of input: text only (text) and text with spatial information (xml). These reference-to-hypothesis pairs are: 1) text-to-text, 2) xml-to-xml and 3) text-to-xml. For the text-to-text pair, we selected the RETAS method to perform experiments and show its limits. For text-to-xml, a new method based on unique word anchors is proposed to solve the problem of aligning texts with different information. We define the ZoneMapAltCnt metric for the xml-to-xml approach and show that it offers the most reliable and complete evaluation of the three. Open-source OCRs such as Tesseract and OCRopus are selected to perform experiments. The datasets used are a collection of documents from the ISTEX document database and from the French newspaper "Le Nouvel Observateur", as well as invoices and administrative documents gathered from different collaborations.
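As an illustration of the anchoring idea mentioned in the abstract, the sketch below finds words that occur exactly once in both the reference and the hypothesis and keeps those whose positions advance in both texts. This is a minimal sketch under stated assumptions, not the method from the paper: the function name `unique_word_anchors`, the whitespace tokenization, and the greedy order filter are all hypothetical choices made for this example.

```python
from collections import Counter
from typing import List, Tuple


def unique_word_anchors(reference: str, hypothesis: str) -> List[Tuple[int, int]]:
    """Return (ref_index, hyp_index) pairs usable as alignment anchors.

    Assumption (not from the paper): a word qualifies as an anchor when it
    occurs exactly once in the reference and exactly once in the hypothesis.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    ref_counts = Counter(ref_words)
    hyp_counts = Counter(hyp_words)

    # Position of every word that is unique in the hypothesis.
    hyp_position = {
        word: index
        for index, word in enumerate(hyp_words)
        if hyp_counts[word] == 1
    }

    anchors = []
    last_hyp_index = -1
    for ref_index, word in enumerate(ref_words):
        if ref_counts[word] != 1 or word not in hyp_position:
            continue
        hyp_index = hyp_position[word]
        # Greedy filter: drop anchors that would go backwards in the hypothesis.
        if hyp_index > last_hyp_index:
            anchors.append((ref_index, hyp_index))
            last_hyp_index = hyp_index
    return anchors


reference_text = "the quick brown fox jumps over the lazy dog"
hypothesis_text = "the qu1ck brown fox jumps ovr the lazy dog"
anchors = unique_word_anchors(reference_text, hypothesis_text)
print(anchors)
```

In this toy input, misrecognized words such as "qu1ck" and the repeated word "the" produce no anchors, so only the remaining words pin the two texts together. The text between two consecutive anchors could then be compared locally, for example with an edit distance.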
Document type :
Conference papers
Contributor : Abdel Belaid
Submitted on : Tuesday, January 15, 2019 - 11:24:14 AM
Last modification on : Wednesday, November 3, 2021 - 7:56:49 AM
Long-term archiving on : Tuesday, April 16, 2019 - 1:31:13 PM


Files produced by the author(s)


  • HAL Id : hal-01981731, version 1



Romain Karpinski, Devashish Lohani, Abdel Belaid. Metrics for Complete Evaluation of OCR Performance. IPCV'18 - The 22nd Int'l Conf on Image Processing, Computer Vision, & Pattern Recognition, Jul 2018, Las Vegas, United States. ⟨hal-01981731⟩


