Automatic and interactive rule inference without ground truth

Abstract : Designing a document recognition system from non-annotated documents is not an easy task. In general, statistical methods cannot learn without an annotated ground truth, whereas syntactical methods can. However, the ability of syntactical methods to deal with non-annotated data comes from the fact that the description is written manually by a user. Adapting to a new kind of document is therefore tedious, as the whole manual process of knowledge extraction has to be redone. In this paper, we propose a method to extract knowledge and generate rules without any ground truth. Using a large volume of non-annotated documents, it is possible to study the redundancy of some elements extracted from the document images. This redundancy is exploited through an automatic clustering algorithm. An interaction with the user then brings semantics to the detected clusters. In this work, the extracted elements are keywords obtained with word spotting. The approach has been applied to field detection in old marriage records on the FamilySearch HIP2013 competition database. The results demonstrate that rules can be inferred automatically from non-annotated documents by using the redundancy of the elements extracted from them.
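The pipeline outlined in the abstract (cluster redundant keyword detections, then let a user name the clusters) can be illustrated with a minimal sketch. This is not the authors' algorithm: the greedy leader clustering, the function names `cluster_positions` and `infer_rules`, the normalized coordinates, the `radius` and `min_support` parameters, and the field labels are all assumptions made for illustration.

```python
from math import hypot


def cluster_positions(points, radius):
    """Greedy leader clustering (an assumed stand-in for the paper's clustering).

    Each point joins the first cluster whose centroid lies within `radius`;
    otherwise it starts a new cluster.
    """
    clusters = []  # each cluster: {"members": [(x, y), ...], "centroid": (x, y)}
    for x, y in points:
        for c in clusters:
            cx, cy = c["centroid"]
            if hypot(x - cx, y - cy) <= radius:
                c["members"].append((x, y))
                n = len(c["members"])
                # Recompute the centroid after adding the new member.
                c["centroid"] = (
                    sum(p[0] for p in c["members"]) / n,
                    sum(p[1] for p in c["members"]) / n,
                )
                break
        else:
            clusters.append({"members": [(x, y)], "centroid": (x, y)})
    return clusters


def infer_rules(clusters, labels, min_support):
    """Turn frequent, user-labelled clusters into simple positional rules."""
    rules = []
    for idx, c in enumerate(clusters):
        # Keep only clusters that are redundant enough and that the user named.
        if len(c["members"]) < min_support or idx not in labels:
            continue
        xs = [p[0] for p in c["members"]]
        ys = [p[1] for p in c["members"]]
        rules.append(
            {"field": labels[idx], "box": (min(xs), min(ys), max(xs), max(ys))}
        )
    return rules


# Hypothetical normalized (x, y) positions of spotted keywords across pages.
detections = [
    (0.10, 0.20), (0.11, 0.21), (0.12, 0.19),
    (0.80, 0.75), (0.81, 0.76),
    (0.50, 0.50),
]
clusters = cluster_positions(detections, radius=0.05)

# Interactive step: the user inspects the clusters and names the meaningful ones.
user_labels = {0: "groom_name", 1: "date"}  # hypothetical field names
rules = infer_rules(clusters, user_labels, min_support=2)
```

In this sketch, an isolated detection forms a cluster of its own and yields no rule because it falls below `min_support`; only positions that recur across documents and receive a user label become rules.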
Document type :
Conference papers
Cited literature [13 references]

https://hal.inria.fr/hal-01197470
Contributor : Cérès Carton
Submitted on : Friday, September 11, 2015 - 5:33:17 PM
Last modification on : Friday, November 16, 2018 - 1:35:40 AM
Long-term archiving on : Tuesday, December 29, 2015 - 12:45:34 AM

File

icdar_2015_ccarton_hal.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01197470, version 1

Citation

Cérès Carton, Aurélie Lemaitre, Bertrand Coüasnon. Automatic and interactive rule inference without ground truth. International Conference on Document Analysis and Recognition (ICDAR), Aug 2015, Nancy, France. ⟨hal-01197470⟩
