Semi-Automatic and Interactive Inference of Rules Without Ground Truth

Abstract : Designing a document recognition system from non-annotated documents is not an easy task. In general, statistical methods cannot learn without an annotated ground truth, whereas syntactical methods can. However, the ability of syntactical methods to deal with non-annotated data comes from the fact that the description is written manually by a user. Adapting to a new kind of document is therefore tedious, as the whole manual knowledge-extraction process has to be redone. In this paper, we propose a method to extract knowledge and generate rules without any ground truth. Using a large volume of non-annotated documents, it is possible to study the redundancy of certain elements extracted from the document images. This redundancy is exploited through an automatic clustering algorithm, and interaction with the user brings semantics to the detected clusters. In this work, the extracted elements are keywords obtained with word spotting. The approach has been applied to field detection in old marriage records on the Family-Search HIP2013 competition database. The results demonstrate that we successfully infer rules automatically from non-annotated documents by using the redundancy of the elements extracted from them.
Document type :
Conference papers

https://hal.inria.fr/hal-01492921
Contributor : Aurélie Lemaitre
Submitted on : Tuesday, March 21, 2017 - 8:45:04 AM
Last modification on : Thursday, February 7, 2019 - 3:04:59 PM
Long-term archiving on : Thursday, June 22, 2017 - 12:14:50 PM

File

CIFED_2016_paper_15.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01492921, version 1

Citation

Cérès Carton, Aurélie Lemaitre, Bertrand Coüasnon. Inférence semi-automatique et interactive de règles sans vérité terrain. Conférence Internationale Francophone sur l'Ecrit et le Document (CIFED'2016), Mar 2016, Toulouse, France. ⟨hal-01492921⟩
