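Inférence semi-automatique et interactive de règles avec ou sans vérité terrain pour la reconnaissance de structure de documents (Semi-automatic and interactive inference of rules, with or without ground truth, for document structure recognition)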

Abstract : The documents handled in document structure analysis are becoming increasingly complex, and the corpora increasingly heterogeneous. We propose a new method, the Eyes Wide Open (EWO) method, which introduces a semi-automatic and interactive learning step into the construction of grammatical descriptions. The EWO method combines the expressiveness of syntactical methods with the adaptability of statistical methods. Its rule inference progressively builds the full grammatical description of the documents and covers both their logical and their physical structure. The method relies on two major elements: the automatic discovery of structures with a clustering algorithm, and interaction with the user to give meaning to the automatically detected structures. It can infer rules without annotated ground truth on the documents. To do so, it analyzes redundancies in large volumes of non-annotated documents; the redundancies are detected automatically with a clustering algorithm. A data reliability enhancement step, performed in interaction with the user on the automatically detected elements, then yields the labeled training data. The EWO method offers an exhaustive and concise view of the data to analyze. It makes better use of the corpus than manually described syntactical methods, and it handles rare cases better than statistical methods do. We validated the efficiency of this method on documents with various structures (handwritten business letters, marriage records, forms...). For each corpus, a grammatical description was generated using the EWO method, with results at least similar to those of the pre-existing manually described systems. The method was also successfully applied to a large non-annotated corpus.
Document type :
Theses

https://hal.inria.fr/tel-01492966
Contributor : Aurélie Lemaitre
Submitted on : Monday, March 20, 2017 - 4:54:40 PM
Last modification on : Friday, January 11, 2019 - 3:15:15 PM

Identifiers

  • HAL Id : tel-01492966, version 1

Citation

Cérès Carton. Inférence semi-automatique et interactive de règles avec ou sans vérité terrain pour la reconnaissance de structure de documents. Traitement du texte et du document. INSA de Rennes, 2016. Français. ⟨tel-01492966⟩
