Interactive Tuples Extraction from Semi-Structured Data

This paper studies, from a machine learning viewpoint, the problem of extracting tuples of a target n-ary relation from tree-structured data such as XML or XHTML documents. Our system can extract, without any post-processing, tuples for all data structures, including nested, rotated and cross tables. The wrapper induction algorithm we propose is based on two main ideas. It is incremental: partial tuples are extracted by increasing length. It is based on a representation-enrichment procedure: partial tuples of length i are encoded with the knowledge of extracted tuples of length i − 1. The algorithm is then embedded in a user-friendly interactive wrapper induction system for Web documents. We evaluate our system on several information extraction tasks over corporate Web sites. It achieves state-of-the-art results on simple data structures and succeeds on complex data structures where previous approaches fail. Experiments also show that our interactive framework significantly reduces the number of user interactions needed to build a wrapper.
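The two ideas in the abstract (incremental extraction by increasing tuple length, and encoding each candidate with knowledge of the partial tuple already extracted) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the sample document, the function names, and the hand-written predicates `accept_name` and `accept_phone` are all hypothetical stand-ins for the classifiers that the paper's system would learn from user-annotated examples.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; not taken from the paper.
DOC = """
<table>
  <tr><td class="name">Alice</td><td class="phone">111</td></tr>
  <tr><td class="name">Bob</td><td class="phone">222</td></tr>
</table>
"""

def candidate_nodes(root):
    """Collect (path, node) pairs for every node that carries text.

    The path is a tuple of child indexes from the root, in document order.
    """
    nodes = []

    def walk(node, path):
        if node.text and node.text.strip():
            nodes.append((path, node))
        for index, child in enumerate(node):
            walk(child, path + (index,))

    walk(root, ())
    return nodes

# Stand-ins for learned classifiers (assumption: the real system induces these).
def accept_name(partial, cand):
    """First component: does not depend on any earlier component."""
    _, node = cand
    return node.get("class") == "name"

def accept_phone(partial, cand):
    """Second component: sees the partial tuple of length 1 and requires
    the candidate to share a parent (same row) with its last component."""
    path, node = cand
    prev_path, _ = partial[-1]
    return node.get("class") == "phone" and path[:-1] == prev_path[:-1]

def extract_tuples(root, classifiers):
    """Grow partial tuples one component at a time, by increasing length."""
    cands = candidate_nodes(root)
    partials = [()]  # the single empty partial tuple of length 0
    for accept in classifiers:
        partials = [p + (c,) for p in partials for c in cands if accept(p, c)]
    return [tuple(node.text.strip() for _, node in p) for p in partials]

root = ET.fromstring(DOC.strip())
tuples = extract_tuples(root, [accept_name, accept_phone])
print(tuples)
```

Passing the partial tuple to each predicate is what lets the second component be tied to the first without any post-processing step to pair values up; a learned classifier would presumably use richer features of the tree than the shared-parent test assumed here.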

Domains

Web, Machine Learning [cs.LG]

Main file

WI2006.pdf (204.99 KB)

Origin: Files produced by the author(s)

Contributor: Patrick Marty

https://inria.hal.science/inria-00581253

Submitted on: Wednesday, 30 March 2011, 15:07:14

Last modified on: Friday, 24 March 2023, 14:52:54

Long-term archiving on: Saturday, 3 December 2016, 06:11:12

Dates and versions

inria-00581253, version 1 (30-03-2011)

Identifiers

HAL Id: inria-00581253, version 1

Cite

Rémi Gilleron, Patrick Marty, Fabien Torre, Marc Tommasi. Interactive Tuples Extraction from Semi-Structured Data. Web Intelligence, Dec 2006, Hong Kong, China. ⟨inria-00581253⟩

Collections

UNIV-LILLE3 CNRS INRIA LIFL MOSTRARE INRIA2

137 views

375 downloads