Integrating Data from the Web by Machine-Learning Tree-Pattern Queries

Abstract: Efficient and reliable integration of web data requires building programs called wrappers. Writing wrappers by hand is tedious and error-prone. Constant changes in the web also mean that wrappers need to be constantly refactored. Machine learning has proven useful, but current techniques are either limited in expressivity, require non-intuitive user interaction, or do not allow for n-ary extraction. We study the use of tree patterns as an n-ary extraction language and propose an algorithm for learning such queries. It computes the most information-conservative tree pattern that generalizes two input trees. A notable aspect is that the approach can learn queries containing both child and descendant relationships between nodes. More importantly, the proposed approach does not require any labeling other than the data that the user actually wants to extract. The reported experiments show the effectiveness of the approach.
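
To make the idea of generalizing two input trees into one pattern concrete, here is a minimal sketch. It is not the algorithm from the paper: it assumes a purely positional alignment of children, uses a hypothetical `*` wildcard label, and ignores descendant edges and n-ary extraction labels entirely. The names `Node`, `generalize`, and `show` are illustrative only.

```python
from dataclasses import dataclass, field
from typing import List

WILDCARD = "*"  # hypothetical label meaning "any node label"


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def generalize(a: Node, b: Node) -> Node:
    """Build a pattern from two trees by aligning children position by position.

    Assumption: labels that differ become a wildcard, and children without a
    positional counterpart in the other tree are dropped from the pattern.
    """
    label = a.label if a.label == b.label else WILDCARD
    # zip() stops at the shorter child list, so unmatched trailing children vanish.
    kids = [generalize(x, y) for x, y in zip(a.children, b.children)]
    return Node(label, kids)


def show(n: Node) -> str:
    """Render a tree or pattern in a compact label(child, ...) notation."""
    if not n.children:
        return n.label
    return n.label + "(" + ", ".join(show(c) for c in n.children) + ")"


# Two small table rows, loosely modeled on HTML markup.
t1 = Node("tr", [Node("td", [Node("b")]), Node("td", [Node("i")])])
t2 = Node("tr", [Node("td", [Node("b")]), Node("td", [Node("a")]), Node("td", [Node("x")])])

print(show(generalize(t1, t2)))  # prints: tr(td(b), td(*))
```

The sketch keeps whatever the two examples share (the `tr` root, the `td` cells, the first `b`) and abstracts only where they disagree, which is the intuition behind an information-conservative generalization. A learner along the lines described in the abstract would additionally have to decide when a child edge should be relaxed into a descendant edge, which this positional version does not attempt.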
Document type:
Conference paper
Robert Meersman and Zahir Tari. 5th International Conference on Ontologies, Databases, and Applications of Semantics, 2006, Montpellier, France. Springer Verlag, 4275, pp.941-948, 2006, Lecture Notes in Computer Science

https://hal.inria.fr/inria-00536547
Contributor: Denis Debarbieux
Submitted on: Tuesday, 16 November 2010 - 13:53:22
Last modified on: Thursday, 11 January 2018 - 06:22:13

Identifiers

  • HAL Id : inria-00536547, version 1

Citation

Benjamin Habegger, Denis Debarbieux. Integrating Data from the Web by Machine-Learning Tree-Pattern Queries. Robert Meersman and Zahir Tari. 5th International Conference on Ontologies, Databases, and Applications of Semantics, 2006, Montpellier, France. Springer Verlag, 4275, pp.941-948, 2006, Lecture Notes in Computer Science. 〈inria-00536547〉
