Automatic Wrapper Induction from Hidden-Web Sources with Domain Knowledge

Marc Tommasi (1, 2, 3), Rémi Gilleron (2, 3, 1), Pierre Senellart (4), Avin Mittal, Daniel Muschick
1 MOSTRARE - Modeling Tree Structures, Machine Learning, and Information Extraction
LIFL - Laboratoire d'Informatique Fondamentale de Lille, Inria Lille - Nord Europe
4 GEMO - Integration of data and knowledge distributed over the web
LRI - Laboratoire de Recherche en Informatique, UP11 - Université Paris-Sud - Paris 11, Inria Saclay - Ile de France, CNRS - Centre National de la Recherche Scientifique : UMR8623
Abstract: We present an original approach to the automatic induction of wrappers for hidden-Web sources that requires no human supervision. The approach needs only domain knowledge, expressed as a set of concept names and concept instances. Extracting valuable data from a hidden-Web source involves two tasks: understanding the structure of a given HTML form and relating its fields to concepts of the domain, and understanding how the resulting records are represented in an HTML result page. For the former, we combine heuristics with probing of the form using domain instances; for the latter, we apply a supervised machine learning technique adapted to tree-like information to an automatic, imperfect, and imprecise annotation produced from the domain knowledge. We report experiments that demonstrate the validity and potential of the approach.
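As an illustration of the first task only (relating form fields to domain concepts), the following is a minimal sketch, not the authors' implementation. It assumes a label heuristic based on simple substring matching and a probing step that counts how many concept instances return at least one record. The names DOMAIN, label_candidates, probe_score, annotate_form, fake_submit, and FORM_LABELS, as well as the 0.5 threshold, are hypothetical and chosen for this example; fake_submit is a toy stand-in for submitting a real HTML form.

```python
from typing import Callable, Dict, List

# Hypothetical domain knowledge: concept names mapped to known instances.
DOMAIN: Dict[str, List[str]] = {
    "author": ["Knuth", "Turing", "Codd"],
    "title": ["Compilers", "Databases", "Algorithms"],
}


def label_candidates(label: str) -> List[str]:
    """Heuristic step: concepts whose name occurs in the field label."""
    text = label.lower()
    return [concept for concept in DOMAIN if concept in text]


def probe_score(field: str, concept: str,
                submit: Callable[[str, str], int]) -> float:
    """Probing step: fraction of the concept's instances that, when
    submitted in `field`, return at least one record."""
    instances = DOMAIN[concept]
    hits = sum(1 for value in instances if submit(field, value) > 0)
    return hits / len(instances)


def annotate_form(labels: Dict[str, str],
                  submit: Callable[[str, str], int],
                  threshold: float = 0.5) -> Dict[str, str]:
    """Map each field name to a concept: labels shortlist the candidates,
    probes confirm them. Fields below the threshold stay unmapped."""
    mapping: Dict[str, str] = {}
    for field, label in labels.items():
        # Fall back to every concept when the label gives no hint.
        candidates = label_candidates(label) or list(DOMAIN)
        scores = {c: probe_score(field, c, submit) for c in candidates}
        best = max(scores, key=scores.get)
        if scores[best] >= threshold:
            mapping[field] = best
    return mapping


def fake_submit(field: str, value: str) -> int:
    """Toy source (assumption): returns the number of records found."""
    index = {
        "q1": {"knuth", "turing", "codd"},
        "q2": {"compilers", "algorithms"},
    }
    return 1 if value.lower() in index.get(field, set()) else 0


# Hypothetical form: field name -> visible label.
FORM_LABELS = {"q1": "Author name", "q2": "Search"}

if __name__ == "__main__":
    print(annotate_form(FORM_LABELS, fake_submit))
```

In this toy setting the label alone identifies the first field, while the second field has an uninformative label and is resolved purely by probing. A real source would need a more robust test of whether a result page contains records.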
Document type:
Conference paper
International Workshop on Web Information and Data Management, Oct 2008, Napa, United States. ACM, pp.9-16, 2008

https://hal.inria.fr/inria-00337098
Contributor: Marc Tommasi
Submitted on: Thursday, 6 November 2008 - 09:39:57
Last modified on: Thursday, 5 April 2018 - 12:30:12

Identifiers

  • HAL Id: inria-00337098, version 1

Citation

Marc Tommasi, Rémi Gilleron, Pierre Senellart, Avin Mittal, Daniel Muschick. Automatic Wrapper Induction from Hidden-Web Sources with Domain Knowledge. International Workshop on Web Information and Data Management, Oct 2008, Napa, United States. ACM, pp.9-16, 2008. 〈inria-00337098〉
