Interactive Learning of Node Selecting Tree Transducers

Julien Carme 1 Rémi Gilleron 1 Aurélien Lemay 1 Joachim Niehren 1
1 MOSTRARE - Modeling Tree Structures, Machine Learning, and Information Extraction
LIFL - Laboratoire d'Informatique Fondamentale de Lille, Inria Lille - Nord Europe
Abstract : We develop new algorithms for learning monadic node selection queries in unranked trees from annotated examples, and apply them to visually interactive Web information extraction. We propose to represent monadic queries by bottom-up deterministic Node Selecting Tree Transducers NNSTs, a particular class of tree automata that we introduce. We prove that deterministic NNSTs capture the class of queries definable in monadic second order logic (MSO) in trees, which Gottlob and Koch (2002) argue to have the right expressiveness for Web information extraction, and prove that monadic queries defined by NNSTs can be answered efficiently. We present a new polynomial time algorithm in RPNI-style that learns monadic queries defined by deterministic NNSTs from completely annotated examples, where all selected nodes are distinguished. In practice, users prefer to provide partial annotations. We propose to account for partial annotations by intelligent tree pruning heuristics. We introduce pruning NSTTs - a formalism that shares many advantages of NSTTs. This leads us to an interactive learning algorithm for monadic queries defined by pruning NSTTs, which satisfies a new formal active learning model in the style of Angluin (1887). We have implemented our interactive learning algorithm and integrated it into a visually interactive Web information extraction system -- called SQUIRREL -- by plugging it into the Mozilla Web browser. Experiments on realistic Web documents confirm excellent quality with very few user interactions during wrapper induction.
Type de document :
Article dans une revue
Machine Learning, Springer Verlag, 2007, Machine Learning, 66 (1), pp.33-67. 〈10.1007/s10994-006-9613-8〉
Liste complète des métadonnées

Littérature citée [26 références]  Voir  Masquer  Télécharger

https://hal.inria.fr/inria-00087226
Contributeur : Joachim Niehren <>
Soumis le : vendredi 9 avril 2010 - 11:12:23
Dernière modification le : jeudi 11 janvier 2018 - 06:22:13
Document(s) archivé(s) le : vendredi 24 septembre 2010 - 12:24:11

Fichier

0.pdf
Fichiers produits par l'(les) auteur(s)

Identifiants

Collections

Citation

Julien Carme, Rémi Gilleron, Aurélien Lemay, Joachim Niehren. Interactive Learning of Node Selecting Tree Transducers. Machine Learning, Springer Verlag, 2007, Machine Learning, 66 (1), pp.33-67. 〈10.1007/s10994-006-9613-8〉. 〈inria-00087226v5〉

Partager

Métriques

Consultations de la notice

540

Téléchargements de fichiers

274