Word Embedding and Statistical Based Methods for Rapid Induction of Multiple Taxonomies

Lawrence Muchemi 1, Gregory Grefenstette 1, *
* Corresponding author
1 TAO - Machine Learning and Optimisation
LRI - Laboratoire de Recherche en Informatique, UP11 - Université Paris-Sud - Paris 11, Inria Saclay - Ile de France, CNRS - Centre National de la Recherche Scientifique : UMR8623
Abstract: In this paper we present two methodologies for rapidly inducing multiple subject-specific taxonomies from crawled data. The first builds the taxonomy from sentence-level word co-occurrence frequencies, while the second bootstraps a Word2Vec-based algorithm with a directed crawler. We exploit DMOZ, the multilingual open-content directory of the World Wide Web, to seed the crawl, and the domain name to direct it. The resulting domain corpus is then input to our algorithm, which automatically induces taxonomies. The induced taxonomies provide hierarchical semantic dimensions for faceted browsing. As part of an ongoing personal semantics project, we applied the resulting taxonomies to personal social media data (Twitter, Gmail, Facebook, Instagram, Flickr) with the objective of enhancing an individual's exploration of their personal information through faceted search. We also perform a comprehensive corpus-based evaluation of the algorithms on many datasets drawn from the fields of medicine (diseases) and leisure (hobbies) and show that the induced taxonomies are of high quality.
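The sentence-level co-occurrence idea named in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the function name `sentence_cooccurrence`, the naive sentence splitter, and the toy corpus are all assumptions made for illustration. It shows only the counting step, not how a taxonomy is derived from the counts.

```python
from collections import Counter
from itertools import combinations
import re


def sentence_cooccurrence(text, vocabulary):
    """Count how often pairs of vocabulary terms appear in the same sentence.

    Hypothetical helper for illustration; not taken from the paper.
    """
    vocab = {w.lower() for w in vocabulary}
    counts = Counter()
    # Naive sentence split on ., ! or ? (an assumption; a real pipeline
    # would likely use a proper sentence tokenizer).
    for sentence in re.split(r"[.!?]+", text):
        tokens = set(re.findall(r"[a-z]+", sentence.lower()))
        # Sort so each unordered pair is always stored under the same key.
        present = sorted(tokens & vocab)
        for a, b in combinations(present, 2):
            counts[(a, b)] += 1
    return counts


# Toy corpus, invented for this example.
corpus = (
    "Influenza causes fever and cough. "
    "Malaria causes fever. "
    "A cough is common with influenza."
)
pairs = sentence_cooccurrence(corpus, ["influenza", "malaria", "fever", "cough"])
for pair, n in pairs.most_common():
    print(pair, n)
```

Pairs that co-occur in many sentences are candidates for being related terms. How such counts are turned into parent-child links is described in the paper itself and is not reproduced here.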
Document type:
Preprint, working paper
2016

https://hal.inria.fr/hal-01334236
Contributor: Lawrence Muchemi
Submitted on: Monday, 20 June 2016 - 16:16:43
Last modified on: Thursday, 11 January 2018 - 01:49:38
Document(s) archived on: Thursday, 22 September 2016 - 19:30:43

File

IJAI2016LM_GG_4June2016.pdf
Files produced by the author(s)

License

Copyright (All rights reserved)

Identifiers

  • HAL Id: hal-01334236, version 1

Citation

Lawrence Muchemi, Gregory Grefenstette. Word Embedding and Statistical Based Methods for Rapid Induction of Multiple Taxonomies. 2016. 〈hal-01334236〉
