
Word Embedding and Statistical Based Methods for Rapid Induction of Multiple Taxonomies

Lawrence Muchemi 1, Gregory Grefenstette 1, *
* Corresponding author
1 TAO - Machine Learning and Optimisation
LRI - Laboratoire de Recherche en Informatique, UP11 - Université Paris-Sud - Paris 11, Inria Saclay - Ile de France, CNRS - Centre National de la Recherche Scientifique : UMR8623
Abstract : In this paper we present two methodologies for rapidly inducing multiple subject-specific taxonomies from crawled data. The first builds the taxonomy from sentence-level word co-occurrence frequencies, while the second bootstraps a Word2Vec-based algorithm with a directed crawler. We exploit DMOZ, the multilingual open-content directory of the World Wide Web, to seed the crawl, and the domain name to direct it. The resulting domain corpus is then input to our algorithm, which automatically induces taxonomies. The induced taxonomies provide hierarchical semantic dimensions for faceted browsing. As part of an ongoing personal semantics project, we applied the resulting taxonomies to personal social media data (Twitter, Gmail, Facebook, Instagram, Flickr) with the objective of enhancing an individual's exploration of their personal information through faceted search. We also perform a comprehensive corpus-based evaluation of the algorithms on many datasets drawn from the fields of medicine (diseases) and leisure (hobbies) and show that the induced taxonomies are of high quality.
Document type :
Preprints, Working Papers, ...

Cited literature [14 references]
Contributor : Lawrence Muchemi
Submitted on : Monday, June 20, 2016 - 4:16:43 PM
Last modification on : Saturday, June 25, 2022 - 10:20:54 PM
Long-term archiving on : Thursday, September 22, 2016 - 7:30:43 PM


Files produced by the author(s)




  • HAL Id : hal-01334236, version 1


Lawrence Muchemi, Gregory Grefenstette. Word Embedding and Statistical Based Methods for Rapid Induction of Multiple Taxonomies. 2016. ⟨hal-01334236⟩


