Word Embedding and Statistical Based Methods for Rapid Induction of Multiple Taxonomies

Lawrence Muchemi 1, Gregory Grefenstette 1, *
* Corresponding author
1 TAO - Machine Learning and Optimisation
CNRS - Centre National de la Recherche Scientifique : UMR8623, Inria Saclay - Ile de France, UP11 - Université Paris-Sud - Paris 11, LRI - Laboratoire de Recherche en Informatique
Abstract : In this paper we present two methods for rapidly inducing multiple subject-specific taxonomies from crawled data. The first builds the taxonomy from sentence-level word co-occurrence frequencies, while the second bootstraps a Word2Vec-based algorithm with a directed crawler. We exploit DMOZ, the multilingual open-content directory of the World Wide Web, to seed the crawl, and the domain name to direct it. The resulting domain corpus is then input to our algorithm, which can automatically induce taxonomies. The induced taxonomies provide hierarchical semantic dimensions for faceted browsing. As part of an ongoing personal semantics project, we applied the resulting taxonomies to personal social media data (Twitter, Gmail, Facebook, Instagram, Flickr) with the aim of enhancing an individual's exploration of their personal information through faceted search. We also perform a comprehensive corpus-based evaluation of the algorithms on many datasets drawn from the fields of medicine (diseases) and leisure (hobbies) and show that the induced taxonomies are of high quality.
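As a rough illustration of what a sentence-level co-occurrence approach can look like, the sketch below counts how often vocabulary terms appear together in the same sentence and then proposes parent/child links with a simple subsumption heuristic. The abstract does not describe how the taxonomy is actually derived from the counts, so the heuristic, the 0.8 threshold, the naive sentence splitter, and the function names `sentence_cooccurrence` and `subsumption_edges` are all assumptions made for illustration, not the authors' algorithm.

```python
import re
from collections import Counter
from itertools import combinations


def sentence_cooccurrence(text, vocabulary):
    """Count, per sentence, how often vocabulary terms occur and co-occur."""
    vocab = {term.lower() for term in vocabulary}
    term_counts = Counter()  # number of sentences containing each term
    pair_counts = Counter()  # number of sentences containing both terms
    # Naive sentence splitter (assumption): break after '.', '!' or '?'.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = set(re.findall(r"[a-z]+", sentence.lower()))
        present = sorted(tokens & vocab)
        term_counts.update(present)
        pair_counts.update(combinations(present, 2))
    return term_counts, pair_counts


def subsumption_edges(term_counts, pair_counts, threshold=0.8):
    """Propose (parent, child) edges with a subsumption heuristic.

    Hypothetical rule, not taken from the paper: x is a parent of y when
    x appears in at least `threshold` of the sentences containing y, and
    y is less predictive of x's sentences than the other way round.
    """
    edges = []
    for (a, b), both in pair_counts.items():
        p_a_given_b = both / term_counts[b]
        p_b_given_a = both / term_counts[a]
        if p_a_given_b >= threshold and p_b_given_a < p_a_given_b:
            edges.append((a, b))
        elif p_b_given_a >= threshold and p_a_given_b < p_b_given_a:
            edges.append((b, a))
    return sorted(edges)
```

On a real crawl the vocabulary would come from the domain corpus itself, and single-word matching would need to be extended to multi-word terms; both are left out here to keep the sketch short.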
Document type :
Preprints, Working Papers, ...

Cited literature [14 references]

https://hal.inria.fr/hal-01334236
Contributor : Lawrence Muchemi
Submitted on : Monday, June 20, 2016 - 4:16:43 PM
Last modification on : Friday, April 30, 2021 - 9:54:41 AM
Long-term archiving on : Thursday, September 22, 2016 - 7:30:43 PM

File

IJAI2016LM_GG_4June2016.pdf
Files produced by the author(s)

Licence


Copyright

Identifiers

  • HAL Id : hal-01334236, version 1

Citation

Lawrence Muchemi, Gregory Grefenstette. Word Embedding and Statistical Based Methods for Rapid Induction of Multiple Taxonomies. 2016. ⟨hal-01334236⟩

Metrics

Record views : 452
File downloads : 568