Conference papers

Determining the Characteristic Vocabulary for a Specialized Dictionary using Word2vec and a Directed Crawler

Gregory Grefenstette 1, Lawrence Muchemi 1
1 TAO - Machine Learning and Optimisation
CNRS - Centre National de la Recherche Scientifique : UMR8623, Inria Saclay - Ile de France, UP11 - Université Paris-Sud - Paris 11, LRI - Laboratoire de Recherche en Informatique
Abstract : Specialized dictionaries are used to understand concepts in specific domains, especially where those concepts are not part of the general vocabulary or have meanings that differ from ordinary language. The first step in creating a specialized dictionary is to detect the characteristic vocabulary of the domain in question. Classical methods for detecting this vocabulary involve gathering a domain corpus, calculating statistics on the terms found there, and then comparing these statistics to those of a background or general language corpus. Terms that are found significantly more often in the specialized corpus than in the background corpus are candidates for the characteristic vocabulary of the domain. Here we present two tools, a directed crawler and a distributional semantics package, that can be used together, circumventing the need for a background corpus. Both tools are available on the web.
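
The classical corpus-comparison method mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' tool or their proposed approach. It only shows the baseline they describe, under assumptions: the function names, the add-one smoothing, the simple lowercase tokenizer, and the thresholds `min_count` and `min_ratio` are all hypothetical choices made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphabetic tokens are enough for a toy example.
    return re.findall(r"[a-z]+", text.lower())


def characteristic_terms(domain_text, background_text, min_count=2, min_ratio=2.0):
    """Rank terms that are relatively more frequent in the domain corpus.

    Hypothetical helper: scores each domain term by the ratio of its
    smoothed relative frequency in the domain corpus to its smoothed
    relative frequency in the background corpus.
    """
    domain = Counter(tokenize(domain_text))
    background = Counter(tokenize(background_text))
    domain_total = sum(domain.values())
    background_total = sum(background.values())
    vocab_size = len(set(domain) | set(background))

    scored = []
    for term, count in domain.items():
        if count < min_count:
            continue  # skip terms too rare to score reliably
        # Add-one smoothing keeps terms unseen in the background from dividing by zero.
        p_domain = (count + 1) / (domain_total + vocab_size)
        p_background = (background[term] + 1) / (background_total + vocab_size)
        ratio = p_domain / p_background
        if ratio >= min_ratio:
            scored.append((term, ratio))
    # Highest ratio first: the strongest candidates for the domain vocabulary.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    domain_text = "the lexeme and the lemma the lexeme and the lemma headword"
    background_text = "the cat and the dog and the bird the cat and the dog"
    for term, ratio in characteristic_terms(domain_text, background_text):
        print(f"{term}\t{ratio:.2f}")
```

In practice a significance test such as log-likelihood is often used in place of a raw ratio. The sketch also makes visible the dependence on a background corpus that the paper's two tools are meant to avoid.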

https://hal.inria.fr/hal-01323591
Contributor : Gregory Grefenstette
Submitted on : Monday, May 30, 2016 - 5:04:29 PM
Last modification on : Wednesday, October 14, 2020 - 3:41:40 AM
Long-term archiving on : Wednesday, August 31, 2016 - 10:44:51 AM

Files

GLOBALEX_revised.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01323591, version 1
  • ARXIV : 1605.09564

Citation

Gregory Grefenstette, Lawrence Muchemi. Determining the Characteristic Vocabulary for a Specialized Dictionary using Word2vec and a Directed Crawler. GLOBALEX 2016: Lexicographic Resources for Human Language Technology, May 2016, Portoroz, Slovenia. ⟨hal-01323591⟩