Determining the Characteristic Vocabulary for a Specialized Dictionary using Word2vec and a Directed Crawler

Gregory Grefenstette (1), Lawrence Muchemi (1)
(1) TAO - Machine Learning and Optimisation
LRI - Laboratoire de Recherche en Informatique, UP11 - Université Paris-Sud - Paris 11, Inria Saclay - Ile de France, CNRS - Centre National de la Recherche Scientifique : UMR8623
Abstract: Specialized dictionaries are used to understand concepts in specific domains, especially where those concepts are not part of the general vocabulary or have meanings that differ from ordinary language. The first step in creating a specialized dictionary is detecting the characteristic vocabulary of the domain in question. Classical methods for detecting this vocabulary involve gathering a domain corpus, calculating statistics on the terms found there, and then comparing these statistics to those of a background or general-language corpus. Terms that are found significantly more often in the specialized corpus than in the background corpus are candidates for the characteristic vocabulary of the domain. Here we present two tools, a directed crawler and a distributional semantics package, that can be used together, circumventing the need for a background corpus. Both tools are available on the web.
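To make the classical comparison described in the abstract concrete, here is a minimal Python sketch. It is not the authors' method (their tools are meant to avoid the background corpus altogether). It only illustrates the baseline of scoring each domain term by how much more frequent it is in the domain corpus than in a background corpus. The function names, the add-one smoothing, the `min_count` threshold, and the toy texts in the usage note are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def characteristic_terms(domain_text, background_text, min_count=2, top_n=5):
    """Rank domain terms by smoothed relative-frequency ratio.

    A ratio above 1 means the term is relatively more frequent in the
    domain corpus than in the background corpus. This is a simple
    stand-in for the statistical tests used in practice.
    """
    domain = Counter(tokenize(domain_text))
    background = Counter(tokenize(background_text))
    domain_total = sum(domain.values())
    background_total = sum(background.values())
    vocab_size = len(set(domain) | set(background))

    scores = {}
    for term, count in domain.items():
        if count < min_count:
            continue  # skip rare terms, whose ratios are unreliable
        # Add-one smoothing (an assumption) avoids division by zero
        # for terms that never occur in the background corpus.
        p_domain = (count + 1) / (domain_total + vocab_size)
        p_background = (background[term] + 1) / (background_total + vocab_size)
        scores[term] = p_domain / p_background

    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

As a usage illustration, calling `characteristic_terms` with a short medical-sounding text as the domain corpus and a few everyday sentences as the background corpus should rank words such as "patient" and "fever" above function words such as "the". Real corpora would need far more text and a proper significance test.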
Document type: Conference paper
GLOBALEX 2016: Lexicographic Resources for Human Language Technology, May 2016, Portoroz, Slovenia. 2016

https://hal.inria.fr/hal-01323591
Contributor: Gregory Grefenstette
Submitted on: Monday, 30 May 2016 - 17:04:29
Last modified on: Thursday, 5 April 2018 - 12:30:12
Document(s) archived on: Wednesday, 31 August 2016 - 10:44:51

Files

GLOBALEX_revised.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-01323591, version 1
  • arXiv: 1605.09564

Citation

Gregory Grefenstette, Lawrence Muchemi. Determining the Characteristic Vocabulary for a Specialized Dictionary using Word2vec and a Directed Crawler. GLOBALEX 2016: Lexicographic Resources for Human Language Technology, May 2016, Portoroz, Slovenia. 2016. 〈hal-01323591〉
