Estimating the Number of Concepts - Inria - Institut national de recherche en sciences et technologies du numérique
Book chapter, Year: 2010

Estimating the Number of Concepts

Gregory Grefenstette

Abstract

Most Natural Language Processing systems have been built around the idea of a word being something found between white spaces and punctuation. This is a normal and efficient way to proceed. Tasks such as Word Sense Disambiguation, Machine Translation, or even indexing rarely go beyond the single word. Language models used in NLP applications are built on the word, with a few multiword expressions taken as exceptions. But future NLP systems will necessarily venture out into the uncharted areas of multiword expressions. The dimensions and the topology of multiword concepts are unknown: Are there hundreds of thousands or tens of millions? Which words participate in multiword concepts and which do not? As the corpus grows, will their number keep on increasing? In this paper, I estimate the number of multiword concepts that are used in English, systematically probing the Web as our corpus.
Main file
8._Grefenstette-libre (1).pdf (309.52 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-01081033, version 1 (06-11-2014)

Identifiers

  • HAL Id: hal-01081033, version 1

Cite

Gregory Grefenstette. Estimating the Number of Concepts. A Way with Words: Recent Advances in Lexical Theory and Analysis: A Festschrift for Patrick Hanks, Menha Publishers, 2010, 978-9970-10-101-6. ⟨hal-01081033⟩