Estimating the Number of Concepts

Abstract: Most Natural Language Processing systems have been built around the idea of a word being something found between white spaces and punctuation. This is a normal and efficient way to proceed. Tasks such as Word Sense Disambiguation, Machine Translation, or even indexing rarely go beyond the single word. Language models used in NLP applications are built on the word, with a few multiword expressions taken as exceptions. But future NLP systems will necessarily venture out into the uncharted areas of multiword expressions. The dimensions and the topology of multiword concepts are unknown: Are there hundreds of thousands or tens of millions? Which words participate in multiword concepts and which do not? As the corpus grows, will their number keep on increasing? In this paper, I estimate the number of multiword concepts that are used in English, systematically probing the Web as the corpus.
Document type:
Book chapter
A Way with Words: Recent Advances in Lexical Theory and Analysis: A Festschrift for Patrick Hanks, Menha Publishers, 2010, 978-9970-10-101-6

Cited literature [13 references]

https://hal.inria.fr/hal-01081033
Contributor: Gregory Grefenstette
Submitted on: Thursday, 6 November 2014 - 17:11:31
Last modified on: Thursday, 9 February 2017 - 15:47:09
Document(s) archived on: Saturday, 7 February 2015 - 11:17:03

File

8._Grefenstette-libre (1).pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-01081033, version 1

Citation

Gregory Grefenstette. Estimating the Number of Concepts. A Way with Words: Recent Advances in Lexical Theory and Analysis: A Festschrift for Patrick Hanks, Menha Publishers, 2010, 978-9970-10-101-6. 〈hal-01081033〉

Metrics

Record views

69

File downloads

435