
Estimating the Number of Concepts

Abstract : Most Natural Language Processing systems have been built around the idea of a word being something found between white spaces and punctuation. This is a normal and efficient way to proceed. Tasks such as Word Sense Disambiguation, Machine Translation, or even indexing rarely go beyond the single word. Language models used in NLP applications are built on the word, with a few multiword expressions taken as exceptions. But future NLP systems will necessarily venture out into the uncharted areas of multiword expressions. The dimensions and the topology of multiword concepts are unknown: Are there hundreds of thousands or tens of millions? Which words participate in multiword concepts and which do not? As the corpus grows, will their number keep on increasing? In this paper, I estimate the number of multiword concepts that are used in English, systematically probing the Web as our corpus.
Document type :
Book sections

Contributor : Gregory Grefenstette
Submitted on : Thursday, November 6, 2014 - 5:11:31 PM
Last modification on : Friday, February 4, 2022 - 3:08:58 AM
Long-term archiving on : Saturday, February 7, 2015 - 11:17:03 AM


8._Grefenstette-libre (1).pdf
Files produced by the author(s)


  • HAL Id : hal-01081033, version 1



Gregory Grefenstette. Estimating the Number of Concepts. A Way with Words: Recent Advances in Lexical Theory and Analysis: A Festschrift for Patrick Hanks, Menha Publishers, 2010, 978-9970-10-101-6. ⟨hal-01081033⟩


