hal-00753912, version 1
Semantic Clustering using Bag-of-Bag-of-Features
Ali-Reza Ebadat 1, Vincent Claveau 1, Pascale Sébillot 1
CORIA - COnférence en Recherche d'Information et Applications (2012) 229-244
Abstract: Computing distances between textual representations is at the heart of many Natural Language Processing tasks. Standard approaches originally developed for Information Retrieval are generally used; most often they rely on a bag-of-words (or bag-of-features) description with TF-IDF (or a variant) weighting, a vector representation, and classical similarity functions such as cosine. In this paper, we are interested in one such task, namely the semantic clustering of entities extracted from a text. We argue that, for this kind of task, better-suited representations and similarity measures can be used. In particular, we explore an alternative representation for entities called Bag-Of-Vectors (or Bag-of-Bags-of-Features). In this model, each entity is defined not as a single vector but as a set of vectors, in which each vector is built from the contextual features of one occurrence of the entity. In order to use Bag-Of-Vectors for clustering, we introduce new versions of classical similarity functions such as Cosine, Jaccard and Scalar Product. Experimentally, we show that the Bag-Of-Vectors representation always improves the clustering results compared to classical Bag-Of-Features representations.
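To make the contrast between the two representations concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the similarity functions are generalized, so the averaging of pairwise cosines in `bov_cosine`, the fixed context window, and all function names are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def context_vector(tokens, position, window=2):
    """Bag of features for ONE occurrence: words within `window` tokens."""
    left = tokens[max(0, position - window):position]
    right = tokens[position + 1:position + 1 + window]
    return Counter(left + right)


def bag_of_vectors(tokens, entity, window=2):
    """One context vector per occurrence of `entity` (a set of vectors)."""
    return [context_vector(tokens, i, window)
            for i, tok in enumerate(tokens) if tok == entity]


def bag_of_features(bag):
    """Classical representation: all occurrence vectors summed into one."""
    total = Counter()
    for vec in bag:
        total.update(vec)
    return total


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(w * v.get(f, 0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def bov_cosine(bag_a, bag_b):
    """ASSUMPTION: mean of all pairwise cosines between the two bags.

    The paper defines its own generalizations; this is only one plausible
    way to lift a vector similarity to sets of vectors.
    """
    if not bag_a or not bag_b:
        return 0.0
    total = sum(cosine(u, v) for u in bag_a for v in bag_b)
    return total / (len(bag_a) * len(bag_b))
```

With a tokenized text `tokens`, `bag_of_vectors(tokens, "paris")` keeps each occurrence's context separate, whereas `bag_of_features(...)` merges them into the single vector used by the classical approach; `bov_cosine` and `cosine` can then be compared on the same pair of entities.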
- 1: TEXMEX (INRIA - IRISA)
- CNRS: UMR6074 – INRIA – Institut National des Sciences Appliquées (INSA) - Rennes – Université de Rennes 1
- Domain: Computer Science/Document and Text Processing
- Keywords: vector representation – bag-of-bag-of-words – bag-of-vectors – similarity – clustering
- http://hal.archives-ouvertes.fr/hal-00753912
- oai:hal.archives-ouvertes.fr:hal-00753912
- Contributor: Pascale Sébillot
- Submitted on: Monday 19 November 2012, 19:35:35
- Last modified on: Thursday 17 January 2013, 13:17:58





