Extracting Hierarchical Topic Models from the Web for Improving Digital Archive Access

Gregory Grefenstette (1), Lawrence Muchemi (1)
(1) TAO - Machine Learning and Optimisation
LRI - Laboratoire de Recherche en Informatique, UP11 - Université Paris-Sud - Paris 11, Inria Saclay - Ile de France, CNRS - Centre National de la Recherche Scientifique: UMR8623
Abstract: Topic models provide a weighted list of terms specific to a given domain. For example, the terminology for painting, as a hobby, might include tools used in painting, such as brush, easel, and canvas, as well as more specific terms such as common oil colors: deep aquamarine, cerulean blue, zinc white. For clothing, a topic model should include words such as shoes, boots, socks, skirt, hats, as well as more specific terms such as tennis shoes, cocktail dress, and specific brands of shoes, hats, shirts, etc. In addition to containing the characteristic terms of a topic, a topic model also contains the relative frequency of each term's use in the topic text. This frequency is useful in information retrieval settings: when a large number of results are returned for a query, they can be ranked by pertinence using the relative frequency of the domain words they contain. Providing a hierarchical topic model also allows an information retrieval application to create facets (Tunkelang, 2009), or categories appearing in the result sets, with which the user can filter results, as on an online shopping site. One problem for many information retrieval platforms in digital humanities archives is the lack of topic models other than those already foreseen and implemented when the archive was first digitized. A researcher wishing to look at a collection or archive from a new angle has no means of exploiting a new topic model corresponding to his or her axis of research. This obstacle has two causes. (1) Technologically, the platform has to allow re-annotation of the underlying archive with a new topic model. This technological problem is solvable by implementing a suite of natural language processing tools that can access the textual descriptions of elements in the archive and identify terms from a new topic model there.
For example, the commonly used information retrieval platform Lucene (Grainger, 2014) allows the administrator to add new facet annotations to existing documents. A second, more difficult problem is (2) building a new topic model. When done manually, this is a time-consuming task, with no assurance of being complete or adequate unless great expense is incurred, as is the case for MeSH, a medical subject heading taxonomy (Coletti and Bleich, 2001), for which regular monthly meetings are held to maintain and update the terminology. For subjects less important to society, few such ontological resources exist. When topic models are created automatically, they can homogenize existing terminology (Newman et al., 2007) but often result in noise (Steyvers et al., 2004) that may seem excessive to some archivists.
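
As an illustration of how a weighted, hierarchical topic model could be used to re-annotate and rank archive descriptions, here is a minimal Python sketch. It is not the authors' method and does not use the Lucene API. The model contents, the weights, and the names `TOPIC_MODEL`, `count_terms`, `annotate`, and `rank` are all hypothetical, chosen only to mirror the painting example in the abstract.

```python
import re

# Hypothetical two-level topic model for "painting as a hobby":
# facet -> {term: relative frequency}. The weights are invented for illustration.
TOPIC_MODEL = {
    "tools": {"brush": 0.30, "easel": 0.15, "canvas": 0.25},
    "oil colors": {"deep aquamarine": 0.08, "cerulean blue": 0.10, "zinc white": 0.12},
}


def count_terms(text, terms):
    """Count whole-word occurrences of each (possibly multiword) term in text."""
    lowered = text.lower()
    counts = {}
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[term] = len(re.findall(pattern, lowered))
    return counts


def annotate(text, model=TOPIC_MODEL):
    """Return (score, facets) for one textual description.

    score  : sum over matched terms of occurrences * relative frequency
    facets : sorted names of the facets that had at least one matching term
    """
    score = 0.0
    facets = []
    for facet, weights in model.items():
        counts = count_terms(text, weights)
        facet_score = sum(weights[term] * n for term, n in counts.items())
        if facet_score > 0:
            facets.append(facet)
        score += facet_score
    return score, sorted(facets)


def rank(descriptions, model=TOPIC_MODEL):
    """Order descriptions by decreasing topic score, keeping their facets."""
    results = []
    for description in descriptions:
        score, facets = annotate(description, model)
        results.append((score, facets, description))
    # sorted() is stable, so descriptions with equal scores keep their input order
    return sorted(results, key=lambda item: item[0], reverse=True)


if __name__ == "__main__":
    archive = [
        "Minutes of the council meeting.",
        "Letter about a canvas tent.",
        "A brush and a canvas on an easel, with zinc white paint.",
    ]
    for score, facets, description in rank(archive):
        print(f"{score:.2f}  {facets}  {description}")
```

The facet labels returned by `annotate` play the role of the new facet annotations mentioned above, and the score is one simple way to order a large result set by the relative frequency of domain words. A real system would also need tokenization, stemming, and length normalization, which this sketch leaves out.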
Document type:
Conference paper
Expert Workshop on Topic Models and Corpus Analysis, Dec 2015, Dublin, Ireland


Contributor: Gregory Grefenstette
Submitted on: Wednesday, January 6, 2016 - 08:20:23
Last modified on: Thursday, January 11, 2018 - 06:22:14



  • HAL Id: hal-01251326, version 1


Gregory Grefenstette, Lawrence Muchemi. Extracting Hierarchical Topic Models from the Web for Improving Digital Archive Access. Expert Workshop on Topic Models and Corpus Analysis, Dec 2015, Dublin, Ireland. 〈hal-01251326〉


