HAL will be down for maintenance from Friday, June 10 at 4pm through Monday, June 13 at 9am. More information
Skip to Main content Skip to Navigation
Conference papers

Extracting Hierarchical Topic Models from the Web for Improving Digital Archive Access

Gregory Grefenstette 1 Lawrence Muchemi 1
1 TAO - Machine Learning and Optimisation
CNRS - Centre National de la Recherche Scientifique : UMR8623, Inria Saclay - Ile de France, UP11 - Université Paris-Sud - Paris 11, LRI - Laboratoire de Recherche en Informatique
Abstract : Topic models provide a weighted list of terms specific to a given domain. For example, the terminology for painting, as a hobby, might include specific tools user in painting such brush, easel, canvas, as well as more specific terms such a common oil colors: deep aquamarine, cerulean blue, zinc white. For clothing, a topic model should include words such as shoes, boots, socks, skirt, hats, as well as more specific terms such as tennis shoes, cocktail dress, and specific brands of shoes, hats, shirts, etc. In addition to containing the characteristic terms of a topic, a topic model also contains the relative frequency of each term's use in the topic text. This frequency is useful in information retrieval settings; when a large number of results are returned for a query, they can be ordered by pertinence using the relative frequency of domain words to rank the responses. Providing a hierarchic topic model also allows an information retrieval application to create facets (Tunkelang, 2009), or categories appearing the result sets, with which the user can filter results, as on an online shopping site. One problem for many information retrieval platforms in digital humanity archives is the lack of topic models, other than those already foreseen and implemented when the archive was first digitized. A researcher wishing to look at a collection or archive from a new angle has no means of exploiting a new topic model corresponding to his or her axis of research. This obstacle has two causes: (1) technologically, the platform has to allow a re-annotation of the underlying archive with a new topic model. This technological problem is solvable by implementing a suite of natural language processing tools that can access the description of the textual description of elements in the archive, and identify there terms from a new topic model. For example, the commonly used information retrieval platform Lucene (Grainger, 2014) allows the administrator to add new facet annotations to existing documents. A second, more difficult problem is (2) building a new topic model. When done manually, this is a time-consuming task, with no assurance of being complete or adequate, unless great expense is outlayed, as is the case for MeSH, a medical subject heading taxonomy (Coletti and Bleich, 2001), for which regular monthly meetings are held for maintaining and updating the terminology. For subjects less important for society, few such ontological resources exist. When topic models are created automatically they can homogenize existing terminology (Newman et al, 2007) but often result in noise (Steyvers, et al., 2004) that may seem excessive to some archivists.
Document type :
Conference papers
Complete list of metadata

Cited literature [6 references]  Display  Hide  Download

Contributor : Gregory Grefenstette Connect in order to contact the contributor
Submitted on : Wednesday, January 6, 2016 - 8:20:23 AM
Last modification on : Thursday, April 7, 2022 - 3:09:01 AM



  • HAL Id : hal-01251326, version 1


Gregory Grefenstette, Lawrence Muchemi. Extracting Hierarchical Topic Models from the Web for Improving Digital Archive Access. Expert Workshop on Topic Models and Corpus Analysis, DARIAH Text & Data Analytics Working Group, Dec 2015, Dublin, Ireland. ⟨hal-01251326⟩



Record views


Files downloads