Extracting Hierarchical Topic Models from the Web for Improving Digital Archive Access

Gregory Grefenstette; Lawrence Muchemi

Conference paper, Year: 2015

Gregory Grefenstette

Role: Author
PersonId: 2537
IdHAL: gregory-grefenstette
ORCID: 0000-0001-8479-049X
IdRef: 075539381

Machine Learning and Optimisation

Lawrence Muchemi

Role: Author
PersonId: 974349

Machine Learning and Optimisation

Abstract

Topic models provide a weighted list of terms specific to a given domain. For example, the terminology for painting, as a hobby, might include specific tools used in painting, such as brush, easel, and canvas, as well as more specific terms such as common oil colors: deep aquamarine, cerulean blue, zinc white. For clothing, a topic model should include words such as shoes, boots, socks, skirt, and hats, as well as more specific terms such as tennis shoes and cocktail dress, and specific brands of shoes, hats, shirts, etc. In addition to containing the characteristic terms of a topic, a topic model also contains the relative frequency of each term's use in the topic text. This frequency is useful in information retrieval settings: when a large number of results are returned for a query, they can be ordered by pertinence, using the relative frequency of domain words to rank the responses. Providing a hierarchical topic model also allows an information retrieval application to create facets (Tunkelang, 2009), or categories appearing in the result sets, with which the user can filter results, as on an online shopping site. One problem for many information retrieval platforms in digital humanities archives is the lack of topic models other than those already foreseen and implemented when the archive was first digitized. A researcher wishing to look at a collection or archive from a new angle has no means of exploiting a new topic model corresponding to his or her axis of research. This obstacle has two causes. (1) Technologically, the platform has to allow re-annotation of the underlying archive with a new topic model. This technological problem is solvable by implementing a suite of natural language processing tools that can access the textual descriptions of elements in the archive and identify in them terms from a new topic model.
For example, the commonly used information retrieval platform Lucene (Grainger, 2014) allows the administrator to add new facet annotations to existing documents. A second, more difficult problem is (2) building a new topic model. When done manually, this is a time-consuming task, with no assurance of being complete or adequate unless great expense is incurred, as is the case for MeSH, a medical subject heading taxonomy (Coletti and Bleich, 2001), for which regular monthly meetings are held to maintain and update the terminology. For subjects less important to society, few such ontological resources exist. When topic models are created automatically, they can homogenize existing terminology (Newman et al., 2007) but often result in noise (Steyvers et al., 2004) that may seem excessive to some archivists.
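As a rough illustration of how a weighted, hierarchical topic model could serve both uses the abstract mentions (ranking results and producing facets), here is a minimal Python sketch. It is not the authors' method. The facet paths, the weights, and the function names `annotate`, `rank`, and `facet_counts` are assumptions made up for this example; only the terms themselves are taken from the abstract. Scores are not normalized by document length, which a real system would probably want.

```python
import re
from collections import Counter

# Hypothetical hierarchical topic model: facet path -> {term: relative frequency}.
# Terms come from the abstract's examples; the weights are invented for illustration.
TOPIC_MODEL = {
    "painting/tools": {"brush": 0.40, "easel": 0.25, "canvas": 0.35},
    "painting/oil colors": {
        "cerulean blue": 0.5,
        "zinc white": 0.3,
        "deep aquamarine": 0.2,
    },
}


def term_count(text, term):
    """Count whole-word, case-insensitive occurrences of a term in a text."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text.lower()))


def annotate(text, model=TOPIC_MODEL):
    """Return {facet: score}, where score sums weight * occurrences per term."""
    scores = {}
    for facet, terms in model.items():
        score = sum(weight * term_count(text, term) for term, weight in terms.items())
        if score > 0:
            scores[facet] = score
    return scores


def rank(documents, model=TOPIC_MODEL):
    """Order matching documents by total topic score, highest first."""
    scored = [(sum(annotate(doc, model).values()), doc) for doc in documents]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


def facet_counts(documents, model=TOPIC_MODEL):
    """Count how many documents fall under each facet, for result filtering."""
    counts = Counter()
    for doc in documents:
        counts.update(annotate(doc, model).keys())
    return dict(counts)


if __name__ == "__main__":
    archive = [
        "Tubes of zinc white and cerulean blue.",
        "A brush and an easel next to a canvas.",
        "A letter about the harvest.",
    ]
    print(rank(archive))
    print(facet_counts(archive))
```

Re-annotating an archive with a new topic model then amounts to running `annotate` over each item's textual description with a different `model` argument, and storing the resulting facets alongside the item.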

Domains

Computation and Language [cs.CL]

TopicModelExperience.pdf (81.33 KB)

Format: Short paper
Origin: Files produced by the author(s)

https://inria.hal.science/hal-01251326

Submitted on: Wednesday, January 6, 2016, 08:20:23

Last modified on: Tuesday, February 13, 2024, 03:25:12

Dates and versions

hal-01251326, version 1 (06-01-2016)

Identifiers

HAL Id: hal-01251326, version 1

Cite

Gregory Grefenstette, Lawrence Muchemi. Extracting Hierarchical Topic Models from the Web for Improving Digital Archive Access. Expert Workshop on Topic Models and Corpus Analysis, DARIAH Text & Data Analytics Working Group, Dec 2015, Dublin, Ireland. ⟨hal-01251326⟩

Collections

UNIV-RENNES1 CNRS INRIA IRISA UMR8623 CENTRALESUPELEC INRIA2 LRI-AO UR1-MATH-STIC UNIV-PARIS-SACLAY UR1-UFR-ISTIC UNIV-RENNES UR1-MATH-NUM GS-COMPUTER-SCIENCE
