Conference paper. Year: 2021

WEIR-P: An Information Extraction Pipeline for the Wastewater Domain

Nanée Chahinian
Laurent Deruelle
  • Role: Author

Abstract

Urbanization has increased steadily over the past century (UN, 2018), and city managers have had to continually extend water access and sanitation services to new peripheral areas. Originally these networks were installed, operated, and repaired by their owners (Rogers et al. 2012). However, as concessions were increasingly granted to private companies and public authorities regularly issued new tenders, archives were sometimes misplaced and event logs were lost. Part of the networks’ operational history was therefore thought to be permanently erased. The advent of Web big data and text-mining techniques may make it possible to recover some of this knowledge by crawling secondary information sources, i.e. documents available on the Web. Insight might thus be gained into the wastewater collection scheme, the treatment processes, the network’s geometry, and the events (accidents, shortages) that may have affected these facilities and amenities. The primary aim of the "Megadata, Linked Data and Data Mining for Wastewater Networks" (MeDo) project (http://webmedo.msem.univ-montp2.fr/?page_id=223&lang=en) is to develop resources for text mining and information extraction in the wastewater domain. We developed a dedicated Natural Language Processing (NLP) pipeline named WEIR-P (WastewatEr InfoRmation extraction Platform), which allows users to retrieve relevant documents for a given network, process them to extract potentially new information, assess this information (including through an interactive visualization), and add it to a pre-existing knowledge base. The system identifies the entities and relations to be extracted from texts, pertaining to network information, wastewater treatment, accidents and works, organizations, spatio-temporal information, measures, and water quality. We present and evaluate the first version of the NLP system.
The preliminary results obtained on the Montpellier corpus (1,557 HTML and PDF documents in French) are encouraging and show how a mix of Machine Learning approaches and rule-based techniques can be used to extract useful information and reconstruct the various phases of the extension of a given wastewater network. While the NLP and Information Extraction (IE) methods used are state of the art, the novelty of our work lies in their adaptation to the domain, and in particular in the wastewater management conceptual model, which defines the relations between entities.
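To make the rule-based side of such a pipeline concrete, here is a minimal sketch in Python of pattern-based entity extraction from a French sentence. It is not the WEIR-P implementation: the `Entity` class, the `MEASURE` and `DATE` labels, the two regular expressions, and the sample sentence are all assumptions introduced for illustration, and a real system would combine rules like these with trained models and a much richer set of entity types.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    label: str  # hypothetical entity type, e.g. "MEASURE" or "DATE"
    text: str   # surface form as it appears in the document
    start: int  # character offset where the match begins


# Hypothetical hand-written rules: a number followed by a unit, and a year.
# French decimals use a comma, so both "." and "," are accepted.
PATTERNS = {
    "MEASURE": re.compile(r"\b\d+(?:[.,]\d+)?\s?(?:km|m3|mm|m)\b"),
    "DATE": re.compile(r"\b(?:19|20)\d{2}\b"),
}


def extract_entities(text):
    """Return all rule matches in the text, ordered by position."""
    entities = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append(Entity(label, match.group(0), match.start()))
    return sorted(entities, key=lambda e: e.start)


# Invented sample sentence, not taken from the Montpellier corpus.
sample = "En 2015, le réseau a été étendu de 12,5 km vers la station d'épuration."
found = extract_entities(sample)
for entity in found:
    print(entity.label, entity.text)
```

Keeping each match's character offset is a deliberate choice in this sketch: it lets a later step pair a date with the nearest measure or network entity when building relations, which is the kind of information needed to reconstruct the phases of a network extension.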
Origin: Files produced by the author(s)

Dates and versions

hal-03161715, version 1 (08-03-2021)

Cite

Nanée Chahinian, Thierry Bonnabaud La Bruyère, Serge Conrad, Carole Delenne, Francesca Frontini, et al. WEIR-P: An Information Extraction Pipeline for the Wastewater Domain. EGU General Assembly 2021, Apr 2021, Virtual, France. ⟨10.5194/egusphere-egu21-2708⟩. ⟨hal-03161715⟩