Discovering Indicators for Classifying Wikipedia Articles in a Domain: A Case Study on Software Languages

Marcel Heinz 1, Ralf Lämmel 1, Mathieu Acher 2
2 DiverSe - Diversity-centric Software Engineering, Inria Rennes – Bretagne Atlantique, IRISA-D4 - LANGAGE ET GÉNIE LOGICIEL
Abstract: Wikipedia is a rich source of information across many knowledge domains. Yet recovering the articles relevant to a specific domain is difficult, since such articles may be rare and tend to cover multiple topics. Furthermore, Wikipedia's categories classify articles ambiguously, because they relate to all topics, and are thus of limited use. In this paper, we develop a new methodology to isolate Wikipedia articles that describe a specific topic within the scope of relevant categories; the methodology uses supervised machine learning to learn a decision tree classifier based on article features (URL patterns, summary text, infoboxes, links from list articles). In a case study, we retrieve 3000+ articles that describe software (computer) languages. Available fragments of ground truth serve as an essential part of the training set for detecting relevant articles. The classification results are thoroughly evaluated through a survey in which 31 domain experts participated.
Document type: Conference papers


https://hal.inria.fr/hal-02129131
Contributor: Mathieu Acher
Submitted on: Tuesday, May 14, 2019 - 4:03:04 PM
Last modified on: Friday, September 13, 2019 - 9:48:41 AM


Identifiers

  • HAL Id: hal-02129131, version 1

Citation

Marcel Heinz, Ralf Lämmel, Mathieu Acher. Discovering Indicators for Classifying Wikipedia Articles in a Domain: A Case Study on Software Languages. SEKE 2019 - The 31st International Conference on Software Engineering and Knowledge Engineering, Jul 2019, Lisbon, Portugal. pp.1-6. ⟨hal-02129131⟩

