
Discovering Indicators for Classifying Wikipedia Articles in a Domain: A Case Study on Software Languages

Marcel Heinz 1 Ralf Lämmel 1 Mathieu Acher 2
2 DiverSe - Diversity-centric Software Engineering
Inria Rennes – Bretagne Atlantique, IRISA-D4 - LANGAGE ET GÉNIE LOGICIEL
Abstract: Wikipedia is a rich source of information across many knowledge domains. Yet recovering the articles relevant to a specific domain is difficult, since such articles may be rare and tend to cover multiple topics. Furthermore, Wikipedia's categories classify articles ambiguously, as they relate to all topics, and are thus of limited use. In this paper, we develop a new methodology to isolate the Wikipedia articles that describe a specific topic within the scope of relevant categories; the methodology uses supervised machine learning to obtain a decision tree classifier based on articles' features (URL patterns, summary text, infoboxes, links from list articles). In a case study, we retrieve 3000+ articles that describe software (computer) languages. Available fragments of ground truth serve as an essential part of the training set for detecting relevant articles. The classification results are thoroughly evaluated through a survey in which 31 domain experts participated.
Document type: Conference papers

Cited literature [21 references]
Contributor : Mathieu Acher
Submitted on : Tuesday, May 14, 2019 - 4:03:04 PM
Last modification on : Wednesday, November 3, 2021 - 8:13:42 AM




  • HAL Id : hal-02129131, version 1


Marcel Heinz, Ralf Lämmel, Mathieu Acher. Discovering Indicators for Classifying Wikipedia Articles in a Domain: A Case Study on Software Languages. SEKE 2019 - The 31st International Conference on Software Engineering and Knowledge Engineering, Jul 2019, Lisbon, Portugal. pp.1-6. ⟨hal-02129131⟩


