Extracting Comparable Articles from Wikipedia and Measuring their Comparabilities

Motaz Saad 1, David Langlois 1, Kamel Smaïli 1
1 PAROLE - Analysis, perception and recognition of speech
Inria Nancy - Grand Est, LORIA - NLPKD - Department of Natural Language Processing & Knowledge Discovery
Abstract: Parallel corpora are not available for all domains and languages, but statistical methods in multilingual research domains require huge parallel or comparable corpora. Comparable corpora can be used when parallel corpora are insufficient or unavailable for specific domains and languages. In this paper, we propose a method to extract all comparable articles from Wikipedia for multiple languages based on interlanguage links. We also extract comparable articles from the Euro News website, and we present two comparability measures (CM) to compute the degree of comparability of multilingual articles. We extracted about 40K and 34K comparable articles from Wikipedia and Euro News respectively, in three languages: Arabic, French, and English. Experimental results show that our measure can capture the comparability of multilingual corpora and makes it possible to retrieve articles in different languages concerning the same topic.
Document type:
Journal article
Procedia - Social and Behavioral Sciences, Elsevier, 2013, 95, pp.40-47. 〈10.1016/j.sbspro.2013.10.620〉

Cited literature: 12 references

https://hal.inria.fr/hal-00907442
Contributor: Motaz Saad
Submitted on: Monday, 20 November 2017 - 10:13:42
Last modified on: Thursday, 11 January 2018 - 06:25:24

File

saad-comparable.pdf
Files produced by the author(s)

Citation

Motaz Saad, David Langlois, Kamel Smaïli. Extracting Comparable Articles from Wikipedia and Measuring their Comparabilities. Procedia - Social and Behavioral Sciences, Elsevier, 2013, 95, pp.40-47. 〈10.1016/j.sbspro.2013.10.620〉. 〈hal-00907442〉
