Extracting Comparable Articles from Wikipedia and Measuring their Comparabilities

Motaz Saad; David Langlois; Kamel Smaïli

doi:10.1016/j.sbspro.2013.10.620

Journal article, Procedia - Social and Behavioral Sciences, Year: 2013


Motaz Saad

Role: Author
PersonId: 770032
ORCID: 0000-0002-1080-7276
IdRef: 183539524

Analysis, perception and recognition of speech

David Langlois

Role: Author
PersonId: 298
IdHAL: david-langlois
IdRef: 070239509

Analysis, perception and recognition of speech

Kamel Smaïli

Role: Author
PersonId: 2521
IdHAL: kamel-smaili
IdRef: 034429700

Analysis, perception and recognition of speech

Abstract

Parallel corpora are not available for all domains and languages, but statistical methods in multilingual research domains require huge parallel or comparable corpora. Comparable corpora can be used when parallel corpora are insufficient or unavailable for specific domains and languages. In this paper, we propose a method to extract all comparable articles from Wikipedia for multiple languages based on interlanguage links. We also extract comparable articles from the Euro News website, and we present two comparability measures (CM) to compute the degree of comparability of multilingual articles. We extracted about 40K and 34K comparable articles from Wikipedia and Euro News respectively, in three languages: Arabic, French, and English. Experimental results of the comparability measures show that our measure can capture the comparability of multilingual corpora and allow the retrieval of articles in different languages concerning the same topic.
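
As an illustration of the interlanguage-link idea in the abstract, the following is a minimal sketch, assuming a hypothetical in-memory table that maps each English title to its linked titles in other languages. The table `LANGLINKS`, the function `align_articles`, and the sample titles are illustrative assumptions; this page does not describe the authors' actual extraction pipeline or data format.

```python
from typing import Dict, List, Sequence

# Hypothetical toy data (an assumption, not taken from the paper):
# each English title maps to its interlanguage links, keyed by language code.
# The titles are illustrative and not verified against Wikipedia.
LANGLINKS: Dict[str, Dict[str, str]] = {
    "Moon": {"fr": "Lune", "ar": "قمر"},
    "Sun": {"fr": "Soleil"},  # no Arabic link, so it is dropped below
    "Water": {"fr": "Eau", "ar": "ماء"},
}


def align_articles(
    langlinks: Dict[str, Dict[str, str]],
    source: str = "en",
    targets: Sequence[str] = ("fr", "ar"),
) -> List[Dict[str, str]]:
    """Keep only articles that have a linked version in every target language."""
    aligned = []
    for title, links in langlinks.items():
        if all(lang in links for lang in targets):
            record = {source: title}
            record.update({lang: links[lang] for lang in targets})
            aligned.append(record)
    return aligned


if __name__ == "__main__":
    for record in align_articles(LANGLINKS):
        print(record)
```

Each returned record groups the titles of one topic across the three languages; articles lacking a link in any target language are skipped. Scoring how comparable the grouped articles are would be a separate step that this sketch does not attempt.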

Keywords

comparability measure; computational linguistics; comparable corpora

Domains

Document and Text Processing; Artificial Intelligence [cs.AI]

Main file

saad-comparable.pdf (278.4 KB)

Origin: Files produced by the author(s)

Contributor: Motaz Saad

https://inria.hal.science/hal-00907442

Submitted on: Monday, 20 November 2017, 10:13:42

Last modified on: Monday, 11 September 2023, 17:41:19

Long-term archiving on: Wednesday, 21 February 2018, 12:31:24

Dates and versions

hal-00907442, version 1 (20-11-2017)

Identifiers

HAL Id: hal-00907442, version 1
DOI: 10.1016/j.sbspro.2013.10.620

Cite

Motaz Saad, David Langlois, Kamel Smaïli. Extracting Comparable Articles from Wikipedia and Measuring their Comparabilities. Procedia - Social and Behavioral Sciences, 2013, 95, pp.40-47. ⟨10.1016/j.sbspro.2013.10.620⟩. ⟨hal-00907442⟩


Collections

CNRS INRIA UNIV-LORRAINE INRIA2 LORIA LORIA-NLPKD

198 views

160 downloads
