A language-independent and fully unsupervised approach to lexicon induction and part-of-speech tagging for closely related languages

Yves Scherrer 1, Benoît Sagot 2
1 LATL-CUI
LATL - Laboratoire d'Analyse et de Technologie du Langage
2 ALPAGE - Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing
Inria Paris-Rocquencourt, UPD7 - Université Paris Diderot - Paris 7
Abstract: In this paper, we describe our generic approach for transferring part-of-speech annotations from a resourced language to an etymologically closely related non-resourced language, without using any bilingual (i.e., parallel) data. First, we induce a translation lexicon from monolingual corpora, based on cognate detection followed by cross-lingual contextual similarity. Second, POS information is transferred from the resourced language along translation pairs to the non-resourced language and used for tagging the corpus. We evaluate our methods on three language families, consisting of five Romance languages, three Germanic languages and five Slavic languages. We obtain tagging accuracies of up to 91.6%.
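The two steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which similarity measure is used, so `difflib.SequenceMatcher` stands in for cognate detection here, the cross-lingual contextual similarity step is left out entirely, and the function names (`induce_lexicon`, `transfer_tags`, `tag_corpus`), the 0.7 threshold and the toy Spanish/Portuguese word lists are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def induce_lexicon(source_words, target_words, threshold=0.7):
    """Pair each target word with its most similar source word.

    Assumption: surface string similarity is a stand-in for cognate
    detection; the contextual similarity step is omitted.
    """
    source_words = list(source_words)
    lexicon = {}
    for tgt in target_words:
        best, best_score = None, threshold
        for src in source_words:
            score = SequenceMatcher(None, tgt, src).ratio()
            if score >= best_score:
                best, best_score = src, score
        if best is not None:
            lexicon[tgt] = best
    return lexicon


def transfer_tags(lexicon, source_tags):
    """Copy the POS tag of each source word to its paired target word."""
    return {tgt: source_tags[src] for tgt, src in lexicon.items() if src in source_tags}


def tag_corpus(tokens, target_tags, unknown="UNK"):
    """Tag target-language tokens by lookup; unmatched words get `unknown`."""
    return [(tok, target_tags.get(tok, unknown)) for tok in tokens]


# Hypothetical toy data: Spanish as the resourced language,
# Portuguese as the non-resourced one.
source_tags = {"casa": "NOUN", "cantar": "VERB", "blanco": "ADJ"}
target_words = ["casa", "cantar", "branco", "e"]

lexicon = induce_lexicon(source_tags, target_words)
target_tags = transfer_tags(lexicon, source_tags)
tagged = tag_corpus(["casa", "branco", "e"], target_tags)
```

With this toy data, `branco` is paired with `blanco` and inherits its tag, while `e` has no sufficiently similar source word and stays untagged. A pure lookup tagger like this ignores ambiguity and sentence context, which a real system would need to handle.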
Document type:
Conference paper
Language Resources and Evaluation Conference, May 2014, Reykjavik, Iceland. 2014
Cited literature: 18 references

https://hal.inria.fr/hal-01022298
Contributor: Benoît Sagot
Submitted on: Thursday, 10 July 2014 - 11:32:53
Last modified on: Saturday, 9 June 2018 - 10:30:05
Document(s) archived on: Friday, 10 October 2014 - 11:27:13

File

lrec14cll.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-01022298, version 1

Citation

Yves Scherrer, Benoît Sagot. A language-independent and fully unsupervised approach to lexicon induction and part-of-speech tagging for closely related languages. Language Resources and Evaluation Conference, May 2014, Reykjavik, Iceland. 2014. 〈hal-01022298〉
