Mining monolingual and bilingual corpora

Chiraz Latiri; Kamel Smaïli; Caroline Lavecchia; David Langlois

Journal article, Intelligent Data Analysis, Year: 2010

Chiraz Latiri

Role: Author
PersonId : 763545
ORCID : 0000-0001-9728-0683

Unité de Recherche en Programmation Algorithmique et Heuristique

Kamel Smaïli

Role: Author
PersonId : 2521
IdHAL : kamel-smaili
IdRef : 034429700

Analysis, perception and recognition of speech

Caroline Lavecchia

Role: Author

Analysis, perception and recognition of speech

David Langlois

Role: Author
PersonId : 298
IdHAL : david-langlois
IdRef : 070239509

Analysis, perception and recognition of speech

Abstract

In this paper, we describe two new methods for mining monolingual and bilingual text corpora, both of which rely heavily on association rules and triggers. The association-rule-based method is first applied to query expansion. Experiments conducted on French newspapers and on a set of scientific documents show that the proposed approach outperforms the baseline model. The second method focuses on machine translation and is motivated by the results obtained with triggers in statistical language modeling. To build a translation table, association rules and triggers are generalized to mine bilingual corpora; to this end, we propose the concepts of inter-lingual association rules and inter-lingual triggers, respectively. Both methods have been integrated into a real statistical machine translation system. The experiments carried out highlight the practical feasibility of the introduced approaches in the context of machine translation and show that inter-lingual triggers achieve better results than those obtained using the third IBM model.
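
The abstract does not spell out how inter-lingual triggers are computed, so the following is only a minimal sketch of the general idea under stated assumptions: a source word "triggers" the target words with which it co-occurs most informatively across aligned sentence pairs. The scoring function (a single joint-probability-weighted log-ratio term, a simplification of mutual information), the toy corpus, and the names `interlingual_triggers`, `toy_pairs`, and `top_k` are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import product


def interlingual_triggers(pairs, top_k=1):
    """Return, for each source word, its top_k best-scoring target words.

    pairs: list of (source_sentence, target_sentence) strings, assumed to be
    sentence-aligned. The score is an ASSUMED simplification of mutual
    information: P(s, t) * log(P(s, t) / (P(s) * P(t))), where probabilities
    are estimated as fractions of sentence pairs containing the word(s).
    """
    n = len(pairs)
    src_count, tgt_count, joint_count = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        # Count each word at most once per sentence pair.
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        joint_count.update(product(src_words, tgt_words))

    candidates = {}
    for (s, t), c in joint_count.items():
        p_st = c / n
        p_s = src_count[s] / n
        p_t = tgt_count[t] / n
        score = p_st * math.log(p_st / (p_s * p_t))
        candidates.setdefault(s, []).append((score, t))

    # Sort by descending score, breaking ties alphabetically for determinism.
    return {
        s: [t for _, t in sorted(cands, key=lambda x: (-x[0], x[1]))[:top_k]]
        for s, cands in candidates.items()
    }


# Hypothetical toy parallel corpus (French -> English), for illustration only.
toy_pairs = [
    ("le chat dort", "the cat sleeps"),
    ("le chien dort", "the dog sleeps"),
    ("le chat mange", "the cat eats"),
    ("un chien mange", "a dog eats"),
]

triggers = interlingual_triggers(toy_pairs, top_k=1)
for word in ("chat", "chien", "dort", "mange"):
    print(word, "->", triggers[word])
```

On this toy corpus the best-scoring target word for each content word is its translation, which suggests how such a score could seed a translation table; a real system would need a much larger corpus and the full scoring and filtering described in the paper itself.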

Keywords

BLEU score; Statistical machine translation; Inter-lingual triggers; Triggers; Generic basis; Association rule; Galois closure operator; Formal Concept Analysis

Domains

Artificial intelligence [cs.AI]; Document and text processing; Information retrieval [cs.IR]

Main file

Revue2-13.pdf (318.02 KB)

https://inria.hal.science/inria-00545493

Submitted on: Tuesday, November 21, 2017, 15:03:13

Last modified on: Friday, March 24, 2023, 14:53:05

Dates and versions

inria-00545493, version 1 (21-11-2017)

Identifiers

HAL Id: inria-00545493, version 1

Cite

Chiraz Latiri, Kamel Smaïli, Caroline Lavecchia, David Langlois. Mining monolingual and bilingual corpora. Intelligent Data Analysis, 2010, 14 (6), pp.663-682. ⟨inria-00545493⟩

Collections

CNRS INRIA UNIV-LORRAINE INRIA2 LORIA

267 views

138 downloads
