A Word Clustering Approach to Domain Adaptation: Effective Parsing of Biomedical Texts

Marie Candito; Enrique Henestroza Anguiano; Djamé Seddah

Conference paper, Year: 2011

Marie Candito

Role: Author
PersonId : 13596
IdHAL : marie-candito
IdRef : 153698616

Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing

Enrique Henestroza Anguiano

Role: Author
PersonId : 878441

Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing

Djamé Seddah

Role: Author
PersonId : 11545
IdHAL : djameseddah
IdRef : 086185136

Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing

Institut des Sciences Humaines Appliquées

Abstract

We present a simple and effective way to perform out-of-domain statistical parsing by drastically reducing lexical data sparseness in a PCFG-LA architecture. We replace terminal symbols with unsupervised word clusters acquired from a large newspaper corpus augmented with biomedical target-domain data. The resulting clusters are effective in bridging the lexical gap between source-domain and target-domain vocabularies. Our experiments combine known self-training techniques with unsupervised word clustering and produce promising results, achieving an error reduction of 21% on a new evaluation set for biomedical text with manual bracketing annotations.
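The idea of replacing terminal symbols with word clusters can be illustrated with a minimal sketch. It is not the authors' implementation: the cluster map, the cluster labels, the fallback label `C_UNK`, and the function name `to_cluster_terminals` are all hypothetical, and the abstract does not say which clustering algorithm produced the real clusters. The sketch only shows why a shared cluster lets a parser trained on newspaper text treat an unseen biomedical word like a word it already knows.

```python
from typing import Dict, List

# Hypothetical toy cluster map (word form -> cluster label).
# In the paper the clusters are learned without supervision from a large
# newspaper corpus augmented with biomedical text; here they are hand-written.
TOY_CLUSTERS: Dict[str, str] = {
    "the": "C01",
    "disease": "C17",    # frequent in newspaper (source-domain) text
    "carcinoma": "C17",  # mostly biomedical (target-domain) text
    "spreads": "C08",
}

# Assumed fallback label for words that have no cluster.
UNKNOWN_CLUSTER = "C_UNK"


def to_cluster_terminals(
    tokens: List[str],
    clusters: Dict[str, str],
    unknown: str = UNKNOWN_CLUSTER,
) -> List[str]:
    """Replace each token with its cluster label before it reaches the parser."""
    return [clusters.get(token.lower(), unknown) for token in tokens]


# A source-domain sentence and a target-domain sentence.
news = to_cluster_terminals(["The", "disease", "spreads"], TOY_CLUSTERS)
bio = to_cluster_terminals(["The", "carcinoma", "spreads"], TOY_CLUSTERS)

# Both sentences reduce to the same terminal sequence, so statistics
# gathered for "disease" in training also apply to "carcinoma".
print(news)
print(bio)
```

Because the grammar sees only cluster labels, its terminal vocabulary shrinks from the number of word forms to the number of clusters, which is the reduction in lexical data sparseness the abstract refers to.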

Keywords

statistical parsing; domain adaptation; biomedical texts

Domains

Document and text processing

Main file

IWPT2011-candito_henestro_seddah.pdf (61.36 KB)

Origin: Files produced by the author(s)

https://inria.hal.science/hal-00659577

Submitted on: Friday, January 13, 2012, 10:38:57

Last modified on: Thursday, February 15, 2024, 03:31:35

Long-term archiving on: Saturday, April 14, 2012, 02:22:50

Dates and versions

hal-00659577 , version 1 (13-01-2012)

Identifiers

HAL Id : hal-00659577 , version 1

Cite

Marie Candito, Enrique Henestroza Anguiano, Djamé Seddah. A Word Clustering Approach to Domain Adaptation: Effective Parsing of Biomedical Texts. IWPT'11 - 12th International Conference on Parsing Technologies, Oct 2011, Dublin, Ireland. ⟨hal-00659577⟩

Collections

UNIV-PARIS7 UNIV-RENNES1 INRIA IRISA INRIA2 CAMPUS-AAR AAI UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES SORBONNE-UNIVERSITE SU-LETTRES UR1-MATH-NUM

211 views

184 downloads
