A Word Clustering Approach to Domain Adaptation: Effective Parsing of Biomedical Texts

Abstract: We present a simple and effective way to perform out-of-domain statistical parsing by drastically reducing lexical data sparseness in a PCFG-LA architecture. We replace terminal symbols with unsupervised word clusters acquired from a large newspaper corpus augmented with biomedical target-domain data. The resulting clusters are effective in bridging the lexical gap between source-domain and target-domain vocabularies. Our experiments combine known self-training techniques with unsupervised word clustering and produce promising results, achieving an error reduction of 21% on a new evaluation set for biomedical text with manual bracketing annotations.
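
As a rough illustration of the terminal-replacement idea described in the abstract, the following minimal sketch maps each word of a bracketed parse tree to a cluster identifier. The cluster table, the `UNK` fallback label, the lowercasing step, and the function name `replace_terminals` are all assumptions made for illustration; this record does not describe the paper's actual clustering or preprocessing pipeline.

```python
import re

# Hypothetical word-to-cluster table; in practice it would come from an
# unsupervised clustering run over newspaper plus biomedical text.
CLUSTERS = {
    "the": "C01",
    "protein": "C17",
    "gene": "C17",
    "binds": "C42",
    "activates": "C42",
}

# Matches a preterminal such as "(NN protein)": a tag followed by one word.
PRETERMINAL = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")


def replace_terminals(tree, clusters, unknown="UNK"):
    """Replace each word in a bracketed tree with its cluster ID."""
    def swap(match):
        tag, word = match.group(1), match.group(2)
        # Lowercasing is an assumed normalization step.
        cluster = clusters.get(word.lower(), unknown)
        return f"({tag} {cluster})"
    return PRETERMINAL.sub(swap, tree)


tree = "(S (NP (DT The) (NN protein)) (VP (VBZ binds) (NP (DT the) (NN gene))))"
print(replace_terminals(tree, CLUSTERS))
```

Because `protein` and `gene` share a cluster in this toy table, a grammar trained on the rewritten trees would treat them as the same terminal symbol, which is the kind of sparseness reduction the abstract refers to.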
Document type:
Conference paper
IWPT'11 - 12th International Conference on Parsing Technologies, Oct 2011, Dublin, Ireland. 2011
Cited literature [21 references]

https://hal.inria.fr/hal-00659577
Contributor: Marie Candito
Submitted on: Friday, 13 January 2012 - 10:38:57
Last modified on: Tuesday, 17 April 2018 - 11:25:46
Document(s) archived on: Saturday, 14 April 2012 - 02:22:50

File

IWPT2011-candito_henestro_sedd...
Files produced by the author(s)

Identifiers

  • HAL Id: hal-00659577, version 1

Citation

Marie Candito, Enrique Henestroza Anguiano, Djamé Seddah. A Word Clustering Approach to Domain Adaptation: Effective Parsing of Biomedical Texts. IWPT'11 - 12th International Conference on Parsing Technologies, Oct 2011, Dublin, Ireland. 2011. 〈hal-00659577〉

Metrics

Record views: 209

File downloads: 215