Conference paper. Year: 2011

A Word Clustering Approach to Domain Adaptation: Effective Parsing of Biomedical Texts

Abstract

We present a simple and effective way to perform out-of-domain statistical parsing by drastically reducing lexical data sparseness in a PCFG-LA architecture. We replace terminal symbols with unsupervised word clusters acquired from a large newspaper corpus augmented with biomedical target-domain data. The resulting clusters are effective in bridging the lexical gap between source-domain and target-domain vocabularies. Our experiments combine known self-training techniques with unsupervised word clustering and produce promising results, achieving an error reduction of 21% on a new evaluation set for biomedical text with manual bracketing annotations.
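The terminal replacement described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the cluster table, the cluster identifiers, the lowercasing step, and the `C_UNK` fallback for unseen words are all assumptions made for illustration, and the abstract does not say which clustering algorithm produces the clusters.

```python
from typing import Dict, List

# Hypothetical word-to-cluster table. In practice the clusters would be
# learned without supervision from a large unlabeled corpus that mixes
# source-domain (newspaper) and target-domain (biomedical) text.
WORD_TO_CLUSTER: Dict[str, str] = {
    "protein": "C0110",
    "enzyme": "C0110",
    "market": "C1011",
    "binds": "C0011",
}

# Assumed fallback identifier for words missing from the table.
UNKNOWN_CLUSTER = "C_UNK"


def to_cluster_terminals(tokens: List[str], table: Dict[str, str]) -> List[str]:
    """Replace each token with its cluster ID (lookup is lowercased)."""
    return [table.get(tok.lower(), UNKNOWN_CLUSTER) for tok in tokens]


sentence = ["Protein", "binds", "enzyme", "rapidly"]
clustered = to_cluster_terminals(sentence, WORD_TO_CLUSTER)
print(clustered)
```

Because "protein" and "enzyme" share a cluster in this toy table, a parser trained on clustered terminals sees both as the same symbol, which is the sense in which clustering reduces lexical sparseness across domains.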
Main file
IWPT2011-candito_henestro_seddah.pdf (61.36 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-00659577, version 1 (13-01-2012)

Identifiers

  • HAL Id: hal-00659577, version 1

Cite

Marie Candito, Enrique Henestroza Anguiano, Djamé Seddah. A Word Clustering Approach to Domain Adaptation: Effective Parsing of Biomedical Texts. IWPT'11 - 12th International Conference on Parsing Technologies, Oct 2011, Dublin, Ireland. ⟨hal-00659577⟩
