A Word Clustering Approach to Domain Adaptation: Robust parsing of source and target domains

Abstract: We present a technique to improve out-of-domain statistical parsing by reducing lexical data sparseness in a PCFG-LA architecture. We replace terminal symbols with unsupervised word clusters acquired from a large newspaper corpus augmented with target-domain data. We also investigate the impact of guiding out-of-domain parsing with predicted part-of-speech tags. We provide an evaluation for French, and obtain improvements in performance for both non-technical and technical target domains. Though the improvements over a strong baseline are slight, an interesting result is that the proposed techniques also improve parsing performance on the source domain, contrary to techniques such as self-training, thus leading to a more robust parser overall. We also describe new target domain evaluation treebanks, freely available, that comprise a total of about 3,000 annotated sentences from the medical domain, regional newspaper articles, French Europarl and French Wikipedia.
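The terminal-replacement step mentioned in the abstract can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the authors' pipeline: the cluster table, the cluster IDs, the `C_UNK` fallback symbol, the lowercasing step, and the function name `to_cluster_terminals` are all hypothetical, and the clustering itself is not shown.

```python
# Hypothetical word-to-cluster table. In practice it would be produced by an
# unsupervised clustering run over a large corpus (not shown here); the IDs
# below are invented for illustration.
WORD_TO_CLUSTER = {
    "patient": "C17",
    "malade": "C17",
    "médecin": "C42",
    "docteur": "C42",
    "examine": "C08",
}

UNKNOWN = "C_UNK"  # assumed fallback symbol for words that have no cluster


def to_cluster_terminals(tokens, table=WORD_TO_CLUSTER, unknown=UNKNOWN):
    """Replace each token with its cluster ID, lowercasing before lookup."""
    return [table.get(tok.lower(), unknown) for tok in tokens]


if __name__ == "__main__":
    # Words sharing a cluster become the same terminal symbol for the parser.
    print(to_cluster_terminals(["Le", "médecin", "examine", "le", "malade"]))
```

Because "médecin" and "docteur" share a cluster in this toy table, a grammar trained on cluster IDs would treat them as the same terminal, which is the sense in which clustering reduces lexical sparseness.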
Document type:
Journal article

https://hal.inria.fr/hal-00940224
Contributor: Djamé Seddah
Submitted on: Friday, 31 January 2014 - 15:52:42
Last modified on: Friday, 12 January 2018 - 15:34:01

Citation

Djamé Seddah, Marie Candito, Enrique Henestroza Anguiano. A Word Clustering Approach to Domain Adaptation: Robust parsing of source and target domains. Journal of Logic and Computation, Oxford University Press (OUP), 2013, 〈http://logcom.oxfordjournals.org/content/early/2013/02/22/logcom.exs082.full.pdf+html〉. 〈10.1093/logcom/exs082〉. 〈hal-00940224〉
