The Alpage Architecture at the SANCL 2012 Shared Task: Robust Pre-Processing and Lexical Bridging for User-Generated Content Parsing

Abstract: We describe the architecture we set up during the SANCL shared task for parsing user-generated texts, which deviate in various ways from the linguistic conventions used in available training treebanks. This architecture focuses on coping with such divergence. It relies on the PCFG-LA framework (Petrov and Klein, 2007), as implemented by Attia et al. (2010). We explore several techniques to increase robustness: (i) a lexical bridge technique (Candito et al., 2011) that uses unsupervised word clustering (Koo et al., 2008); (ii) a special instantiation of self-training aimed at coping with POS tags unknown to the training set; (iii) the wrapping of a POS tagger with rule-based processing for dealing with recurrent non-standard tokens; and (iv) the guiding of out-of-domain parsing with predicted part-of-speech tags for unknown words and unknown (word, tag) pairs. Our systems ranked second and third out of eight in the constituency parsing track of the SANCL competition.
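
As a rough illustration of ideas (i) and (iii) in the abstract, the sketch below first maps recurrent non-standard tokens to placeholders with hand-written rules, then replaces each remaining word with an unsupervised cluster identifier so that a word unseen in the treebank can share a symbol with words that were seen. It is not the authors' pipeline: the cluster map, the cluster identifiers, the placeholder names, and the regular expressions are all hypothetical and chosen only for the example.

```python
import re

# Hypothetical cluster map. In a real system this would be induced by an
# unsupervised clustering run over a large raw corpus; the IDs are made up.
WORD_TO_CLUSTER = {
    "great": "C0110",
    "gr8": "C0110",
    "awesome": "C0110",
    "movie": "C1011",
    "film": "C1011",
}

# Hypothetical rules for recurrent non-standard tokens (assumed for this
# sketch, not taken from the paper).
PLACEHOLDER_RULES = [
    (re.compile(r"^https?://\S+$"), "_URL_"),
    (re.compile(r"^[:;=][-']?[()DPp]$"), "_SMILEY_"),
]


def normalize_token(token):
    """Return a placeholder if a rule matches, else the token unchanged."""
    for pattern, placeholder in PLACEHOLDER_RULES:
        if pattern.match(token):
            return placeholder
    return token


def bridge_token(token):
    """Normalize the token, then map it to its cluster ID when one is known."""
    token = normalize_token(token)
    return WORD_TO_CLUSTER.get(token.lower(), token)


def bridge_sentence(tokens):
    """Apply the normalization and cluster mapping to every token."""
    return [bridge_token(t) for t in tokens]


if __name__ == "__main__":
    print(bridge_sentence(["gr8", "film", ":)", "http://example.com/x"]))
```

In this toy setup, "gr8" and "great" end up as the same symbol, which is the intuition behind a lexical bridge: the same mapping would be applied to both the training treebank and the text to be parsed, so that the parser sees cluster symbols on both sides. Tokens with no rule and no cluster entry are passed through unchanged.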
Document type:
Conference paper
SANCL 2012 - First Workshop on Syntactic Analysis of Non-Canonical Language, an NAACL-HLT'12 workshop, Jun 2012, Montréal, Canada. 2012

https://hal.inria.fr/hal-00703124
Contributor: Benoît Sagot
Submitted on: Friday, 1 June 2012 - 16:54:06
Last modified on: Saturday, 9 June 2018 - 10:30:06
Document(s) archived on: Sunday, 2 September 2012 - 02:46:45

File

SANCL-Alpage.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-00703124, version 2

Citation

Djamé Seddah, Benoît Sagot, Marie Candito. The Alpage Architecture at the SANCL 2012 Shared Task: Robust Pre-Processing and Lexical Bridging for User-Generated Content Parsing. SANCL 2012 - First Workshop on Syntactic Analysis of Non-Canonical Language, an NAACL-HLT'12 workshop, Jun 2012, Montréal, Canada. 2012. 〈hal-00703124v2〉
