
The Alpage Architecture at the SANCL 2012 Shared Task: Robust Pre-Processing and Lexical Bridging for User-Generated Content Parsing

Abstract : We describe the architecture we set up for the SANCL shared task on parsing user-generated texts, which deviate in various ways from the linguistic conventions used in available training treebanks. This architecture focuses on coping with such divergence. It relies on the PCFG-LA framework (Petrov and Klein, 2007), as implemented by Attia et al. (2010). We explore several techniques to improve robustness: (i) a lexical bridge technique (Candito et al., 2011) that uses unsupervised word clustering (Koo et al., 2008); (ii) a special instantiation of self-training aimed at coping with POS tags unknown to the training set; (iii) the wrapping of a POS tagger with rule-based processing for dealing with recurrent non-standard tokens; and (iv) the guiding of out-of-domain parsing with predicted part-of-speech tags for unknown words and unknown (word, tag) pairs. Our systems ranked second and third out of eight in the constituency parsing track of the SANCL competition.
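
As a rough illustration of the lexical bridge idea in (i), the sketch below replaces each token with the identifier of its word cluster, so that a non-standard form and a form seen in the training treebank can end up sharing one symbol. This is a minimal sketch under stated assumptions, not the authors' implementation: the cluster table `TOY_CLUSTERS`, its identifiers, and the function `bridge_tokens` are hypothetical, and a real table would come from an unsupervised clustering run over raw text.

```python
from typing import Dict, List

# Hypothetical word -> cluster-id table (assumption for illustration only).
# In practice it would be produced by unsupervised word clustering.
TOY_CLUSTERS: Dict[str, str] = {
    "great": "C0110",
    "gr8": "C0110",
    "awesome": "C0110",
    "movie": "C1011",
    "film": "C1011",
}


def bridge_tokens(tokens: List[str], clusters: Dict[str, str]) -> List[str]:
    """Map each token to its cluster id; tokens without a cluster are kept as-is."""
    return [clusters.get(tok.lower(), tok) for tok in tokens]


if __name__ == "__main__":
    # "gr8" and "great" share a cluster, so both map to the same symbol.
    print(bridge_tokens(["gr8", "film", "!"], TOY_CLUSTERS))
```

In this toy setting a parser trained on cluster identifiers would see the same symbol for "gr8" as for "great", which is the intuition behind bridging the lexical gap between treebank text and user-generated content.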
Document type :
Conference papers

https://hal.inria.fr/hal-00703124
Contributor : Benoît Sagot
Submitted on : Thursday, May 31, 2012 - 9:30:53 PM
Last modification on : Thursday, February 11, 2021 - 2:38:02 PM
Long-term archiving on : Saturday, September 1, 2012 - 2:31:30 AM

File

SANCL-Alpage.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-00703124, version 1

Citation

Djamé Seddah, Benoît Sagot, Marie Candito. The Alpage Architecture at the SANCL 2012 Shared Task: Robust Pre-Processing and Lexical Bridging for User-Generated Content Parsing. First Workshop on Syntactic Analysis of Non-Canonical Language (SANCL), an NAACL-HLT'12 workshop, Jun 2012, Montréal, Canada. ⟨hal-00703124v1⟩
