The Alpage Architecture at the SANCL 2012 Shared Task: Robust Pre-Processing and Lexical Bridging for User-Generated Content Parsing

Abstract : We describe the architecture we set up for the SANCL shared task on parsing user-generated texts, which deviate in various ways from the linguistic conventions used in available training treebanks. This architecture focuses on coping with such divergence. It relies on the PCFG-LA framework (Petrov and Klein, 2007), as implemented by Attia et al. (2010). We explore several techniques to increase robustness: (i) a lexical bridge technique (Candito et al., 2011) that uses unsupervised word clustering (Koo et al., 2008); (ii) a special instantiation of self-training aimed at coping with POS tags unknown to the training set; (iii) the wrapping of a POS tagger with rule-based processing for dealing with recurrent non-standard tokens; and (iv) the guiding of out-of-domain parsing with predicted part-of-speech tags for unknown words and unknown (word, tag) pairs. Our systems ranked second and third out of eight in the constituency parsing track of the SANCL competition.
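
The idea behind technique (i) can be illustrated with a minimal sketch. It is not the authors' implementation: the cluster table, the cluster identifiers, and the `bridge` helper are all hypothetical, and the lowercased lookup is an assumption made for the example. The sketch only shows how mapping word forms to shared cluster identifiers lets a non-standard token and its in-domain counterparts appear to a parser as the same symbol.

```python
from typing import Dict, List

# Hypothetical cluster table (word form -> cluster identifier).
# A real table would be induced by unsupervised clustering over a
# large raw corpus; these entries are invented for illustration.
TOY_CLUSTERS: Dict[str, str] = {
    "great": "C0110",
    "gr8": "C0110",
    "awesome": "C0110",
    "movie": "C1011",
    "film": "C1011",
}


def bridge(tokens: List[str], clusters: Dict[str, str]) -> List[str]:
    """Replace each token found in the cluster table (after lowercasing)
    by its cluster identifier; tokens without a cluster are kept as-is."""
    return [clusters.get(tok.lower(), tok) for tok in tokens]


# An edited-text sentence and a user-generated one end up sharing symbols.
print(bridge(["A", "great", "film"], TOY_CLUSTERS))
print(bridge(["a", "gr8", "movie"], TOY_CLUSTERS))
```

Under these assumptions, the same substitution would be applied to the training treebank and to the text to be parsed, so that the grammar is learned and applied over the same reduced vocabulary.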
Document type :
Conference papers

https://hal.inria.fr/hal-00703124
Contributor : Benoît Sagot
Submitted on : Friday, June 1, 2012 - 4:54:06 PM
Last modification on : Friday, May 3, 2019 - 1:41:35 AM
Long-term archiving on : Sunday, September 2, 2012 - 2:46:45 AM

File

SANCL-Alpage.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-00703124, version 2

Citation

Djamé Seddah, Benoît Sagot, Marie Candito. The Alpage Architecture at the SANCL 2012 Shared Task: Robust Pre-Processing and Lexical Bridging for User-Generated Content Parsing. SANCL 2012 - First Workshop on Syntactic Analysis of Non-Canonical Language, an NAACL-HLT'12 workshop, Jun 2012, Montréal, Canada. ⟨hal-00703124v2⟩
