The French Social Media Bank: a Treebank of Noisy User Generated Content

Abstract : In recent years, statistical parsers have reached high performance levels on well-edited texts. Domain adaptation techniques have improved parsing results on text genres that differ from the journalistic data most parsers are trained on. However, such corpora usually comply with standard linguistic, spelling and typographic conventions. Meanwhile, the emergence of Web 2.0 communication media has given rise to new types of online textual data. Although valuable, e.g., for data mining and sentiment analysis, such user-generated content rarely complies with standard conventions: it is noisy. This prevents most NLP tools, especially treebank-based parsers, from performing well on such data. For this reason, we have developed the French Social Media Bank, the first user-generated content treebank for French, a morphologically rich language (MRL). The first release of this resource contains 1,700 sentences from various Web 2.0 sources, including data specifically chosen for their high noisiness. We describe here how we created this treebank and present the methodology we used to fully annotate it. We also provide baseline POS tagging and statistical constituency parsing results, which are far lower than usual results on edited texts. This highlights the high difficulty of automatically processing such noisy data in an MRL.
Document type :
Conference papers
COLING 2012 - 24th International Conference on Computational Linguistics, Dec 2012, Mumbai, India. 2012

Cited literature [34 references]

https://hal.inria.fr/hal-00780895
Contributor : Djamé Seddah
Submitted on : Friday, January 25, 2013 - 12:56:51 AM
Last modification on : Friday, August 31, 2018 - 9:24:04 AM
Document(s) archived on : Friday, April 26, 2013 - 3:54:56 AM

File

coling2012.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-00780895, version 1

Citation

Djamé Seddah, Benoît Sagot, Marie Candito, Virginie Mouilleron, Vanessa Combet. The French Social Media Bank: a Treebank of Noisy User Generated Content. COLING 2012 - 24th International Conference on Computational Linguistics, Dec 2012, Mumbai, India. 2012. 〈hal-00780895〉

Metrics

Record views

731

File downloads

610