Le corpus Sequoia : annotation syntaxique et exploitation pour l'adaptation d'analyseur par pont lexical [The Sequoia corpus: syntactic annotation and use for parser adaptation via lexical bridging]

Abstract : We present the construction methodology and the properties of the Sequoia treebank, a freely available French corpus annotated following the French Treebank guidelines (Abeillé and Barrier, 2004). The Sequoia treebank comprises 3204 sentences (69246 tokens) drawn from the French Europarl, the regional newspaper L'Est Républicain, the French Wikipedia, and documents from the European Medicines Agency. We then describe a method for parser domain adaptation that makes use of unsupervised word clusters. The method improves parsing performance on the target domains (the domains of the Sequoia corpus) without degrading performance on the source domain (the French Treebank test set), unlike other domain adaptation techniques such as self-training.
Document type :
Conference papers

Cited literature : 22 references

https://hal.inria.fr/hal-00698938
Contributor : Marie Candito
Submitted on : Friday, May 18, 2012 - 1:34:43 PM
Last modification on : Friday, May 3, 2019 - 1:41:35 AM
Long-term archiving on : Friday, November 30, 2012 - 11:55:43 AM

File

canditoseddah-taln2012-final.p...
Files produced by the author(s)

Identifiers

  • HAL Id : hal-00698938, version 1

Citation

Marie Candito, Djamé Seddah. Le corpus Sequoia : annotation syntaxique et exploitation pour l'adaptation d'analyseur par pont lexical. TALN 2012 - 19e conférence sur le Traitement Automatique des Langues Naturelles, Jun 2012, Grenoble, France. ⟨hal-00698938⟩
