Building a Large Syntactically-Annotated Corpus of Vietnamese

Abstract: Treebanks are an important resource for both research and applications in natural language processing. For Vietnamese, such corpora are still lacking. This paper presents up-to-date results of a project to construct a Vietnamese treebank. Because Vietnamese is an isolating language with no word delimiter, sentence analysis involves many ambiguities. We systematically applied a range of linguistic techniques to handle these ambiguities. Annotators are supported by automatic labeling tools and a tree editor. Raw texts are extracted from Tuoi Tre (Youth), an online Vietnamese daily newspaper. The current annotation agreement is around 90 percent.
Document type: Conference paper
The Third Linguistic Annotation Workshop - The LAW III, Aug 2009, Singapore, Singapore. 6 p., 2009

Cited literature: 7 references

https://hal.inria.fr/inria-00421103
Contributor: Phuong Le-Hong
Submitted on: Tuesday, 15 December 2009 - 13:47:31
Last modified on: Tuesday, 24 April 2018 - 13:37:29
Document(s) archived on: Saturday, 26 November 2016 - 14:53:51

File

LAW09_Thai_final.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : inria-00421103, version 2

Citation

Phuong Thai Nguyen, Xuan Luong Vu, Thi Minh Huyen Nguyen, Van Hiep Nguyen, Hong Phuong Le. Building a Large Syntactically-Annotated Corpus of Vietnamese. The Third Linguistic Annotation Workshop - The LAW III, Aug 2009, Singapore, Singapore. 6 p., 2009. 〈inria-00421103v2〉
