Word segmentation of Vietnamese texts: a comparison of approaches

Abstract: This paper compares three word segmentation systems for the Vietnamese language. Most Vietnamese words are built by semantic composition from about 7,000 syllables, which also have a meaning as isolated words. Identifying word boundaries in a text is therefore not a simple task, and ambiguities often arise. Beyond presenting the tested systems, we also propose a standard definition of word segmentation in Vietnamese and introduce a reference corpus developed for evaluating this task. The results confirm that the task can be handled relatively well by automatic means, although a solution is still needed to account for out-of-vocabulary words.
Document type:
Conference paper
6th international conference on Language Resources and Evaluation - LREC 2008, May 2008, Marrakech, Morocco. 2008

https://hal.inria.fr/inria-00334760
Contributor: Phuong Le-Hong
Submitted on: Tuesday, 28 October 2008 - 07:16:31
Last modified on: Thursday, 11 January 2018 - 06:22:10
Document(s) archived on: Tuesday, 28 June 2011 - 17:41:43

Files

lrec2008final.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: inria-00334760, version 1

Citation

Quang Thang Dinh, Hong Phuong Le, Thi Minh Huyen Nguyen, Cam Tu Nguyen, Mathias Rossignol, et al. Word segmentation of Vietnamese texts: a comparison of approaches. 6th international conference on Language Resources and Evaluation - LREC 2008, May 2008, Marrakech, Morocco. 2008. 〈inria-00334760〉
