Word segmentation of Vietnamese texts: a comparison of approaches

Quang Thang Dinh; Hong Phuong Le; Thi Minh Huyen Nguyen; Cam Tu Nguyen; Mathias Rossignol; Xuan Luong Vu

Conference paper, Year: 2008

Quang Thang Dinh

Role: Author
PersonId : 855075

Faculté de Mathématiques, Mécanique et Informatique

Hong Phuong Le

Role: Author
PersonId : 835932

Knowledge Information and Web Intelligence

Thi Minh Huyen Nguyen

Role: Author
PersonId : 855072

Faculté de Mathématiques, Mécanique et Informatique

Cam Tu Nguyen

Role: Author
PersonId : 855076

Vietlex

Mathias Rossignol

Role: Author

Multimédia, Informations, Communication et Applications

Xuan Luong Vu

Role: Author

Faculté de Mathématiques, Mécanique et Informatique

Abstract

This paper compares three word segmentation systems for the Vietnamese language. Most Vietnamese words are built by semantic composition from about 7,000 syllables, which also carry meaning as isolated words. Identifying word boundaries in a text is therefore not a simple task, and ambiguities often arise. Beyond presenting the tested systems, we also propose a standard definition of word segmentation in Vietnamese and introduce a reference corpus developed for evaluating this task. The results confirm that the task can be handled relatively well by automatic means, although a solution is still needed to account for out-of-vocabulary words.
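To make the ambiguity described in the abstract concrete, here is a minimal sketch of a greedy longest-match segmenter over syllables. It is an illustration only and is not claimed to be any of the three systems compared in the paper; the function name `longest_match_segment`, the `max_len` parameter, and the four-entry `TOY_LEXICON` are assumptions made for this example.

```python
from typing import List, Set


def longest_match_segment(
    syllables: List[str], lexicon: Set[str], max_len: int = 4
) -> List[str]:
    """Greedy left-to-right longest matching (illustrative sketch only).

    At each position, take the longest run of syllables found in the
    lexicon. A syllable that starts no known word is emitted alone,
    which is how out-of-vocabulary material falls through.
    """
    words: List[str] = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, down to a single syllable.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


# Hypothetical toy lexicon: "học sinh" (pupil), "sinh học" (biology),
# "học" (to study), "sinh" (to be born).
TOY_LEXICON = {"học sinh", "sinh học", "học", "sinh"}

# Intended reading: "học sinh | học | sinh học" (pupils study biology).
text = "học sinh học sinh học"
result = longest_match_segment(text.split(), TOY_LEXICON)
print(result)  # ['học sinh', 'học sinh', 'học']
```

The greedy strategy returns "học sinh | học sinh | học" rather than the intended "học sinh | học | sinh học", because every syllable is itself a word and adjacent syllables can combine in more than one way. A syllable missing from the lexicon is simply emitted on its own, which shows in miniature why out-of-vocabulary words need separate treatment.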

Domains

Document and Text Processing

Main file

lrec2008final.pdf (85 KB)

Origin: Files produced by the author(s)

Contributor: Phuong Le-Hong

https://inria.hal.science/inria-00334760

Submitted on: Tuesday, October 28, 2008, 07:16:31

Last modified on: Monday, April 15, 2024, 10:34:02

Long-term archiving on: Tuesday, June 28, 2011, 17:41:43

Dates and versions

inria-00334760, version 1 (28-10-2008)

Identifiers

HAL Id: inria-00334760, version 1

Cite

Quang Thang Dinh, Hong Phuong Le, Thi Minh Huyen Nguyen, Cam Tu Nguyen, Mathias Rossignol, et al. Word segmentation of Vietnamese texts: a comparison of approaches. 6th international conference on Language Resources and Evaluation - LREC 2008, ELRA - European Language Resources Association, May 2008, Marrakech, Morocco. ⟨inria-00334760⟩

Collections

UGA CNRS INRIA UNIV-LORRAINE LORIA
