Conference paper, Year: 2008

A Hybrid Approach to Word Segmentation of Vietnamese Texts

Abstract

We present in this article a hybrid approach to the automatic tokenization of Vietnamese text. The approach combines finite-state automata, regular expression parsing, and a maximal-matching strategy augmented by statistical methods to resolve segmentation ambiguities. The Vietnamese lexicon in use is compactly represented by a minimal finite-state automaton. A text to be tokenized is first parsed into lexical phrases and other patterns using predefined regular expressions. The automaton is then deployed to build linear graphs corresponding to the phrases to be segmented. Applying the maximal-matching strategy to a graph yields all candidate segmentations of a phrase. An ambiguity resolver, which uses a smoothed bigram language model, then chooses the most probable segmentation of the phrase. The hybrid approach is implemented in vnTokenizer, a highly accurate tokenizer for Vietnamese texts.
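To illustrate the pipeline the abstract describes (dictionary-based candidate segmentations of a phrase, ranked by a smoothed bigram language model), here is a minimal Python sketch. The toy lexicon, bigram counts, smoothing constant, and all function names are hypothetical stand-ins; this is not the authors' vnTokenizer code, which additionally uses regular-expression preprocessing and a minimal finite-state automaton over a full Vietnamese lexicon.

```python
# Minimal sketch of dictionary-based segmentation plus bigram ranking.
# All data below is toy/hypothetical; the real system trains a smoothed
# bigram model on a corpus and stores the lexicon in a minimal automaton.

import math

# Hypothetical toy lexicon of words (multi-syllable entries joined by spaces).
LEXICON = {"học sinh", "học", "sinh", "sinh học", "là", "một"}

# Hypothetical counts standing in for a trained, smoothed bigram model.
UNIGRAMS = {"học sinh": 50, "học": 120, "sinh": 80, "sinh học": 40, "là": 200, "một": 150}
BIGRAMS = {("học sinh", "học"): 5, ("học sinh", "sinh học"): 8, ("học", "sinh học"): 2}

def bigram_logprob(prev, word, alpha=0.5, vocab_size=10_000):
    """Additively smoothed bigram log-probability (stand-in for the paper's model)."""
    num = BIGRAMS.get((prev, word), 0) + alpha
    den = UNIGRAMS.get(prev, 0) + alpha * vocab_size
    return math.log(num / den)

def candidate_segmentations(syllables, max_len=4):
    """Enumerate all segmentations of a syllable sequence into lexicon words.

    This plays the role of matching over the linear graph: an edge (i, j)
    exists when syllables[i:j] is a dictionary word (single out-of-vocabulary
    syllables are always allowed).
    """
    n = len(syllables)
    results = []

    def extend(i, acc):
        if i == n:
            results.append(list(acc))
            return
        for j in range(i + 1, min(n, i + max_len) + 1):
            chunk = " ".join(syllables[i:j])
            if chunk in LEXICON or j == i + 1:
                acc.append(chunk)
                extend(j, acc)
                acc.pop()

    extend(0, [])
    return results

def best_segmentation(phrase):
    """Pick the candidate segmentation with the highest bigram log-probability."""
    syllables = phrase.split()
    best, best_score = None, float("-inf")
    for cand in candidate_segmentations(syllables):
        score = sum(bigram_logprob(prev, w) for prev, w in zip(["<s>"] + cand, cand))
        if score > best_score:
            best, best_score = cand, score
    return best

if __name__ == "__main__":
    # E.g. groups "học sinh" (student) as one word where the model prefers it.
    print(best_segmentation("học sinh học sinh học"))
```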
Main file
LATA039.pdf (132.92 KB)
Origin: Publisher files allowed on an open archive

Dates and versions

inria-00334761, version 1 (27-10-2008)

Identifiers

Cite

Hong Phuong Le, Thi Minh Huyen Nguyen, Azim Roussanaly, Tuong Vinh Ho. A Hybrid Approach to Word Segmentation of Vietnamese Texts. 2nd International Conference on Language and Automata Theory and Applications - LATA 2008, Mar 2008, Tarragona, Spain. pp.240-249, ⟨10.1007/978-3-540-88282-4_23⟩. ⟨inria-00334761⟩