A Hybrid Approach to Word Segmentation of Vietnamese Texts

Abstract: We present a hybrid approach to automatic tokenization of Vietnamese text. The approach combines finite-state automata techniques, regular expression parsing, and a maximal-matching strategy augmented by statistical methods to resolve segmentation ambiguities. The Vietnamese lexicon is compactly represented by a minimal finite-state automaton. A text to be tokenized is first parsed into lexical phrases and other patterns using predefined regular expressions. The automaton is then used to build linear graphs corresponding to the phrases to be segmented. Applying a maximal-matching strategy to a graph yields all candidate segmentations of a phrase. An ambiguity resolver, which uses a smoothed bigram language model, then chooses the most probable segmentation of the phrase. The hybrid approach is implemented in vnTokenizer, a highly accurate tokenizer for Vietnamese texts.
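The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the vnTokenizer implementation: a plain Python set stands in for the minimal finite-state automaton, the regular-expression pre-parsing stage is omitted, "maximal matching" is read here as keeping the paths with the fewest words through the linear graph, and the bigram model is smoothed by a simple linear interpolation with an add-one unigram estimate. The lexicon, the training corpus, the function names, and the weight `lam` are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical toy lexicon; a set stands in for the minimal automaton.
LEXICON = {"học", "sinh", "học sinh", "sinh học", "toán"}
# Hypothetical pre-segmented training sentences for the bigram model.
CORPUS = [["học sinh", "học", "sinh học"], ["học sinh", "học", "toán"]]

def build_edges(syllables, lexicon):
    """Linear graph: node i sits before syllable i; an edge i -> j means
    syllables[i:j] form a word. Single syllables are always accepted."""
    n = len(syllables)
    edges = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n + 1):
            if j == i + 1 or " ".join(syllables[i:j]) in lexicon:
                edges[i].append(j)
    return edges

def fewest_word_segmentations(syllables, lexicon):
    """Return every segmentation that uses the minimum number of words."""
    n = len(syllables)
    edges = build_edges(syllables, lexicon)
    dist = [0] * (n + 1)  # dist[i] = fewest words covering syllables[i:]
    for i in range(n - 1, -1, -1):
        dist[i] = 1 + min(dist[j] for j in edges[i])

    def expand(i):
        if i == n:
            yield []
            return
        for j in edges[i]:
            if dist[j] + 1 == dist[i]:  # stay on a shortest path
                for rest in expand(j):
                    yield [" ".join(syllables[i:j])] + rest

    return list(expand(0))

def train_bigram(corpus):
    """Count unigrams and bigrams, with <s> marking the sentence start."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_prob(words, unigrams, bigrams, lam=0.7):
    """Interpolated bigram log-probability (lam is an assumed weight)."""
    total = sum(unigrams.values())
    vocab = len(unigrams) + 1  # one extra slot for unseen words
    score, prev = 0.0, "<s>"
    for w in words:
        p_uni = (unigrams[w] + 1) / (total + vocab)
        p_bi = bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
        score += math.log(lam * p_bi + (1 - lam) * p_uni)
        prev = w
    return score

def segment(text, lexicon, unigrams, bigrams):
    """Pick the most probable of the fewest-word candidate segmentations."""
    candidates = fewest_word_segmentations(text.split(), lexicon)
    return max(candidates, key=lambda s: log_prob(s, unigrams, bigrams))

if __name__ == "__main__":
    uni, bi = train_bigram(CORPUS)
    print(segment("học sinh học sinh học", LEXICON, uni, bi))
```

In this toy setting the phrase "học sinh học sinh học" has several candidate segmentations with the same number of words, which is the kind of overlap ambiguity the statistical resolver is meant to settle; the language model only ranks candidates that the graph search has already produced.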
Document type:
Conference paper
2nd International Conference on Language and Automata Theory and Applications - LATA 2008, Mar 2008, Tarragona, Spain. Springer Berlin / Heidelberg, 5196, pp.240-249, 2008, Lecture Notes in Computer Science. 〈10.1007/978-3-540-88282-4_23〉

https://hal.inria.fr/inria-00334761
Contributor: Phuong Le-Hong
Submitted on: Monday, 27 October 2008 - 17:40:24
Last modified on: Thursday, 11 January 2018 - 06:22:10
Document(s) archived on: Tuesday, 28 June 2011 - 17:25:12

Files

LATA039.pdf
Publisher files allowed on an open archive

Citation

Hong Phuong Le, Thi Minh Huyen Nguyen, Azim Roussanaly, Tuong Vinh Ho. A Hybrid Approach to Word Segmentation of Vietnamese Texts. 2nd International Conference on Language and Automata Theory and Applications - LATA 2008, Mar 2008, Tarragona, Spain. Springer Berlin / Heidelberg, 5196, pp.240-249, 2008, Lecture Notes in Computer Science. 〈10.1007/978-3-540-88282-4_23〉. 〈inria-00334761〉
