
Word segmentation of Vietnamese texts: a comparison of approaches

Abstract: We compare three word segmentation systems for the Vietnamese language. Most Vietnamese words are built by semantic composition from about 7,000 syllables, which also carry meaning as isolated words. Identifying word boundaries in a text is therefore not a simple task, and ambiguities often arise. Beyond presenting the tested systems, we also propose a standard definition of word segmentation in Vietnamese and introduce a reference corpus developed to evaluate this task. The results confirm that the task can be handled relatively well by automatic means, although a solution is still needed to handle out-of-vocabulary words.
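To make the boundary ambiguity described in the abstract concrete, here is a minimal sketch in Python. It is not one of the three systems compared in the paper. The four-entry lexicon and the helper name `segmentations` are assumptions chosen for illustration only. The sketch enumerates every way of grouping a syllable sequence into lexicon words.

```python
def segmentations(syllables, lexicon, max_len=3):
    """Return every split of `syllables` into words found in `lexicon`.

    `max_len` is an assumed upper bound on syllables per word.
    """
    if not syllables:
        return [[]]  # exactly one way to segment an empty sequence
    results = []
    for n in range(1, min(max_len, len(syllables)) + 1):
        word = " ".join(syllables[:n])
        if word in lexicon:
            # Keep this word and segment the remaining syllables recursively.
            for rest in segmentations(syllables[n:], lexicon, max_len):
                results.append([word] + rest)
    return results


# Toy lexicon (an assumption): each syllable is a word on its own,
# and both two-syllable combinations are words too.
LEXICON = {"học", "sinh", "học sinh", "sinh học"}

SENTENCE = "học sinh học sinh học".split()
ALL = segmentations(SENTENCE, LEXICON)

for candidate in ALL:
    print(" | ".join(candidate))
print(len(ALL), "candidate segmentations")
```

With this toy lexicon the five-syllable sequence has 8 candidate segmentations, of which "học sinh | học | sinh học" is one plausible reading. A real segmenter must choose among such candidates, and a purely lexicon-based approach like this one finds no segmentation at all when a syllable is missing from the lexicon, which is the out-of-vocabulary problem the abstract mentions.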
Document type: Conference papers

https://hal.inria.fr/inria-00334760
Contributor : Phuong Le-Hong
Submitted on : Tuesday, October 28, 2008 - 7:16:31 AM
Last modification on : Thursday, July 15, 2021 - 10:40:04 AM
Long-term archiving on : Tuesday, June 28, 2011 - 5:41:43 PM

Files

lrec2008final.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : inria-00334760, version 1

Citation

Quang Thang Dinh, Hong Phuong Le, Thi Minh Huyen Nguyen, Cam Tu Nguyen, Mathias Rossignol, et al. Word segmentation of Vietnamese texts: a comparison of approaches. 6th International Conference on Language Resources and Evaluation - LREC 2008, ELRA - European Language Resources Association, May 2008, Marrakech, Morocco. ⟨inria-00334760⟩
