A Maximum Entropy Approach to Sentence Boundary Detection of Vietnamese Texts
Conference paper, Year: 2008

Abstract

We present, for the first time, a system for detecting sentence boundaries in Vietnamese texts. The system is based on a maximum entropy model. The training procedure requires no hand-crafted rules, lexicon, or domain-specific information. Given a corpus annotated with sentence boundaries, the model learns to classify each occurrence of a potential end-of-sentence punctuation mark as either a valid or an invalid sentence boundary. Evaluated on a Vietnamese corpus, the system achieves a recall of about 95%. The approach has been implemented as a software tool named vnSentDetector, a plug-in of the open source software framework vnToolkit, which is intended to be a general framework integrating useful tools for processing Vietnamese texts.
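
The classification step described in the abstract can be illustrated with a minimal sketch. It is not the vnSentDetector implementation: the candidate punctuation set, the contextual features, the toy training sentences, and all function names below are assumptions made for illustration. It relies on the fact that a maximum entropy model with two outcomes is equivalent to logistic regression, trained here by plain stochastic gradient ascent on the log-likelihood.

```python
import math

# Assumption: the set of potential end-of-sentence marks (the paper's set is not given here).
CANDIDATES = ".!?"

def candidates(text):
    """Return the index of every potential end-of-sentence punctuation mark."""
    return [i for i, ch in enumerate(text) if ch in CANDIDATES]

def features(text, i):
    """Binary context features around position i (illustrative, not the paper's)."""
    left = text[:i].split()
    prev_tok = left[-1] if left else ""
    rest = text[i + 1:]
    right = rest.split()
    next_tok = right[0] if right else ""
    feats = {"bias", "punct=" + text[i]}
    if len(prev_tok) <= 2:
        feats.add("prev_short")
    if prev_tok.isupper():
        feats.add("prev_upper")      # e.g. abbreviations such as "TP" or "TS"
    if prev_tok[-1:].isdigit():
        feats.add("prev_digit")
    if rest[:1].isspace():
        feats.add("next_space")
    if next_tok[:1].isupper():
        feats.add("next_cap")
    if next_tok[:1].isdigit():
        feats.add("next_digit")      # e.g. the dot inside "3.500"
    if not rest.strip():
        feats.add("end_of_text")
    return feats

def prob(weights, feats):
    """P(boundary | context) under a two-outcome maximum entropy model."""
    z = sum(weights.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-z))

def make_examples(sentences):
    """Join annotated sentences and label each candidate mark as boundary (1) or not (0)."""
    text = " ".join(sentences)
    ends, offset = set(), 0
    for s in sentences:
        ends.add(offset + len(s) - 1)   # index of the sentence-final character
        offset += len(s) + 1            # skip the joining space
    return [(features(text, i), 1 if i in ends else 0) for i in candidates(text)]

def train(examples, epochs=200, lr=0.5):
    """Stochastic gradient ascent on the conditional log-likelihood."""
    weights = {}
    for _ in range(epochs):
        for feats, label in examples:
            err = label - prob(weights, feats)
            for f in feats:
                weights[f] = weights.get(f, 0.0) + lr * err
    return weights

def split_sentences(text, weights):
    """Cut the text after every candidate mark classified as a valid boundary."""
    sentences, start = [], 0
    for i in candidates(text):
        if prob(weights, features(text, i)) > 0.5:
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

# Toy annotated corpus (hypothetical), one sentence per entry.
TRAIN = [
    "Tôi sống ở TP. Hồ Chí Minh.",
    "Giá vé là 3.500 đồng.",
    "Bạn có khỏe không?",
    "TS. Nguyễn Văn An dạy ở Hà Nội.",
    "Trời hôm nay đẹp quá!",
    "Xe này giá 12.000 đồng.",
    "Ông ấy đến từ TP. Đà Nẵng.",
]

weights = train(make_examples(TRAIN))
print(split_sentences("Cô ấy làm việc ở TP. Huế. Giá là 2.000 đồng.", weights))
```

In this sketch the only supervision is the sentence-per-entry corpus, which mirrors the abstract's point that no hand-crafted rules or lexicon are needed: abbreviations and numeric dots are separated from real boundaries purely through the learned feature weights.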
Main file
rivf2008.pdf (107.72 KB)
Origin: Publisher files allowed on an open archive

Dates and versions

inria-00334762, version 1 (27-10-2008)

Identifiers

  • HAL Id: inria-00334762, version 1

Cite

Hong Phuong Le, Tuong Vinh Ho. A Maximum Entropy Approach to Sentence Boundary Detection of Vietnamese Texts. IEEE International Conference on Research, Innovation and Vision for the Future - RIVF 2008, Jul 2008, Ho Chi Minh City, Vietnam. ⟨inria-00334762⟩