A Maximum Entropy Approach to Sentence Boundary Detection of Vietnamese Texts

Abstract: We present, for the first time, a sentence boundary detection system for Vietnamese texts. The system is based on a maximum entropy model, and its training procedure requires no hand-crafted rules, lexicon, or domain-specific information. Given a corpus annotated with sentence boundaries, the model learns to classify each occurrence of a potential end-of-sentence punctuation mark as either a valid or an invalid sentence boundary. On a Vietnamese corpus, the system achieves a recall of about 95%. The approach is implemented in a software tool named vnSentDetector, a plug-in for the open source framework vnToolkit, which is intended as a general framework integrating useful tools for processing Vietnamese texts.
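The classification step described in the abstract can be illustrated with a minimal sketch. It is not the vnSentDetector implementation: the candidate punctuation set, the feature names, the toy corpus, and the training settings below are all assumptions made for illustration. The sketch relies on the fact that a two-class maximum entropy model over binary features is equivalent to logistic regression, and fits it with plain stochastic gradient ascent.

```python
import math

# Assumed set of potential end-of-sentence punctuation marks.
CANDIDATE_MARKS = ".?!"

# Toy annotated corpus (illustrative only): one gold sentence per entry.
SENTENCES = [
    "Tôi sống ở TP. Hồ Chí Minh.",
    "Giá vé là 3.5 triệu đồng.",
    "Bạn khỏe không?",
    "Tôi khỏe!",
]


def candidate_positions(text):
    """Return the index of every potential end-of-sentence mark in text."""
    return [i for i, ch in enumerate(text) if ch in CANDIDATE_MARKS]


def features(text, i):
    """Hypothetical binary context features for the mark at index i."""
    before = text[:i].split()
    prev_token = before[-1] if before else ""
    rest = text[i + 1:]
    feats = ["bias", "mark=" + text[i]]
    if not rest or rest[0].isspace():
        feats.append("followed_by_space_or_end")
    if rest.lstrip()[:1].isupper():
        feats.append("next_word_capitalized")
    if prev_token.isupper():
        feats.append("prev_token_all_caps")
    if prev_token.isdigit():
        feats.append("prev_token_is_number")
    return feats


def labelled_examples(sentences):
    """Join gold sentences; label each candidate mark 1 (boundary) or 0."""
    text = " ".join(sentences)
    gold, start = set(), 0
    for s in sentences:
        end = start + len(s) - 1  # index of the sentence's final character
        gold.add(end)
        start = end + 2           # skip the joining space
    examples = [(features(text, i), int(i in gold))
                for i in candidate_positions(text)]
    return text, examples


def boundary_probability(weights, feats):
    """P(boundary | context) under a two-class maximum entropy model."""
    z = sum(weights.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=200, lr=0.5):
    """Fit feature weights by stochastic gradient ascent on the log-likelihood."""
    weights = {}
    for _ in range(epochs):
        for feats, label in examples:
            err = label - boundary_probability(weights, feats)
            for f in feats:
                weights[f] = weights.get(f, 0.0) + lr * err
    return weights


def split_sentences(text, weights, threshold=0.5):
    """Cut text after every candidate mark classified as a valid boundary."""
    sentences, start = [], 0
    for i in candidate_positions(text):
        if boundary_probability(weights, features(text, i)) > threshold:
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    text, examples = labelled_examples(SENTENCES)
    weights = train(examples)
    for sentence in split_sentences(text, weights):
        print(sentence)
```

The toy corpus deliberately contains two invalid boundaries, the abbreviation "TP." and the decimal "3.5", so that the learned weights have to separate them from genuine sentence ends. A real system would need a much larger annotated corpus and a richer feature set than this sketch assumes.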
Document type:
Conference paper
IEEE International Conference on Research, Innovation and Vision for the Future - RIVF 2008, Jul 2008, Ho Chi Minh City, Vietnam. 2008

Cited literature [24 references]

https://hal.inria.fr/inria-00334762
Contributor: Phuong Le-Hong
Submitted on: Monday, October 27, 2008 - 17:47:57
Last modified on: Tuesday, April 24, 2018 - 13:30:43
Document(s) archived on: Monday, June 7, 2010 - 19:30:46

File

rivf2008.pdf
Publisher files allowed on an open archive

Identifiers

  • HAL Id: inria-00334762, version 1

Citation

Hong Phuong Le, Tuong Vinh Ho. A Maximum Entropy Approach to Sentence Boundary Detection of Vietnamese Texts. IEEE International Conference on Research, Innovation and Vision for the Future - RIVF 2008, Jul 2008, Ho Chi Minh City, Vietnam. 2008. 〈inria-00334762〉

Metrics

Record views

251

File downloads

573