Conference paper, Year: 2013

Can MDL Improve Unsupervised Chinese Word Segmentation?

Abstract

It is often assumed that Minimum Description Length (MDL) is a good criterion for unsupervised word segmentation. In this paper, we introduce a new approach to unsupervised word segmentation of Mandarin Chinese that leads to segmentations whose Description Length is lower than what can be obtained using other algorithms previously proposed in the literature. Surprisingly, we show that this lower Description Length does not necessarily correspond to better segmentation results. Finally, we show that we can use very basic linguistic knowledge to coerce the MDL towards a linguistically plausible hypothesis and obtain better results than any previously proposed method for unsupervised Chinese word segmentation with minimal human effort.
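
To make the notion of Description Length concrete, here is a minimal sketch of one common two-part formulation: the bits needed to spell out the lexicon plus the bits needed to encode the corpus as a sequence of lexicon entries. This is an illustrative assumption, not necessarily the formulation used in the paper; the function name `description_length` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def description_length(segmented_corpus):
    """Two-part code length in bits: lexicon cost + corpus cost.

    Assumption: both parts use an entropy code based on relative
    frequencies (unigram model). The paper may define DL differently.
    """
    tokens = [word for sentence in segmented_corpus for word in sentence]
    word_counts = Counter(tokens)
    n_tokens = len(tokens)

    # Corpus cost: each token costs -log2 of its relative frequency.
    corpus_bits = -sum(
        count * math.log2(count / n_tokens) for count in word_counts.values()
    )

    # Lexicon cost: spell every word type once, character by character,
    # coding each character with -log2 of its frequency in the lexicon.
    lexicon_chars = [ch for word in word_counts for ch in word]
    char_counts = Counter(lexicon_chars)
    n_chars = len(lexicon_chars)
    lexicon_bits = -sum(
        count * math.log2(count / n_chars) for count in char_counts.values()
    )

    return lexicon_bits + corpus_bits


# Three candidate segmentations of the same toy corpus ("abab", "abab").
per_character = [["a", "b", "a", "b"], ["a", "b", "a", "b"]]
two_char_words = [["ab", "ab"], ["ab", "ab"]]
whole_sentences = [["abab"], ["abab"]]

for name, seg in [
    ("per character", per_character),
    ("two-character words", two_char_words),
    ("whole sentences", whole_sentences),
]:
    print(name, description_length(seg))
```

Under these assumptions the three segmentations score 10, 2, and 4 bits respectively. Very short units make the corpus expensive to encode, very long units make the lexicon expensive, and the MDL criterion prefers the segmentation that balances the two. Whether that minimum matches the linguistically correct segmentation is the question the paper examines.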
Main file
sighan7.pdf (149.66 KB)
Origin: files produced by the author(s)

Dates and versions

hal-00876389, version 1 (24-10-2013)

Identifiers

  • HAL Id: hal-00876389, version 1

Cite

Pierre Magistry, Benoît Sagot. Can MDL Improve Unsupervised Chinese Word Segmentation?. Sixth International Joint Conference on Natural Language Processing: Sighan workshop, Oct 2013, Nagoya, Japan. pp.2. ⟨hal-00876389⟩