Can MDL Improve Unsupervised Chinese Word Segmentation?

Pierre Magistry 1, Benoît Sagot 1
1 ALPAGE - Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing
Inria Paris-Rocquencourt, UPD7 - Université Paris Diderot - Paris 7
Abstract: It is often assumed that Minimum Description Length (MDL) is a good criterion for unsupervised word segmentation. In this paper, we introduce a new approach to unsupervised word segmentation of Mandarin Chinese that leads to segmentations whose Description Length is lower than what can be obtained using other algorithms previously proposed in the literature. Surprisingly, we show that this lower Description Length does not necessarily correspond to better segmentation results. Finally, we show that we can use very basic linguistic knowledge to coerce the MDL towards a linguistically plausible hypothesis and obtain better results than any previously proposed method for unsupervised Chinese word segmentation with minimal human effort.
Document type:
Conference paper
Sixth International Joint Conference on Natural Language Processing: Sighan workshop, Oct 2013, Nagoya, Japan. pp.2, 2013

https://hal.inria.fr/hal-00876389
Contributor: Pierre Magistry
Submitted on: Thursday, 24 October 2013 - 13:58:13
Last modified on: Thursday, 15 November 2018 - 20:27:26
Document(s) archived on: Friday, 7 April 2017 - 18:15:46

File

sighan7.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-00876389, version 1

Citation

Pierre Magistry, Benoît Sagot. Can MDL Improve Unsupervised Chinese Word Segmentation?. Sixth International Joint Conference on Natural Language Processing: Sighan workshop, Oct 2013, Nagoya, Japan. pp.2, 2013. 〈hal-00876389〉
