Can MDL Improve Unsupervised Chinese Word Segmentation?

Pierre Magistry 1, Benoît Sagot 1
1 ALPAGE - Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing
Inria Paris-Rocquencourt, UPD7 - Université Paris Diderot - Paris 7
Abstract : It is often assumed that Minimum Description Length (MDL) is a good criterion for unsupervised word segmentation. In this paper, we introduce a new approach to unsupervised word segmentation of Mandarin Chinese that leads to segmentations whose Description Length is lower than what can be obtained using other algorithms previously proposed in the literature. Surprisingly, we show that this lower Description Length does not necessarily correspond to better segmentation results. Finally, we show that we can use very basic linguistic knowledge to coerce the MDL towards a linguistically plausible hypothesis and obtain better results than any previously proposed method for unsupervised Chinese word segmentation with minimal human effort.
Document type :
Conference papers

https://hal.inria.fr/hal-00876389
Contributor : Pierre Magistry
Submitted on : Thursday, October 24, 2013 - 1:58:13 PM
Last modification on : Thursday, August 29, 2019 - 2:24:09 PM
Long-term archiving on : Friday, April 7, 2017 - 6:15:46 PM

File

sighan7.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-00876389, version 1

Citation

Pierre Magistry, Benoît Sagot. Can MDL Improve Unsupervised Chinese Word Segmentation? Sixth International Joint Conference on Natural Language Processing: Sighan workshop, Oct 2013, Nagoya, Japan. pp.2. ⟨hal-00876389⟩
