Unsupervized Word Segmentation: the case for Mandarin Chinese

Pierre Magistry 1 Benoît Sagot 1
1 ALPAGE - Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing
Inria Paris-Rocquencourt, UPD7 - Université Paris Diderot - Paris 7
Abstract : In this paper, we present an unsupervised segmentation system tested on Mandarin Chinese. Following Harris's hypothesis as reformulated by Kempe (1999) and Tanaka-Ishii (2005), we base our work on the Variation of Branching Entropy. We improve on Jin and Tanaka-Ishii (2006) by adding normalization and Viterbi decoding. This enables us to remove most of the thresholds and parameters from their model and to reach near state-of-the-art results (Wang et al., 2011) with a simpler system. We provide evaluation on different corpora available from the Segmentation bake-off II (Emerson, 2005) and define a more precise topline for the task using cross-trained supervised systems available off-the-shelf (Zhang and Clark, 2010; Zhao and Kit, 2008; Huang and Zhao, 2007).
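The abstract names branching entropy and its variation without defining them. As a rough illustration only, the sketch below computes a right branching entropy (the entropy of the character following an n-gram) and its variation between an n-gram and its one-character-shorter prefix. The function names, the `max_n` parameter, and the toy string are assumptions made for this example; the normalization and Viterbi decoding described in the paper are not reproduced here.

```python
import math
from collections import Counter, defaultdict


def right_branching_entropy(corpus, max_n=3):
    """Map each n-gram (n <= max_n) to the entropy, in bits, of the
    character that follows it in `corpus` (hypothetical helper)."""
    followers = defaultdict(Counter)
    for n in range(1, max_n + 1):
        # Stop one character early so every n-gram has a follower.
        for i in range(len(corpus) - n):
            followers[corpus[i:i + n]][corpus[i + n]] += 1
    entropy = {}
    for gram, counts in followers.items():
        total = sum(counts.values())
        entropy[gram] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy


def entropy_variation(entropy, gram):
    """Entropy of `gram` minus the entropy of `gram` without its last
    character; `gram` must have at least two characters."""
    return entropy[gram] - entropy[gram[:-1]]


# Toy data (assumption): "a" is always followed by "b",
# while "ab" is followed by four different characters.
toy = "abcabdabeabf"
h = right_branching_entropy(toy)
delta_ab = entropy_variation(h, "ab")
```

In this toy string the entropy rises sharply after "ab", which is the kind of signal that Harris's hypothesis associates with a possible boundary.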
Document type :
Conference papers

https://hal.inria.fr/hal-00701200
Contributor : Pierre Magistry
Submitted on : Thursday, May 24, 2012 - 5:16:04 PM
Last modification on : Friday, January 4, 2019 - 5:33:24 PM

Identifiers

  • HAL Id : hal-00701200, version 1

Citation

Pierre Magistry, Benoît Sagot. Unsupervized Word Segmentation: the case for Mandarin Chinese. ACL - Annual Meeting of the Association for Computational Linguistics - 2012, ACL, Jul 2012, Jeju, South Korea. ⟨hal-00701200⟩
