Segmentation non supervisée : le cas du mandarin (Unsupervised segmentation: the case of Mandarin)

Pierre Magistry 1
1 ALPAGE - Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing
Inria Paris-Rocquencourt, UPD7 - Université Paris Diderot - Paris 7
Abstract : In this paper, we present an unsupervised segmentation system tested on Mandarin Chinese. Following Harris's hypothesis as presented in Kempe (1999) and its reformulation by Tanaka-Ishii (2005), we base our work on the Variation of Branching Entropy. We improve on Jin and Tanaka-Ishii (2006) by adding normalization and Viterbi decoding. This enables us to remove most of the thresholds and parameters from their model and to reach near state-of-the-art results (Wang et al., 2011) with a simpler system. We provide an evaluation on different corpora available from the Segmentation Bakeoff II (Emerson, 2005) and define a more precise topline for the task using cross-trained supervised systems available off-the-shelf (Zhang and Clark, 2010; Zhao and Kit, 2008; Huang and Zhao, 2007).
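To make the central quantity of the abstract concrete, here is a minimal Python sketch of right branching entropy and its variation, under the usual definitions: the branching entropy of a character n-gram is the Shannon entropy of the character that follows it, and the variation is the difference between the entropy of an n-gram and that of the same n-gram without its last character. The function names, the `max_len` parameter, and the toy corpus are illustrative assumptions, not taken from the paper, and the normalization and Viterbi decoding steps mentioned in the abstract are not reproduced here.

```python
import math
from collections import Counter, defaultdict


def right_branching_entropy(corpus, max_len=3):
    """Map each character n-gram (length <= max_len) to the entropy, in bits,
    of the character that follows it. N-grams that only occur at the very end
    of a sentence have no follower and are left out (a simplifying assumption).
    """
    followers = defaultdict(Counter)
    for sentence in corpus:
        for i in range(len(sentence)):
            for n in range(1, max_len + 1):
                j = i + n
                if j < len(sentence):
                    followers[sentence[i:j]][sentence[j]] += 1

    entropy = {}
    for gram, counts in followers.items():
        total = sum(counts.values())
        entropy[gram] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy


def variation(entropy, gram):
    """Variation of branching entropy: H(gram) - H(gram without its last char).
    Only defined here for n-grams of length >= 2 that are present in `entropy`.
    """
    return entropy[gram] - entropy[gram[:-1]]


# Hypothetical toy corpus, used only to exercise the functions.
toy_corpus = ["abab", "abac"]
toy_entropy = right_branching_entropy(toy_corpus)
```

In this toy corpus, "ba" is followed by two different characters while "b" is always followed by "a", so the entropy rises after "ba". Under Harris's hypothesis, a rise of this kind is read as evidence of a possible boundary.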
Document type : Conference paper

https://hal.inria.fr/hal-00701197
Contributor : Pierre Magistry
Submitted on : Thursday, May 24, 2012 - 5:10:23 PM
Last modification on : Friday, January 4, 2019 - 5:33:24 PM

Identifiers

  • HAL Id : hal-00701197, version 1

Citation

Pierre Magistry. Segmentation non supervisée : le cas du mandarin. RECITAL - Rencontres des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues - 2012, ATALA, Jun 2012, Grenoble, France. ⟨hal-00701197⟩
