
Segmentation et induction de lexique non-supervisées du mandarin (Unsupervised segmentation and lexicon induction for Mandarin)

Pierre Magistry 1 Benoît Sagot 1 
1 ALPAGE - Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing
Inria Paris-Rocquencourt, UPD7 - Université Paris Diderot - Paris 7
Abstract: For most languages using the Latin alphabet, tokenizing a text on spaces and punctuation marks is a good approximation of a segmentation into lexical units. Although this approximation hides many difficulties, they do not compare with those arising when dealing with languages that do not use spaces, such as Mandarin Chinese. Many segmentation systems have been proposed, some of them using linguistically motivated unsupervised algorithms. However, standard evaluation practices fail to account for some properties of such systems. In this paper, we show that a simple model, based on an entropy-based reformulation of a language-independent hypothesis put forward by Harris (1955), allows for segmenting a corpus and extracting a lexicon from the results. Tested on the Academia Sinica Corpus, our system induces a segmentation and a lexicon with good intrinsic properties, whose characteristics are similar to those of the lexicon underlying the manually segmented corpus. Moreover, the results of the segmentation model correlate with the syntactic structures provided by the syntactically annotated subpart of the corpus.
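The abstract's core idea, an entropy-based reading of Harris's (1955) hypothesis, can be illustrated with a minimal sketch: word boundaries tend to occur where the uncertainty about the next character rises. The sketch below is a simplified illustration under assumed details, not the paper's actual model; the corpus, the unigram context length, and the "cut where entropy rises" criterion are all assumptions for demonstration.

```python
from collections import Counter, defaultdict
from math import log2

def branching_entropy(corpus, n=1):
    """Entropy of the character distribution following each n-gram context,
    estimated from a list of unsegmented strings."""
    follow = defaultdict(Counter)
    for sent in corpus:
        for i in range(len(sent) - n):
            follow[sent[i:i + n]][sent[i + n]] += 1
    ent = {}
    for ctx, nxt in follow.items():
        total = sum(nxt.values())
        ent[ctx] = -sum((c / total) * log2(c / total) for c in nxt.values())
    return ent

def segment(sentence, ent, n=1):
    """Cut before position i when the branching entropy of the preceding
    n-gram context rises relative to the previous position (a crude
    operationalization of Harris's hypothesis)."""
    cuts, prev_h = [], None
    for i in range(n, len(sentence)):
        h = ent.get(sentence[i - n:i], 0.0)
        if prev_h is not None and h > prev_h:
            cuts.append(i)
        prev_h = h
    words, start = [], 0
    for c in cuts + [len(sentence)]:
        words.append(sentence[start:c])
        start = c
    return words

# Toy example: after "b" many characters are possible (high entropy),
# after "a" only "b" ever follows (zero entropy), so a cut lands after "ab".
ent = branching_entropy(["abcabd", "abcabe"])
print(segment("abcab", ent))  # → ['ab', 'cab']
```

In practice an actual system would use longer contexts, both left and right entropies, and smoothing; this sketch only shows the shape of the boundary criterion.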
Document type :
Conference papers
Contributor: Pierre Magistry
Submitted on : Monday, July 4, 2011 - 4:53:51 PM
Last modification on : Friday, January 21, 2022 - 3:21:20 AM


  • HAL Id : inria-00605899, version 1



Pierre Magistry, Benoît Sagot. Segmentation et induction de lexique non-supervisées du mandarin. TALN'2011 - Traitement Automatique des Langues Naturelles, ATALA, Jun 2011, Montpellier, France. ⟨inria-00605899⟩
