Segmentation et induction de lexique non-supervisées du mandarin

Pierre Magistry ¹, Benoît Sagot ¹
¹ ALPAGE (Analyse Linguistique Profonde à Grande Echelle / Large-scale deep linguistic processing), Inria Paris-Rocquencourt, UPD7 - Université Paris Diderot - Paris 7
Abstract: For most languages using the Latin alphabet, tokenizing a text on spaces and punctuation marks is a good approximation of a segmentation into lexical units. Although this approximation hides many difficulties, they do not compare with those arising when dealing with languages that do not use spaces, such as Mandarin Chinese. Many segmentation systems have been proposed, some of which use linguistically motivated unsupervised algorithms. However, standard evaluation practices fail to account for some properties of such systems. In this paper, we show that a simple model, based on an entropy-based reformulation of a language-independent hypothesis put forward by Harris (1955), allows for segmenting a corpus and extracting a lexicon from the results. Tested on the Academia Sinica Corpus, our system allows for inducing a segmentation and a lexicon with good intrinsic properties and whose characteristics are similar to those of the lexicon underlying the manually-segmented corpus. Moreover, the results of the segmentation model correlate with the syntactic structures provided by the syntactically annotated subpart of the corpus.
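The Harris (1955) hypothesis the abstract refers to can be restated in information-theoretic terms: a word boundary is likely where the next character becomes hard to predict, i.e. where the branching entropy of the preceding context is high. The sketch below is a minimal illustration of that cue on a toy alphabet, not the authors' actual system; the function names, the unigram context length, and the entropy threshold are all assumptions made for the example.

```python
from collections import Counter, defaultdict
import math

def branching_entropy(corpus, n=1):
    """Shannon entropy (bits) of the character following each n-gram context."""
    succ = defaultdict(Counter)
    for i in range(len(corpus) - n):
        succ[corpus[i:i + n]][corpus[i + n]] += 1
    h = {}
    for ctx, counts in succ.items():
        total = sum(counts.values())
        h[ctx] = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h

def segment(text, corpus, n=1, threshold=0.5):
    """Cut after any context whose right-branching entropy exceeds the threshold:
    an unpredictable next character signals a word boundary (Harris-style cue)."""
    h = branching_entropy(corpus, n)
    words, start = [], 0
    for i in range(n, len(text)):
        if h.get(text[i - n:i], 0.0) > threshold:
            words.append(text[start:i])
            start = i
    words.append(text[start:])
    return words

# Toy unspaced corpus built from the "words" ab, cb, db: only 'b' ends a word,
# so the entropy after 'b' is high while other contexts are deterministic.
corpus = "abcbdbabdbcbab"
print(segment("abcbdb", corpus))  # → ['ab', 'cb', 'db']
```

Real systems refine this in several ways (longer contexts, left- as well as right-branching entropy, and decisions based on entropy *variation* rather than a fixed threshold), but the boundary cue itself is the one sketched here.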
Document type: Conference papers

https://hal.inria.fr/inria-00605899
Contributor: Pierre Magistry
Submitted on: Monday, July 4, 2011 - 4:53:51 PM
Last modification on: Thursday, August 29, 2019 - 2:24:09 PM

Identifiers

  • HAL Id: inria-00605899, version 1

Citation

Pierre Magistry, Benoît Sagot. Segmentation et induction de lexique non-supervisées du mandarin. TALN'2011 - Traitement Automatique des Langues Naturelles, ATALA, Jun 2011, Montpellier, France. ⟨inria-00605899⟩
