SanskritTagger : a stochastic lexical and pos tagger for Sanskrit

Oliver Hellwig

Conference paper, Year: 2007

Institut für die Sprachen und Kulturen Südasiens

Abstract

SanskritTagger is a stochastic tagger for unpreprocessed Sanskrit text. The tagger tokenises text with a Markov model and performs part-of-speech tagging with a Hidden Markov model. Parameters for these processes are estimated from a manually annotated corpus that currently contains about 1,500,000 words. The article sketches the tagging process, reports the results of tagging a few short passages of Sanskrit text and describes further improvements of the program.

The article describes the design and function of SanskritTagger, a tokeniser and part-of-speech (POS) tagger, which analyses "natural", i.e. unannotated, Sanskrit text by repeated application of stochastic models. This tagger has been developed during the last few years as part of a larger project for the digitisation of Sanskrit texts (cf. Hellwig, 2002) and is still being steadily improved. The article is organised as follows: Section 1 gives a short overview of the linguistic problems found in Sanskrit texts that influenced the design of the tagger. Section 2 describes the actual implementation of the tagger. In Section 3, the performance of the tagger is evaluated on short passages of text from different thematic areas. In addition, this section describes possible improvements in future versions.
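To make the part-of-speech step mentioned in the abstract more concrete, the following is a minimal sketch of how a Hidden Markov model can assign tags to an already tokenised sentence. It uses Viterbi decoding, which is the standard decoding method for HMMs; the abstract does not say which decoding algorithm SanskritTagger uses. The two-tag tagset, the function name `viterbi`, and all probabilities are invented for illustration and are not taken from the paper or its corpus.

```python
import math


def viterbi(tokens, tags, log_start, log_trans, log_emit, unseen=-1e9):
    """Return the most probable tag sequence for `tokens` under a toy HMM.

    All scores are log-probabilities; `unseen` is a crude penalty for
    token/tag pairs that are missing from the emission table.
    """
    if not tokens:
        return []
    # best[t][tag] = score of the best tag path that ends in `tag` at position t
    best = [{g: log_start[g] + log_emit[g].get(tokens[0], unseen) for g in tags}]
    back = [{}]
    for t in range(1, len(tokens)):
        best.append({})
        back.append({})
        for g in tags:
            # Pick the predecessor tag that maximises the path score.
            prev = max(tags, key=lambda p: best[t - 1][p] + log_trans[p][g])
            best[t][g] = (best[t - 1][prev] + log_trans[prev][g]
                          + log_emit[g].get(tokens[t], unseen))
            back[t][g] = prev
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda g: best[-1][g])]
    for t in range(len(tokens) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


def _log_table(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: {k2: math.log(v) for k2, v in row.items()} for k, row in table.items()}


# Hypothetical toy model (NOT estimated from the annotated corpus).
TAGS = ["NOUN", "VERB"]
LOG_START = {"NOUN": math.log(0.7), "VERB": math.log(0.3)}
LOG_TRANS = _log_table({
    "NOUN": {"NOUN": 0.4, "VERB": 0.6},
    "VERB": {"NOUN": 0.7, "VERB": 0.3},
})
LOG_EMIT = _log_table({
    "NOUN": {"rāmaḥ": 0.8, "gacchati": 0.2},
    "VERB": {"rāmaḥ": 0.1, "gacchati": 0.9},
})

if __name__ == "__main__":
    print(viterbi(["rāmaḥ", "gacchati"], TAGS, LOG_START, LOG_TRANS, LOG_EMIT))
    # prints ['NOUN', 'VERB']
```

In a real tagger the start, transition and emission tables would be estimated from annotated data, as the abstract describes, and the tagset would be far richer than the two tags assumed here.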

Domains

Document and Text Processing

Main file

Hellwig.pdf (184.63 KB)

Origin: Files produced by the author(s)

Contributor: Brigitte Briot

https://inria.hal.science/inria-00203467

Submitted on: Thursday, 10 January 2008, 11:24:33

Last modified on: Thursday, 6 January 2022, 12:14:04

Long-term archiving on: Tuesday, 13 April 2010, 16:54:28

Dates and versions

inria-00203467 , version 1 (10-01-2008)

Identifiers

HAL Id : inria-00203467 , version 1

Cite

Oliver Hellwig. SanskritTagger : a stochastic lexical and pos tagger for Sanskrit. First International Sanskrit Computational Linguistics Symposium, INRIA Paris-Rocquencourt, Oct 2007, Rocquencourt, France. ⟨inria-00203467⟩

Collections

SANSKRIT
