Modélisation statistique du langage à partir d'Internet pour la reconnaissance automatique de la parole continue

Abstract : Research in statistical language modelling lacks very large text corpora, especially for modelling spoken language. This thesis investigates the use of Internet documents to train such statistical models. After gathering corpora, we highlight several interesting properties: the huge quantity of text, the number of distinct French lexical forms, and especially the presence of spoken-dialogue utterances. Utterances of this kind are absent from the journalistic corpora that are nevertheless widely used. Over recent years, the evolution of Internet documents has made them increasingly well suited to this task. This thesis also introduces a new, fully automatic method for computing statistical language models from Internet data. The method starts with a special filter called "minimal blocks", which relies only on the lexicon. Then, with modified computation algorithms, statistical models such as n-grams can be obtained. With this method, word accuracy is about 90% for small vocabularies and about 80% for larger ones. Moreover, results on a state-of-the-art audio corpus provided by AUPELF for evaluation, obtained without any adaptation, are close to those of other research teams. The thesis also reports other applications of Internet documents. Using the French newsgroups hierarchy, we build a topic detector based on normalized unigram models; its topic detection accuracy is about 70%. Using this topic detector in speech recognition algorithms can increase word accuracy by up to 5%. Finally, an approach derived from the "minimal blocks" method was applied to define a set of sentences for recording an audio corpus.
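The abstract does not spell out how the "minimal blocks" filter works, so the following Python sketch is only one plausible reading, not the thesis's actual algorithm. It assumes a block is a run of consecutive tokens that all belong to the lexicon, and that runs shorter than a threshold are discarded; n-gram counts are then collected inside each retained block only. The function names `minimal_blocks` and `ngram_counts`, the `min_len` parameter, and the toy lexicon are all hypothetical.

```python
from collections import Counter


def minimal_blocks(tokens, lexicon, min_len=3):
    """Return runs of at least `min_len` consecutive in-lexicon tokens.

    Assumption: this is an illustrative interpretation of a lexicon-only
    filter, not the definition used in the thesis.
    """
    blocks, current = [], []
    for tok in tokens:
        if tok in lexicon:
            current.append(tok)
        else:
            # An out-of-lexicon token closes the current run.
            if len(current) >= min_len:
                blocks.append(current)
            current = []
    if len(current) >= min_len:
        blocks.append(current)
    return blocks


def ngram_counts(blocks, n=2):
    """Count n-grams inside each block, never across block boundaries."""
    counts = Counter()
    for block in blocks:
        for i in range(len(block) - n + 1):
            counts[tuple(block[i:i + n])] += 1
    return counts


# Toy data (hypothetical): "xyz" is not in the lexicon and splits the text.
lexicon = {"je", "voudrais", "une", "chambre", "pour", "demain"}
tokens = "je voudrais une chambre xyz pour demain".split()

blocks = minimal_blocks(tokens, lexicon, min_len=3)
bigrams = ngram_counts(blocks, n=2)
```

In this toy run, the run "pour demain" is dropped because it is shorter than `min_len`, so only the first four words contribute bigrams. Counting within blocks avoids creating n-grams that span filtered-out material.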
Document type :
Theses
Human-Computer Interaction [cs.HC]. Université Joseph-Fourier - Grenoble I, 2002. French

https://tel.archives-ouvertes.fr/tel-00326151
Contributor : Dominique Vaufreydaz
Submitted on : Wednesday, October 1, 2008 - 10:49:44 PM
Last modification on : Thursday, October 2, 2008 - 8:59:55 AM
Document(s) archived on : Friday, June 4, 2010 - 12:05:36 PM

Identifiers

  • HAL Id : tel-00326151, version 1

Collections

UJF

Citation

Dominique Vaufreydaz. Modélisation statistique du langage à partir d'Internet pour la reconnaissance automatique de la parole continue. Interface homme-machine [cs.HC]. Université Joseph-Fourier - Grenoble I, 2002. Français. <tel-00326151>

Metrics

Record views

267

Document downloads

1197