Conference paper, Year: 1999

Automatic and manual clustering for large vocabulary speech recognition: a comparative study

Abstract

This article describes a comparative study of language models in which the evaluation protocol was set by AUPELF-UREF. We focus on comparing two methods of clustering words, a step required to design the corresponding language models. The first classification follows a linguistic and theoretical method; the second is based on an optimization method. Both methods are evaluated through the Shannon game. The vocabulary contains 20,000 words, the training corpus consists of two years of the newspaper Le Monde (42 million words), and the test corpus (400,000 words) is extracted from six years of Le Monde Diplomatique. First evaluations show an improvement of 13% in the number of words recognized within the first five ranks and a 25% decrease in perplexity.
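The two figures reported in the abstract, the share of words found within the first five ranks and the perplexity, can be illustrated with a minimal Python sketch. It is not the AUPELF-UREF protocol or the authors' implementation: the function names `perplexity` and `top_k_hit_rate`, the dictionary-based representation of a model's predictions, and the toy data are all assumptions made for illustration.

```python
import math


def perplexity(probabilities):
    """Perplexity of a test sequence, given the probability the model
    assigned to each word that actually occurred."""
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / len(probabilities))


def top_k_hit_rate(predictions, targets, k=5):
    """Fraction of positions where the true word is among the k candidates
    the model ranks highest (a Shannon-game style score)."""
    hits = 0
    for dist, target in zip(predictions, targets):
        # Rank candidate words by decreasing model probability.
        ranked = sorted(dist, key=dist.get, reverse=True)[:k]
        hits += target in ranked
    return hits / len(targets)


# Hypothetical toy data: one predicted distribution per test position.
predictions = [
    {"le": 0.5, "la": 0.3, "un": 0.2},
    {"jour": 0.5, "monde": 0.3, "chat": 0.2},
]
targets = ["la", "monde"]

# Probability the model gave to each word that actually occurred.
probs = [dist[word] for dist, word in zip(predictions, targets)]

print(perplexity(probs))
print(top_k_hit_rate(predictions, targets, k=1))
print(top_k_hit_rate(predictions, targets, k=2))
```

In this toy case both true words are ranked second, so the hit rate rises from 0 to 1 when k goes from 1 to 2. A lower perplexity and a higher top-k hit rate both indicate a model that predicts the test text better.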

Domains

Other [cs.OH]
File not deposited

Dates and versions

inria-00098977, version 1 (26-09-2006)

Identifiers

  • HAL Id: inria-00098977, version 1

Cite

Kamel Smaïli, Armelle Brun, Imed Zitouni, Jean-Paul Haton. Automatic and manual clustering for large vocabulary speech recognition: a comparative study. 6th European Conference on Speech Communication & Technology - EUROSPEECH'99, 1999, Budapest, Hungary, 4 p. ⟨inria-00098977⟩
116 views
0 downloads