Automatic and manual clustering for large vocabulary speech recognition: a comparative study

Kamel Smaïli 1 Armelle Brun 1 Imed Zitouni 1 Jean-Paul Haton 1
1 PAROLE - Analysis, perception and recognition of speech
INRIA Lorraine, LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
Abstract : This article describes a comparative study of language models for which the evaluation protocol was set by AUPELF-UREF. We pay particular attention to the comparison between two methods of clustering words, which are necessary in the design of the corresponding language models. The first classification follows a linguistic, theoretical method; the second is based on an optimization method. Both methods are evaluated through the Shannon game. The vocabulary contains 20 000 words, the training corpus consists of two years of the Le Monde newspaper (42M words), and the test corpus (400 000 words) is extracted from 6 years of Le Monde Diplomatique. First evaluations show an improvement of 13% in the words recognized within the first five ranks and a decrease of 25% in perplexity.
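The two figures reported in the abstract (words found within the first five ranks, and perplexity) can be illustrated with a minimal sketch. This is not the authors' evaluation code or the AUPELF-UREF protocol: the `Model` interface, the `evaluate` function, and the toy `uniform_model` are hypothetical names introduced here, and treating the Shannon game as "is the true next word among the model's k most probable candidates" is an assumption about how the rank score is computed.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical interface (not from the paper): a model maps the word
# history to a probability distribution over the vocabulary.
Model = Callable[[Sequence[str]], Dict[str, float]]


def evaluate(model: Model, test_words: Sequence[str], k: int = 5) -> Tuple[float, float]:
    """Return (perplexity, fraction of test words ranked in the model's top k)."""
    if not test_words:
        raise ValueError("test_words must not be empty")
    log_prob_sum = 0.0
    hits = 0
    for i, word in enumerate(test_words):
        dist = model(test_words[:i])
        p = dist.get(word, 0.0)
        if p <= 0.0:
            raise ValueError(f"model assigns zero probability to {word!r}")
        log_prob_sum += math.log2(p)
        # Rank candidates by descending probability; ties are broken
        # alphabetically so the result is deterministic.
        ranked = sorted(dist, key=lambda w: (-dist[w], w))
        if word in ranked[:k]:
            hits += 1
    n = len(test_words)
    # Perplexity is 2 raised to the average negative log2-probability.
    perplexity = 2.0 ** (-log_prob_sum / n)
    return perplexity, hits / n


def uniform_model(history: Sequence[str]) -> Dict[str, float]:
    """Toy model for illustration only: ignores the history entirely."""
    vocab = ["a", "b", "c", "d"]
    return {w: 1.0 / len(vocab) for w in vocab}
```

Under these assumptions, a class-based model and a word-based model could be compared by passing each to `evaluate` with the same test sequence and `k=5`; lower perplexity and a higher top-k fraction would indicate the better model.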
Document type :
Conference papers

https://hal.inria.fr/inria-00098977
Contributor : Publications Loria
Submitted on : Tuesday, September 26, 2006 - 8:40:57 AM
Last modification on : Friday, February 26, 2021 - 3:28:06 PM

Identifiers

  • HAL Id : inria-00098977, version 1

Citation

Kamel Smaïli, Armelle Brun, Imed Zitouni, Jean-Paul Haton. Automatic and manual clustering for large vocabulary speech recognition: a comparative study. 6th European Conference on Speech Communication & Technology - EUROSPEECH'99, 1999, Budapest, Hungary, 4 p. ⟨inria-00098977⟩
