Conference papers

A Machine Learning Based Approach for Vocabulary Selection for Speech Transcription

Denis Jouvet 1 David Langlois 1
1 PAROLE - Analysis, perception and recognition of speech
Inria Nancy - Grand Est, LORIA - NLPKD - Department of Natural Language Processing & Knowledge Discovery
Abstract : This paper introduces a new neural network based approach for selecting the vocabulary to be used in a speech transcription system. Large sets of text data can now be collected from web sources and used, in addition to more traditional text sources, for building the language models of speech transcription systems. However, web sources yield large amounts of heterogeneous data and, as a consequence, standard vocabulary selection procedures based on unigram approaches tend to select undesirable items as new words. As an alternative to unigram-based and empirical manual selection approaches, this paper proposes a new selection procedure that relies on a machine learning technique, namely neural networks. The paper presents and discusses the results obtained with the various selection procedures. The results of the neural network based selection are promising, and the approach can automatically handle various kinds of detailed information in the selection process.

Contributor : Denis Jouvet
Submitted on : Friday, June 14, 2013 - 4:22:08 PM
Last modification on : Wednesday, February 2, 2022 - 5:00:57 PM


  • HAL Id : hal-00834302, version 1


Denis Jouvet, David Langlois. A Machine Learning Based Approach for Vocabulary Selection for Speech Transcription. TSD - 16th International Conference on Text, Speech and Dialogue - 2013, Sep 2013, Pilsen, Czech Republic. pp.60-67. ⟨hal-00834302⟩


