Internet Documents: A Rich Source for Spoken Language Modeling

Abstract : Speech recognition systems for spoken language need a better understanding of natural spoken language phenomena than their dictation counterparts. Current language models are mostly based on written text and/or on very tedious Wizard of Oz or real dialog experiments. In this paper we propose to use Internet documents as a very rich source of information for spoken language modeling. Through detailed experiments we show how, using the Internet, language models adapted to a given task can be prepared automatically. For a given recognition system, this approach yields a word accuracy up to 15% better than that of a system using language models trained on written text.
Document type : Conference papers

https://hal.inria.fr/inria-00326147
Contributor : Dominique Vaufreydaz
Submitted on : Wednesday, October 1, 2008 - 10:18:27 PM
Last modification on : Thursday, February 7, 2019 - 5:03:31 PM
Long-term archiving on : Friday, June 4, 2010 - 12:05:05 PM

File

Vaufreydaz99c.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : inria-00326147, version 1

Collections

LIG | UGA

Citation

Dominique Vaufreydaz, Mohamad Akbar, José Rouillard. Internet Documents: A Rich Source for Spoken Language Modeling. IEEE Workshop ASRU'99 (Automatic Speech Recognition and Understanding), IEEE, Dec 1999, Keystone - Colorado, United States. pp. 277-281. ⟨inria-00326147⟩
