Internet Documents: A Rich Source for Spoken Language Modeling

Dominique Vaufreydaz (1), Mohamad Akbar (1), José Rouillard (1)
(1) CLIPS-IMAG - Equipe GEOD, Groupe d'étude sur l'oral et le dialogue
LIG [2007-2015] - Laboratoire d'Informatique de Grenoble [2007-2015]
Abstract : Spoken-language speech recognition systems need a better understanding of natural spoken language phenomena than their dictation counterparts. Current language models are mostly based on written text and/or on very tedious Wizard of Oz or real dialog experiments. In this paper we propose using Internet documents as a very rich source of information for spoken language modeling. Through detailed experiments we show how, using the Internet, language models adapted to a given task can be prepared automatically. For a given recognition system, this approach yields a word accuracy up to 15% better than that of a system using language models trained on written text.
Document type :
Conference papers
https://hal.inria.fr/inria-00326147
Contributor : Dominique Vaufreydaz
Submitted on : Wednesday, October 1, 2008 - 10:18:27 PM
Last modification on : Friday, July 17, 2020 - 11:10:21 AM
Long-term archiving on : Friday, June 4, 2010 - 12:05:05 PM

File

Vaufreydaz99c.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : inria-00326147, version 1

Collections

CNRS | LIG | UGA

Citation

Dominique Vaufreydaz, Mohamad Akbar, José Rouillard. Internet Documents: A Rich Source for Spoken Language Modeling. IEEE Workshop ASRU'99 (Automatic Speech Recognition and Understanding), IEEE, Dec 1999, Keystone - Colorado, United States. pp. 277-281. ⟨inria-00326147⟩
