Conference paper, Year: 2000

A NEW METHODOLOGY FOR SPEECH CORPORA DEFINITION FROM INTERNET DOCUMENTS

Abstract

This paper describes a new methodology for defining speech corpora from Internet documents, with the aim of recording a large speech database dedicated to training and testing acoustic models for speech recognition. The first section presents the Web robot in charge of collecting Web pages from the Internet, then explains the mechanism that filters Web text into French sentences. Some information about the corpus organization (90% for training and 10% for testing) is given. The third section presents the phoneme distribution of the corpus and compares it with other studies of the French language. Finally, the tools and plan for recording the speech database with more than one hundred speakers are described.
Main file
Vaufreydaz00.pdf (84.25 KB)
Origin: files produced by the author(s)

Dates and versions

inria-00326150, version 1 (01-10-2008)

Identifiers

  • HAL Id: inria-00326150, version 1

Cite

Dominique Vaufreydaz, Carole Bergamini, Jean-François Serignat, Laurent Besacier, Mohamad Akbar. A NEW METHODOLOGY FOR SPEECH CORPORA DEFINITION FROM INTERNET DOCUMENTS. LREC'2000 (Language Resources & Evaluation international Conference), Jun 2000, Athens, Greece. pp. 423-426. ⟨inria-00326150⟩
