A NEW METHODOLOGY FOR SPEECH CORPORA DEFINITION FROM INTERNET DOCUMENTS

Dominique Vaufreydaz; Carole Bergamini; Jean-François Serignat; Laurent Besacier; Mohamad Akbar

Conference paper, Year: 2000

Dominique Vaufreydaz

Role: Author
PersonId : 8656
IdHAL : vaufreydaz
ORCID : 0000-0002-8825-0973
IdRef : 064812596

Equipe GEOD, Groupe d'étude sur l'oral et le dialogue

Carole Bergamini

Role: Author

Equipe GEOD, Groupe d'étude sur l'oral et le dialogue

Jean-François Serignat

Role: Author

Equipe GEOD, Groupe d'étude sur l'oral et le dialogue

Laurent Besacier

Role: Author
PersonId : 1521
IdHAL : laurent-besacier
ORCID : 0000-0001-7411-9125
IdRef : 079377017

Equipe GEOD, Groupe d'étude sur l'oral et le dialogue

Mohamad Akbar

Role: Author

Equipe GEOD, Groupe d'étude sur l'oral et le dialogue

Abstract

This paper describes a new methodology for defining speech corpora from Internet documents, with the aim of recording a large speech database dedicated to training and testing acoustic models for speech recognition. The first section presents the Web robot in charge of collecting Web pages from the Internet, then explains the mechanism that filters Web text into French sentences. Some information about the corpus organization (90% for training and 10% for testing) is given. The third section presents the phoneme distribution of the corpus and compares it with other studies of the French language. Finally, the tools and planning for recording the speech database with more than one hundred speakers are described.

Domains

Computation and Language [cs.CL]; Sound [cs.SD]

Main file

Vaufreydaz00.pdf (84.25 KB)

Origin: Files produced by the author(s)

https://inria.hal.science/inria-00326150

Submitted on: Wednesday, October 1, 2008, 22:40:23

Last modified on: Thursday, April 4, 2024, 21:41:00

Long-term archiving on: Friday, June 4, 2010, 12:05:31

Dates and versions

inria-00326150, version 1 (01-10-2008)

Identifiers

HAL Id: inria-00326150, version 1

Cite

Dominique Vaufreydaz, Carole Bergamini, Jean-François Serignat, Laurent Besacier, Mohamad Akbar. A NEW METHODOLOGY FOR SPEECH CORPORA DEFINITION FROM INTERNET DOCUMENTS. LREC'2000 (Language Resources & Evaluation international Conference), Jun 2000, Athens, Greece. pp. 423-426. ⟨inria-00326150⟩

Collections

UGA CNRS LIG POLYTECH-GRENOBLE LIG_SIDCH

1253 views

755 downloads
