A NEW METHODOLOGY FOR SPEECH CORPORA DEFINITION FROM INTERNET DOCUMENTS

Dominique Vaufreydaz; Carole Bergamini; Jean-François Serignat; Laurent Besacier; Mohamad Akbar

Conference Papers, Year : 2000

Dominique Vaufreydaz

Function : Author
PersonId : 8656
IdHAL : vaufreydaz
ORCID : 0000-0002-8825-0973
IdRef : 064812596

Equipe GEOD, Groupe d'étude sur l'oral et le dialogue

Carole Bergamini

Function : Author

Equipe GEOD, Groupe d'étude sur l'oral et le dialogue

Jean-François Serignat

Function : Author

Equipe GEOD, Groupe d'étude sur l'oral et le dialogue

Laurent Besacier

Function : Author
PersonId : 1521
IdHAL : laurent-besacier
ORCID : 0000-0001-7411-9125
IdRef : 079377017

Equipe GEOD, Groupe d'étude sur l'oral et le dialogue

Mohamad Akbar

Function : Author

Equipe GEOD, Groupe d'étude sur l'oral et le dialogue

Abstract

This paper describes a new methodology for defining speech corpora from Internet documents, with the aim of recording a large speech database dedicated to training and testing acoustic models for speech recognition. The first section presents the Web robot in charge of collecting Web pages from the Internet; the mechanism that filters Web text into French sentences is then explained. Some information about the corpus organization (90% for training and 10% for testing) is given. The third section presents the phoneme distribution of the corpus and compares it with other studies of the French language. Finally, the tools and schedule for recording the speech database with more than one hundred speakers are described.
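
The 90%/10% corpus organization mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' tool: the function name `split_corpus`, the shuffling step, the fixed seed, and the placeholder sentences are all assumptions made for illustration, since the abstract does not say how sentences were assigned to each part.

```python
import random


def split_corpus(sentences, train_fraction=0.9, seed=0):
    """Shuffle sentences and split them into training and test parts.

    Hypothetical helper: the shuffle and the seed are illustrative
    assumptions, not details taken from the paper.
    """
    shuffled = list(sentences)              # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)   # reproducible shuffle
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder data standing in for the filtered French sentences.
sentences = [f"phrase {i}" for i in range(100)]
train, test = split_corpus(sentences)
print(len(train), len(test))
```

With 100 placeholder sentences, the split yields 90 training sentences and 10 test sentences, and every sentence ends up in exactly one of the two parts.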

Domains

Computation and Language [cs.CL], Sound [cs.SD]

Main file

Vaufreydaz00.pdf (84.25 KB)

Origin : Files produced by the author(s)

Contributor : Dominique Vaufreydaz

https://inria.hal.science/inria-00326150

Submitted on : Wednesday, October 1, 2008, 10:40:23 PM

Last modification on : Thursday, April 4, 2024, 9:41:00 PM

Long-term archiving on : Friday, June 4, 2010, 12:05:31 PM

Dates and versions

inria-00326150 , version 1 (01-10-2008)

Identifiers

HAL Id : inria-00326150 , version 1

Cite

Dominique Vaufreydaz, Carole Bergamini, Jean-François Serignat, Laurent Besacier, Mohamad Akbar. A NEW METHODOLOGY FOR SPEECH CORPORA DEFINITION FROM INTERNET DOCUMENTS. LREC'2000 (Language Resources & Evaluation international Conference), Jun 2000, Athens, Greece. pp. 423-426. ⟨inria-00326150⟩

Collections

UGA CNRS LIG POLYTECH-GRENOBLE LIG_SIDCH
