A Phonemic Corpus of Polish Child-Directed Speech

Abstract: Recent advances in modeling early language acquisition are due not only to the development of machine-learning techniques, but also to the increasing availability of data on child language and child-adult interaction. In the absence of recordings of child-directed speech, or when models explicitly require such a representation for training data, phonemic transcriptions are commonly used as input data. We present a novel (and, to our knowledge, the first) phonemic corpus of Polish child-directed speech. It is derived from the Weist corpus of Polish, freely available from the seminal CHILDES database. For the sake of reproducibility, and to exemplify the typical trade-off between ecological validity and sample size, we report all preprocessing operations and transcription guidelines. Contributed linguistic resources include updated CHAT-formatted transcripts with phonemic transcriptions in a novel phonology tier, as well as by-product data, such as a phonemic lexicon of Polish. All resources are distributed under the LGPL-LR license.
Document type:
Conference paper
LREC 2012 - Eighth International Conference on Language Resources and Evaluation, May 2012, Istanbul, Turkey. 2012

https://hal.inria.fr/hal-00702437
Contributor: Luc Boruta
Submitted on: Wednesday, May 30, 2012 - 11:35:46
Last modified on: Thursday, January 11, 2018 - 06:19:18
Document(s) archived on: Thursday, December 15, 2016 - 10:19:56

Identifiers

  • HAL Id: hal-00702437, version 1

Collections

INRIA | EHESS | LSCP | PSL | USPC

Citation

Luc Boruta, Justyna Jastrzebska. A Phonemic Corpus of Polish Child-Directed Speech. LREC 2012 - Eighth International Conference on Language Resources and Evaluation, May 2012, Istanbul, Turkey. 2012. 〈hal-00702437〉
