
A Phonemic Corpus of Polish Child-Directed Speech

Abstract: Recent advances in modeling early language acquisition are due not only to the development of machine-learning techniques, but also to the increasing availability of data on child language and child-adult interaction. In the absence of recordings of child-directed speech, or when models explicitly require such a representation for training data, phonemic transcriptions are commonly used as input data. We present a novel (and, to our knowledge, the first) phonemic corpus of Polish child-directed speech. It is derived from the Weist corpus of Polish, freely available from the seminal CHILDES database. For the sake of reproducibility, and to exemplify the typical trade-off between ecological validity and sample size, we report all preprocessing operations and transcription guidelines. Contributed linguistic resources include updated CHAT-formatted transcripts with phonemic transcriptions in a novel phonology tier, as well as by-product data, such as a phonemic lexicon of Polish. All resources are distributed under the LGPL-LR license.
Document type: Conference paper
Contributor: Luc Boruta
Submitted on: Wednesday, May 30, 2012 - 11:35:46 AM
Last modification on: Thursday, March 17, 2022 - 10:08:24 AM
Long-term archiving on: Thursday, December 15, 2016 - 10:19:56 AM


  • HAL Id: hal-00702437, version 1



Luc Boruta, Justyna Jastrzebska. A Phonemic Corpus of Polish Child-Directed Speech. LREC 2012 - Eighth International Conference on Language Resources and Evaluation, May 2012, Istanbul, Turkey. ⟨hal-00702437⟩


