A Phonemic Corpus of Polish Child-Directed Speech

Abstract : Recent advances in modeling early language acquisition are due not only to the development of machine-learning techniques, but also to the increasing availability of data on child language and child-adult interaction. In the absence of recordings of child-directed speech, or when models explicitly require such a representation for training data, phonemic transcriptions are commonly used as input data. We present a novel (and to our knowledge, the first) phonemic corpus of Polish child-directed speech. It is derived from the Weist corpus of Polish, freely available from the seminal CHILDES database. For the sake of reproducibility, and to exemplify the typical trade-off between ecological validity and sample size, we report all preprocessing operations and transcription guidelines. Contributed linguistic resources include updated CHAT-formatted transcripts with phonemic transcriptions in a novel phonology tier, as well as by-product data, such as a phonemic lexicon of Polish. All resources are distributed under the LGPL-LR license.
Document type :
Conference papers

https://hal.inria.fr/hal-00702437
Contributor : Luc Boruta
Submitted on : Wednesday, May 30, 2012 - 11:35:46 AM
Last modification on : Friday, September 6, 2019 - 8:28:01 PM
Long-term archiving on : Thursday, December 15, 2016 - 10:19:56 AM

Files

final.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-00702437, version 1

Citation

Luc Boruta, Justyna Jastrzebska. A Phonemic Corpus of Polish Child-Directed Speech. LREC 2012 - Eighth International Conference on Language Resources and Evaluation, May 2012, Istanbul, Turkey. ⟨hal-00702437⟩
