hal-00639123, version 1
Clustering repeated Out-Of-Vocabulary word tokens in order to model them for broadcast news transcription
The XIVth International Conference Speech and Computer - SPECOM'2011 (2011) 73-80
Abstract: This paper describes a novel technique for detecting repeated Out-Of-Vocabulary (OOV) word tokens in a large-vocabulary continuous speech recognition (LVCSR) system used to transcribe broadcast news speech. The system consists of two blocks: an OOV word detector and a clustering module that operates on the detected OOV word segments. This combination allows more reliable detection of repeated OOV words. The paper focuses on the second block, i.e., the clustering algorithm, which is based on entropy estimation. The proposed clustering algorithm outperforms a classical baseline incremental clustering algorithm that uses a distance threshold to classify OOV word tokens. Furthermore, combining the OOV word token detector with clustering of the OOV segments achieves better precision than the OOV word detector alone.
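For readers unfamiliar with the baseline named in the abstract, the following is a minimal sketch of generic incremental clustering with a distance threshold. It is not the authors' implementation: the Euclidean distance, the running-mean centroid update, the function name `incremental_cluster`, and the toy 2-D vectors are all assumptions made for illustration, since the abstract does not specify the segment features or the distance measure.

```python
from math import dist  # Euclidean distance, Python 3.8+


def incremental_cluster(points, threshold):
    """Assign each point to the nearest existing centroid if it lies within
    `threshold`; otherwise open a new cluster (hypothetical helper)."""
    centroids = []  # running mean of each cluster
    counts = []     # number of points in each cluster
    labels = []     # cluster index assigned to each input point
    for p in points:
        best, best_d = None, threshold
        for i, c in enumerate(centroids):
            d = dist(p, c)
            if d <= best_d:
                best, best_d = i, d
        if best is None:
            # No centroid is close enough: start a new cluster at this point.
            centroids.append(list(p))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            # Update the running mean of the chosen cluster.
            counts[best] += 1
            n = counts[best]
            centroids[best] = [c + (x - c) / n for c, x in zip(centroids[best], p)]
            labels.append(best)
    return labels


# Toy 2-D vectors standing in for OOV segment features (invented data).
segments = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 4.9), (0.05, 0.1)]
labels = incremental_cluster(segments, threshold=1.0)
print(labels)
```

In a scheme of this kind the result depends on the order in which segments arrive and on the chosen threshold, which is the sort of sensitivity that motivates comparing it against other clustering criteria.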
- 1:
- INRIA – CNRS : UMR7503 – Université Henri Poincaré - Nancy I – Université Nancy II – Institut National Polytechnique de Lorraine (INPL)
- Domain : Computer Science/Human-Computer Interaction
- http://hal.archives-ouvertes.fr/hal-00639123
- oai:hal.archives-ouvertes.fr:hal-00639123
- Submitted on: Tuesday, 8 November 2011 12:06:36
- Updated on: Thursday, 10 November 2011 10:07:22

