
Conversion de la voix : Approches et applications

Abstract : Voice conversion (VC) is an important problem in the field of audio signal processing. The goal of voice conversion is to transform the speech signal of a source speaker so that it sounds as if it had been uttered by a target speaker, while preserving the linguistic content of the original signal. Gaussian mixture model (GMM) based conversion is the most commonly used technique in VC, but it is prone to overfitting and oversmoothing. To address these issues, we propose a secondary classification: a K-means classification is applied within each class obtained by a primary classification in order to obtain more precise local conversion functions. This proposal avoids the need for complex training algorithms because the estimated local mapping functions are determined at the same time. The second contribution of this thesis is a new methodology for modeling the relationship between two sets of spectral envelopes. Our systems operate by: 1) cascading Deep Neural Networks with Gaussian Mixture Models to construct DNN-GMM and GMM-DNN-GMM models, in order to find an efficient global mapping between the cepstral vectors of the two speakers; 2) using a new spectral synthesis process with excitation and phase extracted from the target training space, which is encoded as a KD-tree. Experimental results show that the proposed methods markedly improve the intelligibility, quality and naturalness of the converted speech signals compared with those obtained by a baseline conversion method. Extracting the excitation and phase from the target training space preserves the target speaker's identity. The last contribution of this thesis is the implementation of a novel speaking-aid system for enhancing esophageal speech (ES). The method adopted in this thesis aims to improve the quality of esophageal speech by combining a voice conversion technique with a time dilation algorithm.
In the proposed system, a Deep Neural Network (DNN) is used as a nonlinear mapping function for vocal tract vector conversion. The converted frames are then used to determine realistic excitation and phase vectors from the target training space by means of a frame selection algorithm. We demonstrate that our proposed method considerably improves the intelligibility and naturalness of the converted esophageal stimuli.
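The frame selection step mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes that each target training frame is stored as a cepstral vector paired with an opaque excitation/phase payload, that the selection criterion is the nearest neighbour under squared Euclidean distance, and that the KD-tree is a plain median-split tree. All function and variable names (`build_kdtree`, `nearest`, `select_frames`, `target_payloads`) are hypothetical.

```python
from math import inf


def build_kdtree(points, indices=None, depth=0):
    """Build a median-split KD-tree over `points` (equal-length tuples).

    Nodes store the index of a point, so a payload kept in a parallel
    list (here: excitation/phase data) can be looked up after a search.
    """
    if indices is None:
        indices = list(range(len(points)))
    if not indices:
        return None
    axis = depth % len(points[0])  # cycle through the dimensions
    indices = sorted(indices, key=lambda i: points[i][axis])
    mid = len(indices) // 2
    return {
        "index": indices[mid],
        "axis": axis,
        "left": build_kdtree(points, indices[:mid], depth + 1),
        "right": build_kdtree(points, indices[mid + 1:], depth + 1),
    }


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node, points, query, best=(inf, -1)):
    """Return (squared distance, index) of the stored point closest to `query`."""
    if node is None:
        return best
    i = node["index"]
    d = sq_dist(points[i], query)
    if d < best[0]:
        best = (d, i)
    diff = query[node["axis"]] - points[i][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, points, query, best)
    # Visit the far side only if the splitting plane is closer than the best match.
    if diff * diff < best[0]:
        best = nearest(far, points, query, best)
    return best


def select_frames(converted, target_cepstra, target_payloads):
    """For each converted frame, pick the payload of the nearest target frame."""
    tree = build_kdtree(target_cepstra)
    return [target_payloads[nearest(tree, target_cepstra, q)[1]] for q in converted]


if __name__ == "__main__":
    # Toy 2-D "cepstral" vectors; real systems use higher-dimensional features.
    target_cepstra = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (9.0, 1.0)]
    target_payloads = ["e0", "e1", "e2", "e3"]  # stand-ins for excitation/phase
    converted = [(0.9, 1.2), (8.0, 0.0), (4.0, 6.0)]
    print(select_frames(converted, target_cepstra, target_payloads))
```

Storing indices rather than vectors in the tree keeps the excitation and phase data out of the search structure, so the same tree could serve any payload aligned with the target frames.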

Cited literature [86 references]
Contributor : Joseph Di Martino
Submitted on : Monday, September 2, 2019 - 2:19:26 PM
Last modification on : Tuesday, June 16, 2020 - 11:28:03 AM
Long-term archiving on : Thursday, January 9, 2020 - 9:50:18 AM




  • HAL Id : tel-02276259, version 1


Imen Ben Othmane. Conversion de la voix : Approches et applications. Traitement du signal et de l'image [eess.SP]. Université de Carthage (Tunisie), 2019. Français. ⟨tel-02276259⟩