Uncertainty learning for noise robust ASR

Dung Tien Tran

Abstract

This thesis addresses noise-robust automatic speech recognition (ASR). It has two parts. First, we focus on better handling of uncertainty to improve ASR performance in noisy environments. Second, we present a method to accelerate the training of a neural network using an auxiliary function technique.

In the first part, multichannel speech enhancement is applied to the noisy input speech. The posterior distribution of the underlying clean speech is then estimated and represented by its mean and its covariance matrix, or uncertainty. We show how to propagate the diagonal uncertainty covariance matrix in the spectral domain through the feature computation stage to obtain the full uncertainty covariance matrix in the feature domain. Uncertainty decoding exploits this posterior distribution to dynamically modify the acoustic model parameters in the decoding rule. The uncertainty decoding rule simply consists of adding the uncertainty covariance matrix of the enhanced features to the variance of each Gaussian component.

We then propose two uncertainty estimators, based on fusion and on nonparametric estimation, respectively. To build a new estimator, we consider a linear combination of existing uncertainty estimators or of kernel functions. The combination weights are estimated generatively by minimizing a divergence with respect to the oracle uncertainty. The divergence measures used are weighted versions of the Kullback-Leibler (KL), Itakura-Saito (IS), and Euclidean (EU) divergences. Because uncertainty is inherently nonnegative, this estimation problem can be seen as an instance of weighted nonnegative matrix factorization (NMF).

In addition, we propose two discriminative uncertainty estimators based on a linear or nonlinear mapping of the generatively estimated uncertainty. This mapping is trained to maximize the boosted maximum mutual information (bMMI) criterion. We compute the derivative of this criterion using the chain rule and optimize it using stochastic gradient descent.

In the second part, we introduce a new learning rule for neural networks that is based on an auxiliary function technique and requires no parameter tuning. Instead of minimizing the objective function directly, this technique minimizes a quadratic auxiliary function that is introduced recursively, layer by layer, and has a closed-form optimum. The properties of this auxiliary function guarantee that the objective function decreases monotonically under the new learning rule.
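The uncertainty decoding rule stated in the abstract (adding the uncertainty covariance of the enhanced features to the variance of each Gaussian component) can be illustrated with a minimal sketch. The function names, the pure-Python Cholesky helper, and the assumption that each Gaussian component has a diagonal covariance are illustrative choices made here, not details taken from the thesis.

```python
import math


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def gaussian_logpdf(x, mean, cov):
    """Log-density of a multivariate Gaussian with full covariance `cov`."""
    n = len(x)
    low = cholesky(cov)
    diff = [xi - mi for xi, mi in zip(x, mean)]
    # Forward substitution: solve low * z = diff, so that z.z is the
    # squared Mahalanobis distance.
    z = []
    for i in range(n):
        z.append((diff[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    maha = sum(v * v for v in z)
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + maha)


def uncertainty_decoding_loglik(feat_mean, feat_unc, comp_mean, comp_var):
    """Score one Gaussian component under uncertainty decoding.

    feat_mean: enhanced feature vector (posterior mean of the clean features)
    feat_unc:  full uncertainty covariance matrix of the enhanced features
    comp_mean: mean of the Gaussian component
    comp_var:  diagonal variances of the component (assumption: diagonal model)
    """
    n = len(feat_mean)
    # Decoding rule: component covariance + feature uncertainty covariance.
    cov = [
        [feat_unc[i][j] + (comp_var[i] if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    return gaussian_logpdf(feat_mean, comp_mean, cov)
```

With a zero uncertainty matrix the function reduces to the usual Gaussian log-likelihood. With a nonzero uncertainty, a feature vector far from the component mean is penalized less, which is the intended effect of dynamically broadening the acoustic model.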


French title: Traitement de l’incertitude pour la reconnaissance de la parole robuste au bruit (Uncertainty handling for noise-robust speech recognition)
