Low latency and tight resources viseme recognition from speech using an artificial neural network

Nathan Souviraà-Labastie; Frédéric Bimbot

Report (Research Report) Year: 2013

Nathan Souviraà-Labastie

Role: Author

Speech and sound data modeling and processing

Frédéric Bimbot

Role: Author
PersonId: 830967

Speech and sound data modeling and processing

Abstract

We present a speech-driven, real-time viseme recognition system based on an artificial neural network (ANN). A Multi-Layer Perceptron (MLP) provides a light and responsive framework suited to the final application: animating the lips of an avatar on a multi-task platform, such as a set-top box, with embedded-resource and latency constraints. We study several improvements to this system, including training data selection, network size, training set size, and the choice of the acoustic unit to recognize. All variants are compared to a baseline system, and the combined improvements achieve a recognition rate of 64.3% for a set of 18 visemes and 70.8% for 9 visemes. We then propose a system that trades off recognition performance, resource requirements, and latency. A scalable variant is also described.

Domains

Sound [cs.SD]

Main file

RR-8338.pdf (409.58 KB)

Origin: Files produced by the author(s)

https://inria.hal.science/hal-00848629

Submitted on: Friday, July 26, 2013, 16:30:08

Last modified on: Friday, March 24, 2023, 14:52:57

Long-term archiving on: Sunday, October 27, 2013, 03:20:11

Dates and versions

hal-00848629, version 1 (26-07-2013)

Identifiers

HAL Id: hal-00848629, version 1

Cite

Nathan Souviraà-Labastie, Frédéric Bimbot. Low latency and tight resources viseme recognition from speech using an artificial neural network. [Research Report] RR-8338, INRIA. 2013. ⟨hal-00848629⟩

Collections

EC-PARIS UNIV-RENNES1 CNRS INRIA INSA-RENNES IRISA INRIA-RRRT IRISA-D5 INRIA2 LARA UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES INSA-GROUPE UR1-MATH-NUM
