Facial Expressions Recognition: Development and Application to HMI

Abstract. We present in this paper a facial expression recognition system used to command a mobile robot (Pioneer 3-DX). The proposed system mainly consists of two modules: facial expression recognition and robot command. The first module recognizes the facial expressions happiness, sadness, surprise, anger, fear, disgust and neutral, using a Gradient Vector Flow (GVF) snake to segment the regions of interest (ROI: mouth, eyes, eyebrows) from the FEEDTUM database (video files). The second module analyses the segmented ROI with Euclidean distance features (compatible with the MPEG-4 description of the six universal emotions) and a Time Delay Neural Network classifier. Finally, the recognized facial expressions are used as control commands for the mobile robot displacement (forward, backward, turn left, turn right) in ROS (Robot Operating System).


Introduction
Automatic analysis of facial expressions constitutes an important tool for research in human machine interaction (HMI). Autonomous robot control systems are complex systems that consist of a sensor, a decision-making control system and a motor drive system. The sensor can be either a visual system, a speech system, or a manual control system [1]. In [2], [3], numerous systems are proposed to control a robot or a wheelchair using head or face movement. Such systems involve body movement and are not suitable for people with extreme physical disabilities for whom head or face movement is difficult. Speech-controlled systems [4] are likewise not suitable for people with a speech disability. Thus, current research has focused on the design of systems that can solve these problems. The best alternative is to design a system where the command is derived from recognizing the user's facial expressions, such as happiness, sadness, surprise, anger and neutral.
Recently, many works have addressed the automatic localization of the face and its characteristic features. Indeed, the objective is to develop interactive systems capable of analyzing and interpreting the user's behavior. Several works have been carried out in this direction: the methods described in [5], [6] use models of facial movements, while [7] and [8] use classification by neural networks.
We previously developed a FER (Facial Expression Recognition) system for the static case (JAFFE database) with an MLP neural network trained by the gradient backpropagation algorithm. We used data extracted by a local method to model the facial expressions, introducing manually the geometric coordinates of 19 points characterizing the ROI, as in [9]. These extracted data are normalized so that the facial expressions remain invariant to changes of scale or slant of the individual's head. The performance of our recognition system was established on 4 facial expressions ("neutral", "joy", "surprise", "fear") among the seven universal expressions, because there is no large geometric difference between the neutral and fear expressions, nor between the joy and surprise expressions [10].
We then developed another FER system for the dynamic case (FEEDTUM database [11]) with a Time Delay Neural Network classifier, trained by the gradient backpropagation algorithm. In this system, we use the GVF snake and the Euclidean distance calculation (compatible with the MPEG-4 description of the six universal emotions) manually, and we interpret this analysis as articular movements of a manipulator arm robot, "Mentor 5-dof", within an HMI application [12].
In this article, we present our FER system (dynamic case) with improvements in the modeling, analysis and interpretation of facial expressions from image sequences, in order to command the mobile robot displacement (forward, backward, turn left, turn right) in ROS (Robot Operating System).

Facial Expression Modeling
Feature extraction is the crucial and complex part of any shape recognition system. It often uses results from statistics and estimation theory to obtain a transformation from the representation space to the interpretation space. The major problem of this part can be seen as the resolution of two sub-problems:
─ What measures must be made?
─ What features from the raw data should be used as input?
In our case, the input data are features of the frames modeling the facial expression. The facial expression is measured by the temporary non-rigid deformation (0.25-5 s) of the facial features (eyebrows, eyes, mouth). We have used the Gradient Vector Flow snake algorithm to extract this deformable, non-rigid measurement. However, this algorithm presents some limitations: it is only used on binary images, the initial snake is selected manually, and it extracts only one geometric object per image. We propose to modify the GVF algorithm to overcome these limitations. Our method is described in Section 2.1.
The FEEDTUM database was chosen because it is one of the most used databases in facial expression recognition. It consists of elicited spontaneous emotions of 18 subjects within the MPEG-4 emotion set, plus added neutrality.
Kass et al. [13] introduced active contours, or snakes. Since then, numerous variants of these deformable models have been studied for multiple applications. Their utility has been particularly well illustrated in medical imaging, but also in the electronic surveillance domain and in spatio-temporal tracking in video [14].
The GVF method was developed as a solution to some limitations of this approach, such as the initialization of the snake and its convergence toward concave regions [15].

GVF Field
The GVF method proceeds in two stages to calculate the GVF field:
- calculate the gradient of the image;
- calculate the gradient vector flow.

The classical snake model proposed by Kass et al. defines the active contour as a parametric curve, r(s) = (x(s), y(s)), that moves in the spatial domain until the energy functional in Eq. 1 reaches its minimum value:

E = ∫₀¹ [Eint(r(s)) + Eext(r(s))] ds.    (1)
Eint and Eext represent the internal and external energy, respectively. The internal energy enforces smoothness along the contour. A common internal energy function is defined as follows:

Eint(r(s)) = ½ (α |rˊ(s)|² + β |r˝(s)|²),    (2)

where α and β are weighting parameters, and rˊ and r˝ are the first and second derivatives of r(s) with respect to s. The first term, also known as tension energy, prevents the snake from collapsing onto isolated points. The second term, known as bending energy, prevents the contour from developing sharp angles. Constraints based on more complex shape models, such as Fourier descriptors, can also be incorporated.
The external energy is derived from the image, so that the snake is attracted to features of interest. Given a gray-level image I(x, y), a common external energy is defined as:

Eext(x, y) = −|∇(Gσ(x, y) ∗ I(x, y))|²,    (3)

where ∇ is the gradient operator, Gσ(x, y) is a 2D Gaussian kernel with standard deviation σ, and ∗ is the convolution operator. Minimizing the energy function of Eq. 1 results in solving the following associated Euler-Lagrange equation:

α r˝(s) − β r˝˝(s) − ∇Eext = 0.    (4)

This can be seen as a force balance equation:

Fint + Fext = 0,  with Fint = α r˝(s) − β r˝˝(s) and Fext = −∇Eext.    (5)

These equations can be solved using gradient descent by considering r(s) as a function of time, i.e. r(s, t). The partial derivative of r with respect to t is then:

∂r(s, t)/∂t = α r˝(s, t) − β r˝˝(s, t) − ∇Eext.    (6)

When the snake stabilizes, i.e. when an optimum is found, this partial derivative vanishes.
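The gradient-descent evolution of Eq. 6 can be sketched numerically as follows. This is a minimal illustration of ours, assuming a closed contour discretized into N points and an external force supplied by the caller; the default parameter values mirror those used later for our snakes, and the function name is hypothetical.

```python
import numpy as np

def evolve_snake(r, f_ext, alpha=0.2, beta=0.0, tau=2.0, iters=5):
    """Explicit gradient-descent update of a closed snake (Eq. 6).

    r     : (N, 2) array of contour points on a closed curve.
    f_ext : callable mapping an (N, 2) array of points to the external
            force sampled at those points, also of shape (N, 2).
    """
    for _ in range(iters):
        # Discrete 2nd and 4th derivatives along the closed contour.
        r2 = np.roll(r, -1, axis=0) - 2 * r + np.roll(r, 1, axis=0)
        r4 = (np.roll(r, -2, axis=0) - 4 * np.roll(r, -1, axis=0)
              + 6 * r - 4 * np.roll(r, 1, axis=0) + np.roll(r, 2, axis=0))
        # dr/dt = alpha * r'' - beta * r'''' + F_ext
        r = r + tau * (alpha * r2 - beta * r4 + f_ext(r))
    return r
```

With a zero external force, the tension term alone makes a circular contour shrink toward its centroid, which is a quick sanity check of the update direction.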

Gradient Vector Flow
The external force field defined in the previous section requires a good initialization [15], close to the object boundary, in order to segment the object. This limitation is caused by the nature of the external force field, whose vectors point towards the object only in the proximity of the object's boundary. As we move away from the boundary, the external field rapidly becomes zero, reducing the chances that a contour located in such regions will converge correctly. To overcome this problem, Xu and Prince [16] proposed another external force field v(x, y) = (u(x, y), v(x, y)). This vector field minimizes the following energy functional:

E = ∬ μ (uₓ² + u_y² + vₓ² + v_y²) + |∇f|² |v − ∇f|² dx dy,    (7)

where μ is a nonnegative parameter expressing the degree of smoothness of the field v, and where f is an edge map, e.g. f = |∇I|. The first term in Eq. 7 keeps the field v smooth, whereas the second term forces the field v to resemble the original edge force in the neighborhood of edges. This new external force is called the gradient vector flow (GVF) field. The GVF field can be found by solving the following associated Euler-Lagrange equations:

μ∇²u − (u − fₓ)(fₓ² + f_y²) = 0,
μ∇²v − (v − f_y)(fₓ² + f_y²) = 0,    (8)

where ∇² is the Laplacian operator.
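Eq. 8 is typically solved by iterating a diffusion-reaction update until convergence. The sketch below is our own illustration, not the paper's implementation; it assumes an edge map normalized to [0, 1] and uses wrap-around borders for brevity.

```python
import numpy as np

def gvf(f, mu=0.2, iters=80, dt=0.5):
    """Iterative solution of the GVF Euler-Lagrange equations (Eq. 8).

    f : 2D edge map (e.g. gradient magnitude of the image), values in [0, 1].
    Returns the GVF field components (u, v).
    """
    fy, fx = np.gradient(f)          # edge-map gradients
    b = fx ** 2 + fy ** 2            # (f_x^2 + f_y^2) reaction weight
    u, v = fx.copy(), fy.copy()      # initialise with the edge force
    for _ in range(iters):
        # 5-point Laplacian (wrap-around borders for simplicity).
        lap_u = (np.roll(u, 1, 0) + np.roll(u, -1, 0)
                 + np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4 * u)
        lap_v = (np.roll(v, 1, 0) + np.roll(v, -1, 0)
                 + np.roll(v, 1, 1) + np.roll(v, -1, 1) - 4 * v)
        # Diffuse the field away from edges, keep it close to the
        # edge force where the edge map is strong.
        u = u + dt * (mu * lap_u - (u - fx) * b)
        v = v + dt * (mu * lap_v - (v - fy) * b)
    return u, v
```

The diffusion term is what extends the capture range: after iterating, the field is nonzero well away from the edge, which is exactly the property that lets a distant snake converge.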

Semi-Automatic Initialization of GVF Snake
We have automated the initialization procedure of the snake with a geometric shape (an ellipse) that matches the shape of the region of interest (the eyes), in order to solve the problem of choosing the initial snake points (see Fig. 1).
Fig. 1. Geometric shape of the initial snake.
The initial snake points are placed on the ellipse

x(t) = x0 + a·cos(t),  y(t) = y0 + b·sin(t),

where (x0, y0) are the coordinates of the region-of-interest center, (x, y) are the initial snake coordinates, a and b are respectively the width and the height of the ellipse, and t represents the displacement step of the snake. We modified the initial snake so that it is initialized automatically (see Fig. 2. a), with the GVF snake parameters α = 0.2, β = 0, γ = 1, τ = 2, iterMax = 5. The snake converged toward the concavity zone within 250 iterations, from a snake initialized automatically without any user intervention.
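This ellipse-based initialization can be sketched as follows. The function name is ours, and a and b are treated here as semi-axes of the ellipse, which is one plausible reading of "width and height".

```python
import numpy as np

def init_snake_ellipse(x0, y0, a, b, step=0.1):
    """Initial snake as an ellipse centred on the ROI (e.g. an eye).

    (x0, y0) : ROI centre coordinates.
    a, b     : horizontal and vertical semi-axes of the ellipse.
    step     : angular step t between consecutive snake points.
    """
    t = np.arange(0.0, 2 * np.pi, step)
    x = x0 + a * np.cos(t)
    y = y0 + b * np.sin(t)
    return np.stack([x, y], axis=1)   # (N, 2) initial snake points
```

A smaller step produces more initial points, which directly addresses the "number of initial snake points" problem mentioned above.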

Superposition of the GVF Field and the GVF Snake
A second idea that proved very interesting is to superpose the GVF snake and the GVF field, as shown in (Fig. 3. a), in order to understand why the snake converges toward this result. We notice that the snake of (Fig. 3. c) converged well toward the ROI in the GVF sense, because it perfectly follows the zone of greatest GVF potential.
We implemented two strategies of attribute extraction by GVF snake from video frames of the FEEDTUM database: spatio-temporal extraction and temporal extraction.

Spatio-Temporal Extraction Of The Attributes
The spatio-temporal extraction consists of a spatial extraction of the eye contours by GVF snake, for each frame independently of the others (see Fig. 4). The result of the extraction is the set of results over the whole frame sequence, which we created by gathering pictures of three expressions (neutral, disgust, joy) with the software Ulead MediaStudio Pro 7.0, in order to test the performance of the GVF snake.

Temporal Extraction of the Attributes
This extraction consists of initializing the snake on the first frame; the resulting contour of frame t is then used as the initialization of the contour of frame t + 1 (see Fig. 5). The GVF force should range as far as the ROI can move between two subsequent frames.
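The temporal strategy can be sketched as a simple propagation loop. Here `gvf_snake` merely stands in for our GVF snake routine (it is a placeholder that returns its input unchanged); only the seeding of frame t + 1 by the result of frame t is illustrated.

```python
import numpy as np

def gvf_snake(frame, init_contour, iters=250):
    """Placeholder for the GVF snake: in the real system this would
    converge the contour on the given frame. Here it is a no-op."""
    return init_contour

def temporal_extraction(frames, init_contour):
    """Temporal strategy: the contour converged on frame t initialises
    the snake on frame t + 1."""
    contours = []
    contour = init_contour
    for frame in frames:
        contour = gvf_snake(frame, contour)   # result of t seeds t + 1
        contours.append(contour.copy())
    return contours
```

Because only the first frame needs an explicit initialization, this strategy is cheaper than re-initializing on every frame, provided the ROI stays within the GVF capture range between frames.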

Application
We applied our GVF snake algorithm using the spatio-temporal extraction strategy described above. As seen in (Fig. 6), we initialized three snakes to segment the regions of the eyes, the eyebrows and the mouth. These snakes progress at the same time and with the same parameters. Let us note that it was very difficult to find adequate regulating parameters for the three snakes.

Modeling the ROI by MPEG-4 Description
We exploited the information present in the skeletons of the sequences of pictures segmented by the GVF snake (Fig. 7). We took the work in [17] as a basis, following the MPEG-4 description of the six universal expressions (expressions → translation of these descriptions) (see Fig. 8. a). The new origin is defined as

xO′ = (xr + xl) / 2,  yO′ = (yr + yl) / 2,

where xO′ and yO′ are the coordinates of the new origin, and (xr, yr), (xl, yl) are respectively the coordinates of the right and left eyes (see Fig. 9), while (xi, yi) are the coordinates of the points of interest in the old landmark (O, X, Y). The facial expressions thus remain invariant to changes of scale or slant of the individual's head.
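The change of axes can be sketched as below. This is our own illustration under stated assumptions: the new origin O′ is the midpoint of the two eye centres, Uh points along the inter-ocular axis, Uv is perpendicular to it, and coordinates are divided by the inter-ocular distance; the function and argument names are ours.

```python
import numpy as np

def normalize_points(points, right_eye, left_eye):
    """Express facial points of interest in the face-linked frame (O', Uh, Uv).

    O' is the midpoint of the eye centres; Uh follows the inter-ocular
    axis; all coordinates are divided by the inter-ocular distance, so
    the features are invariant to scale and head slant.
    """
    o = (np.asarray(right_eye) + np.asarray(left_eye)) / 2.0
    d = np.asarray(left_eye) - np.asarray(right_eye)
    scale = np.linalg.norm(d)
    uh = d / scale                      # horizontal unit vector Uh
    uv = np.array([-uh[1], uh[0]])      # vertical unit vector Uv
    p = np.asarray(points, dtype=float) - o
    return np.stack([p @ uh, p @ uv], axis=1) / scale
```

Doubling all input coordinates leaves the output unchanged, which is the scale-invariance property the normalization is meant to provide.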

Analysis Of Distances Features
To recognize the facial expressions from video sequences, a time delay neural network classifier is used, with the 5 distance features in the input layer and the 7 expressions in the output layer, trained by the gradient backpropagation algorithm (with 90% correct recognition). Once a facial expression is recognized, the corresponding control signal is sent to a real mobile robot (Pioneer 3-DX) through a serial port via the Robot Operating System (which establishes the communication), making the robot move forward, backward, left or right.
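One way to arrange the 5 distance features of a sequence for a time delay network is to stack consecutive feature vectors into delayed input windows. The sketch below is an assumption about the input layout, not the paper's exact architecture; the window length `delay` is illustrative.

```python
import numpy as np

def time_delay_windows(features, delay=2):
    """Build TDNN inputs by stacking `delay` consecutive feature vectors.

    features : (T, 5) array, one 5-distance vector per frame.
    Returns a (T - delay + 1, delay * 5) array of delayed inputs, each
    row exposing the temporal evolution of the distances to the network.
    """
    T, n = features.shape
    return np.stack([features[t:t + delay].reshape(-1)
                     for t in range(T - delay + 1)])
```

For a 4-frame sequence with delay 2, this yields 3 overlapping windows of 10 values each, letting the classifier see how the distances evolve over time rather than a single static snapshot.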

The control signals are derived from a sequence of the distance features that defines a displacement scenario of the robot corresponding to the recognized facial expressions, where Ti denotes threshold values deduced from analyzing the evolution of the distance features over all recognized facial expressions of an individual in the database (see Fig. 10). An example of a scenario that takes place in the Intelligent Systems Research Laboratory (LARESI) is shown in (Fig. 11): forward → forward → turn right → forward → turn left → forward, corresponding to: D5, D5, D1, D5, D2, D5, D5. We verified that the robot reacts to all these commands successfully, with 100% accuracy.
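The expression-to-command step can be sketched as a lookup from a recognized expression to a (linear, angular) velocity pair. The particular pairing and velocity values below are hypothetical, not the mapping used in the paper; the ROS publishing side is indicated in comments only.

```python
# Hypothetical mapping from recognized expressions to Pioneer 3-DX
# velocity commands (linear m/s, angular rad/s); the actual pairing
# used in the paper is not reproduced here.
COMMANDS = {
    "joy":      (0.2, 0.0),    # forward
    "sadness":  (-0.2, 0.0),   # backward
    "surprise": (0.0, 0.5),    # turn left
    "anger":    (0.0, -0.5),   # turn right
    "neutral":  (0.0, 0.0),    # stop
}

def expression_to_command(expr):
    """Translate a recognized expression into a (linear, angular)
    velocity pair; unknown expressions stop the robot."""
    return COMMANDS.get(expr, (0.0, 0.0))

# In ROS, the pair would then be published on /cmd_vel, e.g.:
#   import rospy
#   from geometry_msgs.msg import Twist
#   pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)
#   msg = Twist()
#   msg.linear.x, msg.angular.z = expression_to_command("joy")
#   pub.publish(msg)
```

Defaulting unknown expressions to a stop command is a safety choice: a misclassification then halts the robot rather than driving it.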

Conclusion
In this paper, the command of a mobile robot using facial expressions is the second application that we have achieved in the setting of human machine interaction, intended especially for people with reduced mobility (motor handicap) who have only face movements to express their intents. We conceived a dynamic facial expression model using active contour segmentation (GVF snake), following a temporal extraction of the regions of interest. We then obtained the 5 distance features by applying the MPEG-4 norm automatically to the result of the extraction. These distance features are sent as command signals via ROS to a mobile robot. The results are very encouraging for achieving this application in the real world from live video, in systems such as intelligent wheelchairs, human computer interaction and security systems.

Fig. 6. Extraction of the ROI of the FEEDTUM database by GVF snake.
We automatically extract points of interest from the segmented ROI (see Fig. 8. a) to calculate the 5 Euclidean distances (see Fig. 8. b).

Fig. 8. a) The facial expression model by MPEG-4 description. b) Automatic extraction of points of interest from the GVF snake results.

The neutral expression is represented by five distances taken as reference. The other expressions are described by comparing their distances to those of the neutral expression. Thus every facial expression (in our case, one video sequence = 4 frames) is represented by four normalized feature vectors. We applied a normalization procedure: the change of axes from (O, X, Y), linked to the image, to (O′, Uh, Uv), linked to the face.

Fig. 11. Mobile robot command by the facial expression recognition system.