As stated before, the perceived emotion from instantaneous facial expressions is modeled as the displacement of a particle over a surface. At every moment, the particle is subject to velocity changes proportional to the probability of each emotion currently detected by the raw sensors.
The particle's instantaneous velocity is given by Eq. (1).
$$\begin{aligned} \overrightarrow{V}_\mathrm{p} =\overrightarrow{V}_\mathrm{s} +\sum \limits _{a=1}^N \overrightarrow{V}_\mathrm{a} , \end{aligned}$$
(1)
where \(\overrightarrow{V}_\mathrm{p}\) is the particle velocity, \(\overrightarrow{V}_\mathrm{s}\) is the sliding velocity, parallel to the DES gradient at the current position, and \(\overrightarrow{V}_\mathrm{a}\) is the velocity towards each attractor, always tangent to the DES.
Consider, as an example, the two-dimensional case where the detectable emotions are Happiness and Sadness, shown in Fig. 4.
The example demonstrates some key aspects of the DES. The attractors for Happiness and Sadness are placed at (\(\infty\), 0) and (\(-\infty\), 0), respectively. When the raw sensor detects some probability or intensity of an emotion, this signal is interpreted as a velocity along the trajectory towards the corresponding attractor, and the particle moves along the emotional curve. In the absence of emotional facial expressions, the particle slides to the local minimum. In this example, one may infer the emotional state of the speaker by observing the position of the particle along the \(X\) axis.
The DES concept extends this example by defining a surface, or even a hypersurface, over which attractors representing the modeled emotions are placed. The relationship between a particle's position and the emotional classification is also defined. The idea of an emotional surface, as shown in Fig. 5 [11, 24], has been proposed by psychologists to discuss someone's internal (appraised) emotional state trajectories; in this paper, it is used to detect the overall perceived emotion during a man-machine interaction.
The DES concept also differs from Zeeman's model in that it presents the emotions as attractors positioned on the XY plane instead of attributing them to the axes themselves.
A DES in a 3D space is defined by Eq. (2).
$$\begin{aligned} \gamma ( {x,y})=(x, y, f(x,y)) \end{aligned}$$
(2)
The velocity in the direction of each attractor, \(\overrightarrow{V}_\mathrm{a}\), is proportional to the probability of each emotion as detected by existing software, such as eMotion, and is tangent to the surface. It is defined by Eq. (3).
$$\begin{aligned} \overrightarrow{V}_\mathrm{a} =F_\mathrm{a} \frac{\nabla \gamma ( {x,y})}{\vert \nabla \gamma ({x,y})\vert } \end{aligned}$$
(3)
where \(F_\mathrm{a}\) is the filtered signal associated with the attractor’s emotion.
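To make the particle dynamics concrete, the following sketch evaluates Eq. (1) in the XY plane for a hypothetical two-emotion DES. The bowl-shaped surface, the sliding constant `k_slide`, and the finite attractor positions (standing in for \((\pm \infty, 0)\) in the example of Fig. 4) are illustrative assumptions, and the tangent velocities of Eq. (3) are simplified to planar unit directions scaled by the filtered signals \(F_\mathrm{a}\).

```python
import numpy as np

def grad_f(p, f, eps=1e-5):
    """Central-difference gradient of the height function f at p = (x, y)."""
    x, y = p
    return np.array([(f(x + eps, y) - f(x - eps, y)) / (2 * eps),
                     (f(x, y + eps) - f(x, y - eps)) / (2 * eps)])

def particle_velocity(p, f, attractors, F, k_slide=1.0):
    """Eq. (1): sliding velocity plus one velocity term per attractor."""
    v = -k_slide * grad_f(p, f)              # slide downhill towards the local minimum
    for a, Fa in zip(attractors, F):
        d = np.asarray(a) - p                # direction towards the attractor
        v += Fa * d / np.linalg.norm(d)      # speed proportional to the filtered signal
    return v

# Illustrative setup: Happiness and Sadness attractors on a bowl-shaped DES.
f = lambda x, y: 0.1 * (x ** 2 + y ** 2)
attractors = [np.array([10.0, 0.0]), np.array([-10.0, 0.0])]
print(particle_velocity(np.array([1.0, 0.5]), f, attractors, F=[0.6, 0.1]))
```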
It should be noted that the frame-by-frame approach used by the raw sensors does not take into account the continuous natural facial movements and the transitions between expressions. As shown in Fig. 3, a filtering process is applied to raw sensor outputs prior to DES calculations.
The analysis of multimodal realistic videos must account for different noise sources in both the process and its observation. Unexpected camera and head motions, face deformation due to speech, CCD performance, and minor light-source variations result in intrinsically noisy data. Moreover, low-pass filtering is necessary because the slowly varying conveyed emotions are the signal to be detected. Both Kalman filtering and moving-average filtering were tested, as presented in Sect. 5.3.
Due to these requirements, a Kalman filter is a natural candidate. Kalman filtering is a well-established technique for linear systems subject to zero-mean Gaussian noise in both the process and the sensorial acquisition. There is no empirical evidence to support these hypotheses for the problem of emotional expression analysis. However, given the complexity and apparent randomness of the movements, muscular facial deformations due to speech, and light variations present in the scene, the Gaussian assumption was adopted; the rationale is the central limit theorem. Filtering convergence during the experiments gave further support to this assumption.
The use of Kalman filters requires the selection of an underlying linear model for the update phase. It is proposed that a well-tuned first-order system, as in Eqs. (4) and (5), doubles as the filter's internal update mechanism and low-pass filter. The filter output for each emotion is denoted \(F_\mathrm{a}\) and used in Eq. (3).
$$\begin{aligned}&\dot{x}_{s} =0,\end{aligned}$$
(4)
$$\begin{aligned}&F_\mathrm{a} =y=\frac{Kx_{s} }{\tau }, \end{aligned}$$
(5)
where \(K\) System’s gain, \(\tau \) System’s time constant, \(x_{s}\) State variable, \(y\) Filter output.
The Kalman filtering equations are thus written as follows:
Predict:
$$\begin{aligned}&x_{s,t} =x_{s,\,t-1} ,\end{aligned}$$
(6)
$$\begin{aligned}&p=p+\frac{w}{\tau ^2}, \end{aligned}$$
(7)
where \(x_{s,t}\) is the current estimate of \(x_{s}\), \(x_{s,\,t-1}\) is the estimate at the previous instant, \(w\) is the covariance of the process noise, \(N(0,w)\), and \(p\) is the covariance of \(x_{s,t}\), \(N(0,\,p)\).
Update:
$$\begin{aligned}&m=\frac{\frac{pK}{\tau }}{p\left({\frac{K}{\tau }}\right)^2+v},\end{aligned}$$
(8)
$$\begin{aligned}&x_{s,t} =x_{s,t} +m ( {r_t -y_t }),\end{aligned}$$
(9)
$$\begin{aligned}&p=\left({1-\frac{mK}{\tau }}\right) p, \end{aligned}$$
(10)
where \(m\) is the Kalman gain, \(v\) is the covariance of the observation noise, \(N(0,v)\), \(r_t\) is the current reading from the facial expression analysis software, and \(y_t\) is the current filter output.
The estimation process has two steps. First, the filter runs the prediction using a proper time step. If raw sensor information is available for that timestamp, the update phase is run. Note that the state variable \(x_s\) is only an internal calculated value: the proposed filtering relies solely on readings from the facial expression analysis software to estimate the internal state of the system.
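As a minimal illustration of Eqs. (4)-(10), the following sketch implements the scalar predict/update recursion for one emotion channel. The class name and the constants are placeholders of our own (\(w\) and \(v\) are the parameters the paper tunes by simulated annealing, as described next), not values from the paper.

```python
class EmotionChannelFilter:
    """Scalar Kalman filter for one emotion channel, following Eqs. (4)-(10)."""

    def __init__(self, K=1.0, tau=0.5, w=1e-3, v=1e-2):
        self.K, self.tau = K, tau    # first-order system gain and time constant
        self.w, self.v = w, v        # process / observation noise covariances
        self.x = 0.0                 # internal state x_s
        self.p = 1.0                 # covariance of x_s

    def predict(self):
        # Eq. (6): the state is kept constant between readings.
        # Eq. (7): the covariance grows with the process noise.
        self.p += self.w / self.tau ** 2

    def update(self, r):
        h = self.K / self.tau                        # output map of Eq. (5)
        m = self.p * h / (self.p * h ** 2 + self.v)  # Eq. (8): Kalman gain
        self.x += m * (r - h * self.x)               # Eq. (9): correct with the residual
        self.p *= 1.0 - m * h                        # Eq. (10): covariance update
        return h * self.x                            # filtered signal F_a

# Usage: smooth a noisy raw-sensor stream for a single emotion.
kf = EmotionChannelFilter()
for r in [0.2, 0.8, 0.7, 0.9, 0.6]:
    kf.predict()
    print(kf.update(r))
```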
Lastly, we propose a simulation–optimization heuristic to tune the filters' \(w\) and \(v\) parameters. It employs Simulated Annealing (SA) to find a set of parameters that minimizes an energy function related to the classification error. The simulation phase comprises a round of video analysis with the currently proposed parameters and is used to calculate a global energy value; the optimization phase is discussed below.
We define vectors for the process noise \((Q_{n})\) and the observation noise \((R_{n})\) as follows:
$$\begin{aligned} Q_{n}&= \left[ {w_\mathrm{happiness},w_\mathrm{sadness} ,w_\mathrm{anger} ,\,w_\mathrm{fear} } \right],\end{aligned}$$
(11)
$$\begin{aligned} R_{n}&= \left[ {v_\mathrm{happiness},v_\mathrm{sadness} ,v_\mathrm{anger} ,v_\mathrm{fear} } \right]. \end{aligned}$$
(12)
We then define a starting temperature \((T_0)\) and a cooling constant \(K_t <1\), so that the temperature decreases geometrically:
$$\begin{aligned} T_{n+1} =K_t T_n . \end{aligned}$$
(13)
The process iterates until the system's temperature reaches room temperature \((T_\mathrm{room})\). The number of steps is given by Eq. (14):
$$\begin{aligned} N_\mathrm{SA\,steps} =\mathrm{ceil}\left({\log _{K_t } \frac{T_\mathrm{room} }{T_0 }}\right). \end{aligned}$$
(14)
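For instance, with illustrative values \(T_0 = 100\), \(T_\mathrm{room} = 1\), and \(K_t = 0.95\), Eq. (14) gives \(N_\mathrm{SA\,steps} = \mathrm{ceil}(\log_{0.95} 0.01) = \mathrm{ceil}(89.8) = 90\) cooling steps.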
For each video, the emotional particle's trajectory is divided into two halves. The energy \((E_i)\) is calculated as the number of points in the latter half that fall outside the sector of the video's nominal classification. A global energy measure is defined by Eq. (15).
$$\begin{aligned} E_{g, n} =\sum \limits _{i=1}^{N_\mathrm{videos}} E_{i, n} . \end{aligned}$$
(15)
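For example, if three tuning videos leave \(E_{1,n}=4\), \(E_{2,n}=0\), and \(E_{3,n}=7\) points outside their nominal sectors, then \(E_{g,n}=11\) (values chosen only for illustration).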
The system then randomly generates neighbor parameter vectors \(Q_{n+1}\) and \(R_{n+1}\), reanalyzes the tuning videos, and obtains \(E_{g, n+1}\). The probability of accepting the new parameters as a solution is given by the Metropolis criterion:
$$\begin{aligned} P_{\mathrm{Acceptance}} =\min \left\{ {1,\; \mathrm{e}^{\frac{E_{g, n} -E_{g,n+1} }{T_{n+1} }}} \right\} . \end{aligned}$$
(16)
These steps are summarized in Algorithm 1.
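For reference, a compact sketch of the tuning loop is given below. The neighbor-generation rule (multiplicative random perturbations), the step size, and the default temperatures are our own assumptions; `E` stands in for the simulation phase, i.e., a routine that reanalyzes the tuning videos with candidate parameters and returns the global energy of Eq. (15).

```python
import math
import random

def anneal(E, Q0, R0, T0=100.0, T_room=1.0, Kt=0.95, step=0.2):
    """Simulated-annealing tuning of the noise vectors Q_n and R_n (Eqs. 11-16)."""
    q, r, T = list(Q0), list(R0), T0
    e = E(q, r)                                   # energy of the starting parameters
    while T > T_room:                             # iterate until room temperature
        T *= Kt                                   # geometric cooling, Eq. (13)
        # random positive neighbors of the current parameter vectors
        qn = [max(1e-9, x * (1 + random.uniform(-step, step))) for x in q]
        rn = [max(1e-9, x * (1 + random.uniform(-step, step))) for x in r]
        en = E(qn, rn)                            # simulation phase: reanalyze videos
        # Metropolis acceptance criterion, Eq. (16)
        if en <= e or random.random() < math.exp((e - en) / T):
            q, r, e = qn, rn, en
    return q, r, e
```

A call such as `anneal(reanalyze_videos, Q0, R0)`, where `reanalyze_videos` is a hypothetical routine implementing the video-analysis round, returns the tuned vectors and their final energy.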