Sensors (Basel). 2020 Aug 21;20(17):4727. doi: 10.3390/s20174727.

End-to-End Training for Compound Expression Recognition


Hongfei Li et al. Sensors (Basel). 2020.

Abstract

Expression has long been a point of pride for human beings; it is an essential difference between us and machines. As computers have developed, we have become increasingly eager to build communication between humans and machines, especially communication that carries emotion. The emotional growth of a computer resembles the growth process of each of us: it begins with natural, intimate, and vivid interaction through observing and discerning emotions. Since the basic emotions (angry, disgusted, fearful, happy, neutral, sad, and surprised) were put forward, much research has been based on them, but little has addressed compound emotions. In real life, however, people's emotions are complex, and single expressions cannot fully and accurately reflect inner emotional changes; exploring compound expression recognition is therefore essential to daily life. In this paper, we propose a scheme that combines spatial and frequency-domain transforms to implement end-to-end joint training, ensembling models that learn appearance and geometric representations, for the recognition of compound expressions in the wild. We focus on mining appearance and geometric information with deep learning models. For appearance feature acquisition, we adopt transfer learning, introducing a ResNet50 model pretrained on VGGFace2 for face recognition and fine-tuning it. We try and compare two ideas: in one, we fine-tune using two static expression databases, FER2013 and RAF Basic, for basic emotion recognition; in the other, we fine-tune the model on three input channels composed of images generated by the DWT2 and WAVEDEC2 wavelet transforms based on the rbio3.1 and sym1 wavelet bases, respectively. For geometric feature acquisition, we first introduce a dense SIFT operator to extract facial key points and their histogram descriptors. We then introduce a deep SAE with a softmax function, a stacked LSTM, and a Sequence-to-Sequence model with stacked LSTM, defining their structures ourselves, and feed the salient key points and their descriptors into the three models to train them separately and compare their performance. Once the models for appearance and geometric feature learning are trained, we combine them with the category labels for further end-to-end joint training, since ensembling models that describe different information can further improve recognition results. Finally, we validate the proposed framework on the RAF Compound database and achieve a recognition rate of 66.97%. Experiments show that integrating models that express different information and training them end to end can quickly and effectively improve recognition performance.
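As a rough illustration of the frequency-domain channel construction described above, the following Python sketch uses PyWavelets to decompose an aligned grayscale face with DWT2 and stack wavelet sub-bands as network input channels. The channel composition, resizing, and normalization choices here are assumptions for illustration, not the authors' exact pipeline.

```python
# A minimal sketch, assuming aligned grayscale face crops and a 224x224
# network input; the choice of which sub-bands form the three channels
# is an assumption, not the authors' exact configuration.
import numpy as np
import pywt
import cv2

def frequency_channels(gray_face, wavelet="rbio3.1"):
    """Build a 3-channel image from 2-D wavelet coefficients.

    gray_face: 2-D array (an aligned face in grayscale).
    wavelet:   wavelet basis, e.g. "rbio3.1" or "sym1".
    """
    # Single-level 2-D DWT: approximation + (horizontal, vertical, diagonal) details.
    cA, (cH, cV, cD) = pywt.dwt2(gray_face.astype(np.float32), wavelet)

    # Resize sub-bands back to the network input size and stack them as channels.
    size = (224, 224)  # assumed ResNet50 input size
    stacked = np.stack([cv2.resize(c, size) for c in (cA, cH, cV)], axis=-1)

    # Normalize each channel to [0, 255] for use as a pseudo-RGB input.
    stacked -= stacked.min(axis=(0, 1), keepdims=True)
    stacked /= np.maximum(stacked.max(axis=(0, 1), keepdims=True), 1e-6)
    return (stacked * 255).astype(np.uint8)

# A multi-level decomposition (WAVEDEC2) can be obtained the same way:
# coeffs = pywt.wavedec2(gray_face, "sym1", level=2)
# cA2, (cH2, cV2, cD2), (cH1, cV1, cD1) = coeffs
```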
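The appearance branch relies on fine-tuning a ResNet50 pretrained on VGGFace2. A minimal PyTorch sketch of that transfer-learning step is shown below; the checkpoint path `resnet50_vggface2.pth`, the class count, and the hyperparameters are illustrative assumptions rather than the authors' settings.

```python
# A minimal fine-tuning sketch, assuming a VGGFace2-pretrained ResNet50
# checkpoint is available locally; path and hyperparameters are hypothetical.
import torch
import torch.nn as nn
from torchvision.models import resnet50

NUM_CLASSES = 11  # assumed number of RAF Compound categories; adjust to the split used

model = resnet50(weights=None)
state = torch.load("resnet50_vggface2.pth", map_location="cpu")  # assumed checkpoint
model.load_state_dict(state, strict=False)  # ignore the original face-ID classifier head
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # new expression classifier

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def fine_tune_step(images, labels):
    """One fine-tuning step on a batch of (3, 224, 224) face tensors."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```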
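For the end-to-end joint training of the ensembled appearance and geometric models, one plausible reading is feature-level fusion with a shared classifier trained on the compound labels. The sketch below assumes both branches output fixed-size feature vectors; the fusion strategy, feature dimensions, and class count are assumptions, not the authors' exact design.

```python
# A minimal fusion sketch, assuming both branches are already built as
# feature extractors; dimensions and fusion-by-concatenation are assumptions.
import torch
import torch.nn as nn

class JointModel(nn.Module):
    def __init__(self, appearance_net, geometric_net,
                 app_dim=2048, geo_dim=256, num_classes=11):
        super().__init__()
        self.appearance_net = appearance_net  # e.g. ResNet50 backbone with its fc removed
        self.geometric_net = geometric_net    # e.g. stacked LSTM over dense-SIFT descriptors
        self.classifier = nn.Linear(app_dim + geo_dim, num_classes)

    def forward(self, images, keypoint_descriptors):
        app_feat = self.appearance_net(images)               # (B, app_dim)
        geo_feat = self.geometric_net(keypoint_descriptors)  # (B, geo_dim)
        fused = torch.cat([app_feat, geo_feat], dim=1)
        return self.classifier(fused)

# Both branches receive gradients from the shared classification loss, so the
# ensemble is trained end to end with the compound expression labels.
```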

Keywords: Sequence-to-Sequence; appearance feature; compound expression; deep SAE; end-to-end; frequency domain transform; geometric feature; joint training; model ensembling; stacked LSTM.


Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
The diagram for our recommended scheme. Copyright reference: http://www.whdeng.cn/raf/model1.html#dataset.
Figure 2
The samples of face detection and alignment based on Multi-task Convolutional Neural Network (MTCNN) in FER2013, Real-world Affective Faces (RAF) Basic and RAF Compound. The upper row presents samples in FER2013; the middle row presents samples in RAF Basic; the lower row presents samples in RAF Compound. Copyright reference for FER2013: https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge/data; copyright reference for RAF database: http://www.whdeng.cn/raf/model1.html#dataset.
Figure 3
The effect of the wavelet transform compared with the Fourier transform. The upper row shows the effect of the Fourier transform; the lower row shows the effect of the wavelet transform.
Figure 4
The illumination-normalized expression samples of the frequency domain transform based on DWT2 in RAF Compound: approximation component, horizontal detail component, vertical detail component, and diagonal detail component, from left to right. Copyright reference: http://www.whdeng.cn/raf/model1.html#dataset.
Figure 5
The illumination normalized expression samples of frequency domain transform based on WAVEDEC2 in RAF Compound. The upper row denotes illumination normalized expression samples of WAVEDEC2 transform based on sym1; the lower row denotes illumination normalized expression samples of WAVEDEC2 transform based on rbio3.1. Copyright reference: http://www.whdeng.cn/raf/model1.html#dataset.
Figure 6
The key structure of ResNet.
Figure 7
The main body of the simple AutoEncoder (AE).
Figure 8
The algorithm framework in Sequence-to-Sequence.
Figure 9
The encoder of the deep Stacked AutoEncoder (SAE) with a softmax classifier.
Figure 10
The confusion matrix of the RAF Compound test based on the input channels composed of spatial images.
Figure 11
The confusion matrix of the RAF Compound test based on the input channels composed of the combination of spatial and frequency domain images under wavelet base rbio3.1.
Figure 12
The confusion matrix of the RAF Compound test based on the input channels composed of the combination of spatial and frequency domain images under wavelet base sym1.
Figure 13
The confusion matrix of the RAF Compound test based on frequency domain images under rbio3.1.
Figure 14
The confusion matrix on the RAF Compound test set for the model ensembling of ResNet50 and DSAE+Softmax (left); the confusion matrix on the RAF Compound test set for the model ensembling of ResNet50 and BLSTM (right).
Figure 15
The confusion matrix on the RAF Compound test set for the model ensembling of ResNet50 and Seq-to-Seq with LSTM+Softmax (left); the confusion matrix on the RAF Compound test set for the model ensembling of ResNet50 and Seq-to-Seq with BLSTM+Softmax (right).
