A Brief Review of Facial Emotion Recognition Based on Visual Information

Byoung Chul Ko

Sensors (Basel). 2018 Jan 30;18(2):401. doi: 10.3390/s18020401.
Abstract

Facial emotion recognition (FER) is an important topic in the fields of computer vision and artificial intelligence owing to its significant academic and commercial potential. Although FER can be conducted using multiple sensors, this review focuses on studies that exclusively use facial images, because visual expressions are one of the main information channels in interpersonal communication. This paper provides a brief review of research in the field of FER conducted over the past decades. First, conventional FER approaches are described, along with a summary of the representative categories of FER systems and their main algorithms. Deep-learning-based FER approaches using deep networks that enable "end-to-end" learning are then presented. This review also focuses on an up-to-date hybrid deep-learning approach that combines a convolutional neural network (CNN) for the spatial features of individual frames with long short-term memory (LSTM) for the temporal features of consecutive frames. In the later part of this paper, a brief review of publicly available evaluation metrics is given, and a comparison with benchmark results, which serve as a standard for the quantitative comparison of FER studies, is described. This review can serve as a brief guidebook for newcomers to the field of FER, providing basic knowledge and a general understanding of the latest state-of-the-art studies, as well as for experienced researchers looking for productive directions for future work.

Keywords: conventional FER; convolutional neural networks; deep-learning-based FER; facial action coding system; facial action unit; facial emotion recognition; long short-term memory.


Conflict of interest statement

The author declares no conflict of interest.

Figures

Figure 1
Procedure used in conventional FER approaches: from the input images (a), the face region and facial landmarks are detected (b), spatial and temporal features are extracted from the face components and landmarks (c), and the facial expression is determined as one of the facial categories using pre-trained pattern classifiers (d) (face images are taken from the CK+ dataset [10]).
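As a concrete illustration of steps (b) through (d), the following minimal Python sketch detects the face region with an OpenCV Haar cascade, extracts hand-crafted HOG features from the cropped face (standing in for the landmark- and component-based features of step (c)), and passes them to a pre-trained SVM classifier. The feature choice, parameter values, and training data are illustrative assumptions, not the specific method of any study surveyed in the paper.

# Illustrative conventional FER pipeline (hypothetical parameters and data):
# (b) detect the face region, (c) extract hand-crafted spatial features,
# (d) classify with a pre-trained pattern classifier.
import cv2
from skimage.feature import hog
from sklearn.svm import SVC

# (b) Face detection with OpenCV's bundled frontal-face Haar cascade.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_features(gray_image):
    """Detect the largest face and return HOG features of the cropped region."""
    faces = cascade.detectMultiScale(gray_image, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep largest detection
    face = cv2.resize(gray_image[y:y + h, x:x + w], (64, 64))
    # (c) Hand-crafted spatial features (HOG here; LBP, Gabor filters, and
    # landmark geometry are common alternatives in the surveyed literature).
    return hog(face, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

# (d) A pre-trained pattern classifier; a linear SVM is one typical choice.
# X_train / y_train are assumed to hold features and emotion labels (e.g. CK+):
# clf = SVC(kernel="linear").fit(X_train, y_train)
# emotion = clf.predict([extract_features(gray_frame)])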
Figure 2
Procedure of CNN-based FER approaches: (a) the input images are convolved with filters in the convolution layers; (b) from the convolution results, feature maps are constructed, and max-pooling (subsampling) layers lower their spatial resolution; (c) fully connected neural-network layers are applied behind the convolutional layers; and (d) a single facial expression is recognized from the softmax output (face images are taken from the CK+ dataset [10]).
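A minimal PyTorch sketch of this procedure is given below. The layer sizes, the 48x48 grayscale input, and the seven emotion classes are illustrative assumptions, not an architecture prescribed by the review.

# Minimal CNN for FER in PyTorch, mirroring steps (a)-(d) of Figure 2.
import torch
import torch.nn as nn

class FERCNN(nn.Module):
    def __init__(self, num_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # (a) convolution
            nn.ReLU(),
            nn.MaxPool2d(2),                              # (b) max-pooling: 48 -> 24
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 24 -> 12
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 12 * 12, 128),                 # (c) fully connected layers
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = FERCNN()
logits = model(torch.randn(1, 1, 48, 48))   # one grayscale face crop
probs = torch.softmax(logits, dim=1)        # (d) softmax over emotion classes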
Figure 3
Examples of various facial emotions and AUs: (a) basic emotions (sad, fearful, and angry) (face images are taken from the CE dataset [17]); (b) compound emotions (happily surprised, happily disgusted, and sadly fearful) (face images are taken from the CE dataset [17]); (c) spontaneous expressions (face images are taken from YouTube); and (d) AUs of the upper and lower face (face images are taken from the CK+ dataset [10]).
Figure 4
The basic structure of an LSTM, adapted from [50]. (a) One LSTM cell contains four interacting layers: the cell state, an input gate layer, a forget gate layer, and an output gate layer. (b) The repeating module of cells in an LSTM.
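For reference, the interactions in (a) are described by the standard LSTM cell equations, where \sigma is the logistic sigmoid, \odot the element-wise product, and [h_{t-1}, x_t] the concatenation of the previous hidden state and the current input:

\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) && \text{(input gate)} \\
\tilde{C}_t &= \tanh(W_C [h_{t-1}, x_t] + b_C) && \text{(candidate cell state)} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(cell state update)} \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(C_t) && \text{(hidden state)}
\end{aligned}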
Figure 5
Overview of the general hybrid deep-learning framework for FER, adapted from [53]. The outputs of the CNNs and LSTMs are aggregated in a fusion network to produce a per-frame prediction.
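A minimal PyTorch sketch of such a framework follows: a CNN encodes the spatial features of each frame, an LSTM models their temporal evolution across frames, and a small fusion layer combines both streams into a per-frame emotion prediction. All dimensions, the late-fusion design, and the seven emotion classes are illustrative assumptions, not the exact architecture of [53].

# Minimal sketch of a hybrid CNN-LSTM FER framework (Figure 5).
import torch
import torch.nn as nn

class HybridFER(nn.Module):
    def __init__(self, feat_dim=128, hidden_dim=64, num_classes=7):
        super().__init__()
        # Per-frame CNN: spatial features of an individual frame.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(32 * 12 * 12, feat_dim),
        )
        # LSTM over the sequence of frame features: temporal features.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Fusion of the spatial and temporal streams into per-frame predictions.
        self.fusion = nn.Linear(feat_dim + hidden_dim, num_classes)

    def forward(self, frames):                    # frames: (B, T, 1, 48, 48)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)      # (B, T, feat_dim)
        temporal, _ = self.lstm(feats)                             # (B, T, hidden_dim)
        return self.fusion(torch.cat([feats, temporal], dim=-1))   # (B, T, classes)

model = HybridFER()
logits = model(torch.randn(2, 16, 1, 48, 48))   # 2 clips of 16 grayscale frames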
Figure 6
Examples of nine representative databases related to FER. Databases (a) through (g) support 2D still images and 2D video sequences, and databases (h) and (i) support 3D video sequences.

References

    1. Mehrabian A. Communication without words. Psychol. Today. 1968;2:53–56.
    2. Kaulard K., Cunningham D.W., Bülthoff H.H., Wallraven C. The MPI facial expression database—A validated database of emotional and conversational facial expressions. PLoS ONE. 2012;7:e32321. doi: 10.1371/journal.pone.0032321.
    3. Dornaika F., Raducanu B. Efficient facial expression recognition for human robot interaction; Proceedings of the 9th International Work-Conference on Artificial Neural Networks on Computational and Ambient Intelligence; San Sebastián, Spain. 20–22 June 2007; pp. 700–708.
    4. Bartneck C., Lyons M.J. HCI and the face: Towards an art of the soluble; Proceedings of the International Conference on Human-Computer Interaction: Interaction Design and Usability; Beijing, China. 22–27 July 2007; pp. 20–29.
    5. Hickson S., Dufour N., Sud A., Kwatra V., Essa I.A. Eyemotion: Classifying facial expressions in VR using eye-tracking cameras. arXiv. 2017. arXiv:1707.07204v2.