A Survey of AI-Based Facial Emotion Recognition: Features, ML & DL Techniques, Age-Wise Datasets and Future Directions


Facial emotion recognition using convolutional neural networks (FERC)

  • Research Article
  • Published: 18 February 2020
  • Volume 2, article number 446 (2020)


  • Ninad Mehendale


Detecting emotion from facial expressions has always been an easy task for humans, but achieving the same with a computer algorithm is quite challenging. With recent advances in computer vision and machine learning, it is possible to detect emotions from images. In this paper, we propose a novel technique called facial emotion recognition using convolutional neural networks (FERC). FERC is based on a two-part convolutional neural network (CNN): the first part removes the background from the picture, and the second part concentrates on facial feature vector extraction. In the FERC model, an expressional vector (EV) is used to identify five types of regular facial expression. Supervisory data were obtained from a stored database of 10,000 images (154 persons). It was possible to correctly identify the emotion with 96% accuracy using an EV of length 24 values. The two-level CNN works in series, and the last perceptron layer adjusts the weights and exponent values with each iteration. FERC differs from the generally followed single-level CNN strategies, which improves accuracy. Furthermore, a novel background removal procedure applied before the generation of the EV avoids several problems that may otherwise occur (for example, variation in distance from the camera). FERC was extensively tested with more than 750K images using the extended Cohn–Kanade expression, Caltech Faces, CMU, and NIST datasets. We expect FERC emotion detection to be useful in many applications, such as predictive learning for students, lie detectors, etc.


1 Introduction

Facial expressions are vital identifiers of human feelings because they correspond to the underlying emotions. Most of the time (roughly 55% of cases) [ 1 ], facial expression is a nonverbal channel of emotional expression, and it can be considered concrete evidence to uncover whether an individual is speaking the truth or not [ 2 ].

Current approaches primarily focus on facial analysis while keeping the background intact, and hence build up many unnecessary and misleading features that confuse the CNN training process. This manuscript focuses on five essential facial expression classes: displeasure/anger, sad/unhappy, smiling/happy, fear, and surprise/astonishment [ 3 ]. The FERC algorithm presented in this manuscript aims at expressional examination and at characterizing a given image into one of these five essential emotion classes.

Reported techniques for facial expression detection fall into two major approaches. The first distinguishes expressions [ 4 ] that are identified with an explicit classifier, and the second performs characterization based on extracted facial features [ 5 ]. In the facial action coding system (FACS) [ 6 ], action units (AUs) are used as expression markers; these AUs are discriminable through facial muscle changes.

2 Literature review

Facial expression is a common signal by which all humans convey mood. There have been many attempts to build automatic facial expression analysis tools [ 7 ], as they have applications in many fields such as robotics, medicine, driver assistance systems, and lie detection [ 8 , 9 , 10 ]. In the twentieth century, Ekman et al. [ 11 ] defined seven basic emotions that are expressed irrespective of the culture in which a human grows up: anger, fear, happiness, sadness, contempt [ 12 ], disgust, and surprise. In a recent study on the facial recognition technology (FERET) dataset, Sajid et al. examined the impact of facial asymmetry as a marker of age estimation [ 13 ]. Their finding states that right-face asymmetry is a better marker than left-face asymmetry. Face pose appearance is still a big issue in face detection. Ratyal et al. provided a solution for variability in facial pose appearance using a three-dimensional pose-invariant approach with subject-specific descriptors [ 14 , 15 ]. Many issues, such as excessive makeup [ 16 ] and pose and expression variation [ 17 ], have been addressed using convolutional networks. Recently, researchers have made extraordinary accomplishments in facial expression detection [ 18 , 19 , 20 ], and advances in neuroscience [ 21 ] and cognitive science [ 22 ] continue to drive research in the field of facial expression. Developments in computer vision [ 23 ] and machine learning [ 24 ] also make emotion identification much more accurate and accessible to the general population. As a result, facial expression recognition is growing rapidly as a sub-field of image processing. Some of the possible applications are human–computer interaction [ 25 ], psychiatric observation [ 26 ], drunk-driver recognition [ 27 ], and, most importantly, lie detection [ 28 ].

3 Methodology

The convolutional neural network (CNN) is the most popular way of analyzing images. A CNN differs from a multi-layer perceptron (MLP) in that its hidden layers are convolutional layers. The proposed method is based on a two-level CNN framework. The first level performs background removal [ 29 ] and prepares the image for emotion extraction, as shown in Fig.  1 . Here, a conventional CNN module is used to extract the primary expressional vector (EV). The EV is generated by tracking down relevant facial points of importance and is directly related to changes in expression. The EV is obtained using a basic perceptron unit applied to a background-removed face image. In the proposed FERC model, we also have a non-convolutional perceptron layer as the last stage. Each convolutional layer receives the input data (or image), transforms it, and outputs it to the next level; this transformation is the convolution operation, as shown in Fig.  2 . All the convolutional layers used are capable of pattern detection, and four filters were used within each convolutional layer. The input image fed to the first-part CNN (used for background removal) generally consists of shapes, edges, textures, and objects along with the face. Edge-detector, circle-detector, and corner-detector filters are used at the start of convolutional layer 1. Once the face has been detected, the second-part CNN captures facial features such as eyes, ears, lips, nose, and cheeks. The edge detection filters used in this layer are shown in Fig.  3 a. The second-part CNN consists of layers with \(3\times 3\) kernel matrices, e.g., [0.25, 0.17, 0.9; 0.89, 0.36, 0.63; 0.7, 0.24, 0.82]. These values are initially selected between 0 and 1 and are then optimized for EV detection based on the ground truth available in the supervisory training dataset; minimum-error decoding was used to optimize the filter values. Once a filter is tuned by supervisory learning, it is applied to the background-removed face (i.e., the output image of the first-part CNN) for detection of different facial parts (e.g., eyes, lips, nose, ears, etc.).
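The exact layer configuration of the second-part (feature-extraction) CNN is not spelled out beyond the \(3\times 3\) kernels, the four filters per layer, and the final perceptron stage, so the following is only a minimal Keras sketch under assumed input size and activations, not the author's published implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, models


def build_feature_cnn(input_shape=(96, 96, 1), ev_length=24):
    """Sketch of a second-part CNN with four conv layers of four 3x3 filters
    each (the 4-layer / 4-filter setting reported in Sect. 4). Input size,
    pooling, and activations are illustrative assumptions."""
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    for _ in range(4):
        model.add(layers.Conv2D(4, (3, 3), activation="relu", padding="same"))
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    # Final non-convolutional perceptron stage producing a 24-value EV.
    model.add(layers.Dense(ev_length, activation="sigmoid"))
    return model


model = build_feature_cnn()
model.compile(optimizer="adam", loss="mse")
model.summary()
```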

Figure 1

a Block diagram of FERC. The input image is taken from the camera or extracted from the video. The input image is then passed to the first-part CNN for background removal. After background removal, the facial expressional vector (EV) is generated. Another CNN (the second-part CNN) is applied with the supervisory model obtained from the ground-truth database. Finally, the emotion in the current input image is detected. b Facial vectors marked on the background-removed face. Here, nose (N), lip (P), forehead (F), and eyes (Y) are marked using edge detection and nearest-cluster mapping. The positions left, right, and center are represented by L, R, and C, respectively

Figure 2

Convolution filter operation with the \(3 \times 3\) kernel. Each pixel from the input image and its eight neighboring pixels are multiplied with the corresponding value in the kernel matrix, and finally, all multiplied values are added together to achieve the final output value

Figure 3

a Vertical and horizontal edge detector filter matrices used at layer 1 of the background removal CNN (first-part CNN). b Sample EV matrix showing all 24 values in pixels (top) and the parameter measured (bottom). c Representation of a point in the image domain (top panel) and in the Hough transform domain (bottom panel) using the Hough transform

To generate the EV matrix, 24 facial features in all are extracted. The EV feature vector consists of the normalized Euclidean distances between the face parts, as shown in Fig.  3 b.
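As an illustration of how such a distance-based feature vector can be assembled, the sketch below computes normalized Euclidean distances between hypothetical landmark coordinates; the particular 24 point pairs used by FERC are not enumerated in the text, so all pairwise distances are taken here as a stand-in.

```python
import numpy as np
from itertools import combinations


def expression_vector(landmarks, face_width):
    """Build an EV-style feature vector from detected facial points.

    `landmarks` maps face-part names to (x, y) pixel coordinates; distances
    are normalized by the face width. The specific pairs used by FERC are
    an assumption here: every pairwise distance is returned.
    """
    points = list(landmarks.values())
    dists = [np.linalg.norm(np.subtract(p, q)) / face_width
             for p, q in combinations(points, 2)]
    return np.array(dists)


# Hypothetical landmark positions (pixels) for illustration only.
lm = {"left_eye": (30, 40), "right_eye": (70, 40),
      "nose": (50, 60), "lip_center": (50, 85)}
print(expression_vector(lm, face_width=100.0))
```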

3.1 Key frame extraction from input video

FERC works with image as well as video input. When the input to FERC is a video, the difference between successive frames is computed. Maximally stable frames occur whenever the inter-frame difference is zero. A Canny edge detector is then applied to each stable frame, and the aggregated sum of white pixels is calculated. After comparing the aggregated sums of all stable frames, the frame with the maximum sum is selected, because this frame has the most detail in terms of edges (more edges, more detail). This frame is then used as the input to FERC. The logic behind this choice is that blurry frames have few or no edges.
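A minimal OpenCV sketch of this key-frame selection logic follows; the stability threshold and Canny parameters are assumptions, since the text does not specify them.

```python
import cv2
import numpy as np


def select_key_frame(video_path, diff_threshold=1.0):
    """Pick the most detailed stable frame, following the logic of Sect. 3.1.

    A frame is treated as 'stable' when its mean absolute difference to the
    previous frame is near zero; among stable frames, the one with the most
    white Canny-edge pixels is returned.
    """
    cap = cv2.VideoCapture(video_path)
    prev, best_frame, best_score = None, None, -1
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev is not None and np.mean(cv2.absdiff(gray, prev)) < diff_threshold:
            edges = cv2.Canny(gray, 100, 200)
            score = int(np.count_nonzero(edges))  # aggregated white pixels
            if score > best_score:
                best_score, best_frame = score, frame
        prev = gray
    cap.release()
    return best_frame
```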

3.2 Background removal

Once the input image is obtained, a skin tone detection algorithm [ 30 ] is applied to extract the human body parts from the image. The skin tone-detected output is a binary image and is used as a feature for the first layer of the background removal CNN (also referred to as the first-part CNN in this manuscript). Skin tone detection depends on the type of input image. If the image is a color image, a YCbCr color threshold can be used: for skin tone, the Y value should be greater than 80, Cb should range between 85 and 140, and Cr should range between 135 and 200. These values were chosen by trial and error and worked for almost all of the skin tones available. We found that if the input image is grayscale, the skin tone detection algorithm has very low accuracy. To improve accuracy during background removal, the CNN also uses a circles-in-circle filter. This filter operation uses Hough transform values for circle detection. To maintain uniformity irrespective of the type of input image, the Hough transform (Fig.  3 c) was always used as the second input feature to the background removal CNN. The formula used for the Hough transform is shown in Eq.  1
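The following OpenCV sketch shows how the stated YCbCr thresholds and a circle Hough transform could be combined into the two input features; the HoughCircles parameters are illustrative assumptions, and OpenCV orders the channels Y, Cr, Cb, so the bounds are arranged accordingly.

```python
import cv2
import numpy as np


def skin_mask_and_circles(bgr_image):
    """Skin-tone mask and circle detections as inputs to the first-part CNN.

    Thresholds follow Sect. 3.2 (Y > 80, 85 <= Cb <= 140, 135 <= Cr <= 200).
    """
    ycrcb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2YCrCb)
    lower = np.array([80, 135, 85], dtype=np.uint8)   # Y, Cr, Cb
    upper = np.array([255, 200, 140], dtype=np.uint8)
    skin = cv2.inRange(ycrcb, lower, upper)           # binary skin-tone image

    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    circles = cv2.HoughCircles(gray, cv2.HOUGH_GRADIENT, dp=1.2, minDist=30,
                               param1=100, param2=40,
                               minRadius=5, maxRadius=100)
    return skin, circles
```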

3.3 Convolution filter

As shown in Fig.  2 , for each convolution operation the entire image is divided into overlapping \(3\times 3\) matrices, and the corresponding \(3\times 3\) filter is convolved over each \(3\times 3\) matrix obtained from the image. This sliding and dot-product operation is called ‘convolution,’ hence the name ‘convolutional filter.’ During the convolution, the dot product of the two \(3\times 3\) matrices is computed and stored at the corresponding location, e.g., (1,1), in the output, as shown in Fig.  2 . Once the entire output matrix has been calculated, it is passed to the next layer of the CNN for another round of convolution. The last layer of the face-feature-extracting CNN is a simple perceptron, which tries to optimize the values of the scale factor and exponent depending on the deviation from the ground truth.
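A plain NumPy sketch of this sliding dot-product operation (valid mode, stride 1) is shown below, using the example kernel values quoted in Sect. 3.

```python
import numpy as np


def convolve2d_3x3(image, kernel):
    """Naive valid-mode convolution matching the description in Sect. 3.3:
    the 3x3 kernel slides over every overlapping 3x3 patch of the image and
    the element-wise dot product is stored at the corresponding output pixel.
    (Strictly this is cross-correlation, the operation CNN layers compute.)"""
    h, w = image.shape
    out = np.zeros((h - 2, w - 2), dtype=float)
    for i in range(h - 2):
        for j in range(w - 2):
            patch = image[i:i + 3, j:j + 3]
            out[i, j] = np.sum(patch * kernel)
    return out


kernel = np.array([[0.25, 0.17, 0.90],
                   [0.89, 0.36, 0.63],
                   [0.70, 0.24, 0.82]])   # example kernel from the text
img = np.random.rand(8, 8)
print(convolve2d_3x3(img, kernel).shape)  # (6, 6)
```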

3.4 Hardware and software details

All programs were executed on a Lenovo Yoga 530 laptop with an Intel i5 8th-generation CPU, 8 GB RAM, and a 512 GB SSD. The software used to run the experiments was Python (using the Thonny IDE), MATLAB 2018a, and ImageJ.

4 Results and discussions

To analyze the performance of the algorithm, the extended Cohn–Kanade expression dataset [ 31 ] was used initially. The dataset had only 486 sequences from 97 posers, which limited accuracy to a maximum of about 45%. To overcome this low efficiency, additional datasets were downloaded from the Internet [ 32 , 33 ], and the author’s own pictures at different expressions were also included. As the number of images in the dataset increased, the accuracy also increased. We kept 70% of the 10K dataset images for training and 30% for testing. In all, 25 iterations were carried out, each with a different 70% training split, and the error bar was computed as the standard deviation across iterations. Figure  4 a shows the optimization of the number of CNN layers. For simplicity, the number of layers and the number of filters were kept the same for the background removal CNN (first-part CNN) and the face feature extraction CNN (second-part CNN). In this study, the number of layers was varied from 1 to 8, and maximum accuracy was obtained at around 4. This was not very intuitive, as one might assume that the number of layers is directly proportional to accuracy and inversely proportional to execution time. Because maximum accuracy was obtained with 4 layers, the number of layers was set to 4. Execution time increased with the number of layers but did not add significant value to this study and is therefore not reported in the current manuscript. Figure  4 b shows the optimization of the number of filters per layer. Again, 1–8 filters were tried for each of the four-layer CNN networks, and four filters gave good accuracy. Hence, FERC was designed with four layers and four filters. As future work, researchers could vary the number of layers of the two CNNs independently, and a vast amount of work could be done by feeding each layer a different number of filters; this could be automated using servers. Owing to the author's computational power limitations, this study was not carried out, but it would be highly appreciated if other researchers found a better configuration than 4 layers and 4 filters and pushed the accuracy beyond the 96% that we achieved. Figure  4 c, e shows regular front-facing cases with angry and surprised emotions, which the algorithm could easily detect (Fig.  4 d, f). The only challenging part of these images was skin tone detection, because of their grayscale nature. With color images, background removal with the help of skin tone detection was straightforward, but with grayscale images we observed false face detection in many cases. The image shown in Fig.  4 g was challenging because of its orientation. Fortunately, with the 24-dimensional EV feature vector, FERC could correctly classify faces oriented up to 30°. We accept that the method has some limitations, such as high computing power during CNN tuning, and facial hair causes many issues. Other than these problems, the accuracy of our algorithm is high (96%), which is comparable to most reported studies (Table  2 ). One major limitation occurs when not all 24 features of the EV vector can be obtained because of orientation or shadow on the face. The authors are trying to overcome the shadow limitation by automated gamma correction on images (manuscript under preparation). For orientation, we could not find a strong solution other than assuming facial symmetry: missing feature parameters are generated by copying the corresponding 12 values to the missing entries in the EV matrix (e.g., the distance between the left eye and the left ear (LY–LE) is assumed to be the same as that between the right eye and the right ear (RY–RE), etc.). The algorithm also failed when multiple faces were present in the same image at equal distance from the camera. For testing, the 30% of the dataset not used for training was used. For each pre-processing epoch, all of the data were taken as fresh samples across the 25 folds of training. To assess the performance of FERC with large datasets, the Caltech Faces, CMU, and NIST databases were used (Table 1). Accuracy was found to go down with an increasing number of images because of over-fitting, and it also remained low when the number of training images was small. The ideal number of images for FERC to work properly was found to be in the range of 2,000–10,000.
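A sketch of the repeated 70/30 evaluation protocol described above is given below; `train_model` is a hypothetical stand-in for the FERC training step, since the full training code is not part of the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def repeated_holdout_accuracy(features, labels, train_model, n_folds=25):
    """Repeat the 70/30 split evaluation described in Sect. 4 and report the
    mean accuracy with its standard deviation (used as the error bar).

    `train_model(X, y)` is assumed to return a fitted classifier exposing a
    .score(X, y) method; it stands in for the FERC training procedure.
    """
    scores = []
    for seed in range(n_folds):
        X_tr, X_te, y_tr, y_te = train_test_split(
            features, labels, train_size=0.7, stratify=labels,
            random_state=seed)
        clf = train_model(X_tr, y_tr)
        scores.append(clf.score(X_te, y_te))
    return float(np.mean(scores)), float(np.std(scores))
```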

Figure 4

a Optimization for the number of CNN layers. Maximum accuracy was achieved for four-layer CNN. b Optimization for the number of filters. Four filters per layer gave maximum accuracy. c , e , g Different input images from the dataset. d , f , h The output of background removal with a final predicted output of emotion

4.1 Comparison with other methods

As shown in Table  2 , FERC is a unique method built from two 4-layer networks with an accuracy of 96%, whereas others have used a combined approach, solving background removal and facial expression detection within a single CNN network. Addressing the two issues separately reduces complexity and also the tuning time. Although we considered only five moods to classify, the sixth and seventh mood cases were misclassified, adding to the error. Zhao et al. [ 37 ] achieved accuracy of up to 99.3%, but at the cost of a 22-layer neural network; training such a large network is time-consuming. Compared to existing methods, only FERC has a key frame extraction method, whereas others use only the last frame. Jung et al. [ 38 ] worked with fixed frames, which makes their system less efficient with video input. The number of folds of training in most other cases was only ten, whereas we could go up to 25-fold training because of the small network size.

As shown in Table  3 , FERC has similar complexity to AlexNet. FERC is much faster compared to VGG, GoogleNet, and ResNet. In terms of accuracy, FERC outperforms these existing standard networks. However, in some cases we found that GoogleNet outperforms FERC, especially when GoogleNet's iterations reach the range of 5,000 and above.

Another unique contribution of FERC is the skin tone-based feature and the Hough transform for the circles-in-circle filter. Skin tone detection is a fast and robust method of pre-processing the input data. We expect that with these new functionalities, FERC will be the most preferred method for mood detection in the upcoming years.

5 Conclusions

FERC is a novel way of performing facial emotion detection that uses the advantages of CNNs and supervised learning (feasible due to big data). The main advantage of the FERC algorithm is that it works with different orientations (less than 30°) thanks to the unique 24-value EV feature vector. Background removal added a great advantage in accurately determining the emotions. FERC could be the starting step for many emotion-based applications, such as lie detection and mood-based learning for students.

Mehrabian A (2017) Nonverbal communication. Routledge, London


Bartlett M, Littlewort G, Vural E, Lee K, Cetin M, Ercil A, Movellan J (2008) Data mining spontaneous facial behavior with automatic expression coding. In: Esposito A, Bourbakis NG, Avouris N, Hatzilygeroudis I (eds) Verbal and nonverbal features of human–human and human–machine interaction. Springer, Berlin, pp 1–20


Russell JA (1994) Is there universal recognition of emotion from facial expression? A review of the cross-cultural studies. Psychol Bull 115(1):102


Gizatdinova Y, Surakka V (2007) Automatic detection of facial landmarks from AU-coded expressive facial images. In: 14th International conference on image analysis and processing (ICIAP). IEEE, pp 419–424

Liu Y, Li Y, Ma X, Song R (2017) Facial expression recognition with fusion features extracted from salient facial areas. Sensors 17(4):712

Ekman R (1997) What the face reveals: basic and applied studies of spontaneous expression using the facial action coding system (FACS). Oxford University Press, New York

Zafar B, Ashraf R, Ali N, Iqbal M, Sajid M, Dar S, Ratyal N (2018) A novel discriminating and relative global spatial image representation with applications in CBIR. Appl Sci 8(11):2242

Ali N, Zafar B, Riaz F, Dar SH, Ratyal NI, Bajwa KB, Iqbal MK, Sajid M (2018) A hybrid geometric spatial image representation for scene classification. PLoS ONE 13(9):e0203339

Ali N, Zafar B, Iqbal MK, Sajid M, Younis MY, Dar SH, Mahmood MT, Lee IH (2019) Modeling global geometric spatial information for rotation invariant classification of satellite images. PLoS ONE 14:7

Ali N, Bajwa KB, Sablatnig R, Chatzichristofis SA, Iqbal Z, Rashid M, Habib HA (2016) A novel image retrieval based on visual words integration of SIFT and SURF. PLoS ONE 11(6):e0157428

Ekman P, Friesen WV (1971) Constants across cultures in the face and emotion. J Personal Soc Psychol 17(2):124

Matsumoto D (1992) More evidence for the universality of a contempt expression. Motiv Emot 16(4):363

Sajid M, Iqbal Ratyal N, Ali N, Zafar B, Dar SH, Mahmood MT, Joo YB (2019) The impact of asymmetric left and asymmetric right face images on accurate age estimation. Math Probl Eng 2019:1–10

Ratyal NI, Taj IA, Sajid M, Ali N, Mahmood A, Razzaq S (2019) Three-dimensional face recognition using variance-based registration and subject-specific descriptors. Int J Adv Robot Syst 16(3):1729881419851716

Ratyal N, Taj IA, Sajid M, Mahmood A, Razzaq S, Dar SH, Ali N, Usman M, Baig MJA, Mussadiq U (2019) Deeply learned pose invariant image analysis with applications in 3D face recognition. Math Probl Eng 2019:1–21


Sajid M, Ali N, Dar SH, Iqbal Ratyal N, Butt AR, Zafar B, Shafique T, Baig MJA, Riaz I, Baig S (2018) Data augmentation-assisted makeup-invariant face recognition. Math Probl Eng 2018:1–10

Ratyal N, Taj I, Bajwa U, Sajid M (2018) Pose and expression invariant alignment based multi-view 3D face recognition. KSII Trans Internet Inf Syst 12:10

Xie S, Hu H (2018) Facial expression recognition using hierarchical features with deep comprehensive multipatches aggregation convolutional neural networks. IEEE Trans Multimedia 21(1):211

Danisman T, Bilasco M, Ihaddadene N, Djeraba C (2010) Automatic facial feature detection for facial expression recognition. In: Proceedings of the International conference on computer vision theory and applications, pp 407–412. https://doi.org/10.5220/0002838404070412

Mal HP, Swarnalatha P (2017) Facial expression detection using facial expression model. In: 2017 International conference on energy, communication, data analytics and soft computing (ICECDS). IEEE, pp 1259–1262

Parr LA, Waller BM (2006) Understanding chimpanzee facial expression: insights into the evolution of communication. Soc Cogn Affect Neurosci 1(3):221

Dols JMF, Russell JA (2017) The science of facial expression. Oxford University Press, Oxford

Kong SG, Heo J, Abidi BR, Paik J, Abidi MA (2005) Recent advances in visual and infrared face recognition—a review. Comput Vis Image Underst 97(1):103

Xue Yl, Mao X, Zhang F (2006) Beihang university facial expression database and multiple facial expression recognition. In: 2006 International conference on machine learning and cybernetics. IEEE, pp 3282–3287

Kim DH, An KH, Ryu YG, Chung MJ (2007) A facial expression imitation system for the primitive of intuitive human-robot interaction. In: Sarkar N (ed) Human robot interaction. IntechOpen, London

Ernst H (1934) Evolution of facial musculature and facial expression. J Nerv Ment Dis 79(1):109

Kumar KC (2012) Morphology based facial feature extraction and facial expression recognition for driver vigilance. Int J Comput Appl 51:2

Hernández-Travieso JG, Travieso CM, Pozo-Baños D, Alonso JB et al (2013) Expression detector system based on facial images. In: BIOSIGNALS 2013-proceedings of the international conference on bio-inspired systems and signal processing

Cowie R, Douglas-Cowie E, Tsapatsoulis N, Votsis G, Kollias S, Fellenz W, Taylor JG (2001) Emotion recognition in human–computer interaction. IEEE Signal Process Mag 18(1):32

Hsu RL, Abdel-Mottaleb M, Jain AK (2002) Face detection in color images. IEEE Trans Pattern Anal Mach Intell 24(5):696

Lucey P, Cohn JF, Kanade T, Saragih J, Ambadar Z, Matthews I (2010) The extended Cohn–Kanade dataset (ck+): a complete dataset for action unit and emotion-specified expression. In: 2010 IEEE computer society conference on computer vision and pattern recognition-workshops. IEEE, pp 94–101

Littlewort G, Whitehill J, Wu T, Fasel I, Frank M, Movellan J, Bartlett M (2011) The computer expression recognition toolbox (CERT). In: Face and gesture 2011. IEEE, pp 298–305

Shan C, Gong S, McOwan PW (2009) Facial expression recognition based on local binary patterns: a comprehensive study. Image Vis Comput 27(6):803

Caltech Faces (2020) http://www.vision.caltech.edu/html-files/archive.html . Accessed 05 Jan 2020

The CMU multi-pie face database (2020) http://ww1.multipie.org/ . Accessed 05 Jan 2020

NIST mugshot identification database (2020) https://www.nist.gov/itl/iad/image-group/resources/biometric-special-databases-and-software . Accessed 05 Jan 2020

Zhao X, Liang X, Liu L, Li T, Han Y, Vasconcelos N, Yan S (2016) Peak-piloted deep network for facial expression recognition. In: European conference on computer vision. Springer, pp 425–442

Jung H, Lee S, Yim J, Park S, Kim J (2015) Joint fine-tuning in deep neural networks for facial expression recognition. In: Proceedings of the IEEE international conference on computer vision. pp 2983–2991

Zhang K, Huang Y, Du Y, Wang L (2017) Facial expression recognition based on deep evolutional spatial-temporal networks. IEEE Trans Image Process 26(9):4193


Wu YL, Tsai HY, Huang YC, Chen BH (2018) Accurate emotion recognition for driving risk prevention in driver monitoring system. In: 2018 IEEE 7th global conference on consumer electronics (GCCE). IEEE, pp 796–797

Gajarla V, Gupta A (2015) Emotion detection and sentiment analysis of images. Georgia Institute of Technology, Atlanta

Giannopoulos P, Perikos I, Hatzilygeroudis I (2018) Deep learning approaches for facial emotion recognition: a case study on FER-2013. In: Hatzilygeroudis I, Palade V (eds) Advances in hybridization of intelligent methods. Springer, Berlin, pp 1–16


Acknowledgements

The author would like to thank Dr. Madhura Mehendale for her constant support on database generation and corresponding ground truths cross-validation. Also, the author would like to thank all the colleagues at K. J. Somaiya College of Engineering.

Author information

Authors and Affiliations

Ninad’s Research Lab, Thane, India

Ninad Mehendale

K. J. Somaiya College of Engineering, Mumbai, India


Corresponding author

Correspondence to Ninad Mehendale.

Ethics declarations

Conflict of interest.

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Mehendale, N. Facial emotion recognition using convolutional neural networks (FERC). SN Appl. Sci. 2, 446 (2020). https://doi.org/10.1007/s42452-020-2234-1


Received: 16 July 2019

Accepted: 12 February 2020

Published: 18 February 2020

DOI: https://doi.org/10.1007/s42452-020-2234-1


  • Emotion recognition
  • Facial expression

ORIGINAL RESEARCH article

Automatic facial expression recognition in standardized and non-standardized emotional expressions.

Theresa Küntzler

  • 1 Department of Politics and Public Administration, Center for Image Analysis in the Social Sciences, Graduate School of Decision Science, University of Konstanz, Konstanz, Germany
  • 2 Department of Psychology, School of Social Sciences, University of Mannheim, Mannheim, Germany

Emotional facial expressions can inform researchers about an individual's emotional state. Recent technological advances open up new avenues to automatic Facial Expression Recognition (FER). Based on machine learning, such technology can tremendously increase the amount of processed data. FER is now easily accessible and has been validated for the classification of standardized prototypical facial expressions. However, applicability to more naturalistic facial expressions still remains uncertain. Hence, we test and compare performance of three different FER systems (Azure Face API, Microsoft; Face++, Megvii Technology; FaceReader, Noldus Information Technology) with human emotion recognition (A) for standardized posed facial expressions (from prototypical inventories) and (B) for non-standardized acted facial expressions (extracted from emotional movie scenes). For the standardized images, all three systems classify basic emotions accurately (FaceReader is most accurate) and they are mostly on par with human raters. For the non-standardized stimuli, performance drops remarkably for all three systems, but Azure still performs similarly to humans. In addition, all systems and humans alike tend to misclassify some of the non-standardized emotional facial expressions as neutral. In sum, emotion recognition by automated facial expression recognition can be an attractive alternative to human emotion recognition for standardized and non-standardized emotional facial expressions. However, we also found limitations in accuracy for specific facial expressions; clearly there is need for thorough empirical evaluation to guide future developments in computer vision of emotional facial expressions.

1. Introduction

Detecting emotional processes in humans is important in many research fields such as psychology, affective neuroscience, or political science. Emotions influence information processing (e.g., Marcus et al., 2000 ; Meffert et al., 2006 ; Fraser et al., 2012 ; Soroka and McAdams, 2015 ), attitude formation (e.g., Lerner and Keltner, 2000 ; Marcus, 2000 ; Brader, 2005 ), and decision making ( Clore et al., 2001 ; Slovic et al., 2007 ; Pittig et al., 2014 ). One well-established strategy to measure emotional reactions of individuals is to track their facial expressions ( Scherer and Ellgring, 2007 ; Keltner and Cordaro, 2017 ). The classic approach to analyse emotional facial responses is either expert observation such as the Facial Action Coding System (FACS) ( Sullivan and Masters, 1988 ; Ekman and Rosenberg, 1997 ; Cohn et al., 2007 ) or direct measurement of facial muscle activity with electromyography (EMG) ( Cohn et al., 2007 ). Both are, however, time-consuming with respect to both application and analysis.

A potential alternative to facilitate, standardize, and scale research on facial expressions is automatic image-based Facial Expression Recognition (FER), which has recently emerged from computer vision technology. Using machine learning, algorithms are being developed that extract emotion scores from observed facial expressions ( Goodfellow et al., 2015 ; Arriaga et al., 2017 ; Quinn et al., 2017 ), which is considerably more time and cost efficient compared to classical approaches ( Bartlett et al., 1999 ). FER is easily accessible to researchers of all fields and is increasingly used by the scientific community. Applications can be found, for example, in psychology, where such algorithms are used to predict mental health from social media images ( Yazdavar et al., 2020 ), to validate interventions for autism ( Wu et al., 2019 ), or to screen for Parkinson's disease ( Jin et al., 2020 ). A sociological example is the assessment of collective happiness in society from social media images ( Abdullah et al., 2015 ). In political science, one example is the study of representation of politicians in the media using FER ( Boxell, 2018 ; Peng, 2018 ; Haim and Jungblut, 2020 ). Furthermore, the technology is used in consumer and market research, for example to predict advertisement efficiency ( Lewinski et al., 2014 ; Teixeira et al., 2014 ; Bartkiene et al., 2019 ).

1.1. Prototypical vs. Naturalistic Facial Expressions

Training and testing of FER tools is typically conducted on data sets, which contain prototypical and potentially exaggerated expressions ( Dhall et al., 2012 ). The images of these inventories are created under standardized (detailed instructions for the actors) and well-controlled conditions (e.g., lighting, frontal face angle; Lewinski et al., 2014 ; Calvo et al., 2018 ; Stöckli et al., 2018 ; Beringer et al., 2019 ; Skiendziel et al., 2019 ). As a result, the classification performance of FER systems and its generalizability to non-standardized and more naturalistic facial expressions is uncertain.

For prototypical facial expressions, FER also corresponds well to human FACS coding (Bartlett et al., 1999; Tian et al., 2001; Skiendziel et al., 2019) and non-expert human classification (Bartlett et al., 1999; Lewinski, 2015; Calvo et al., 2018; Stöckli et al., 2018). Accuracy is high for static images (Lewinski et al., 2014; Lewinski, 2015; Stöckli et al., 2018; Beringer et al., 2019) as well as for dynamic facial expressions from standardized inventories (Mavadati et al., 2013; Zhang et al., 2014; Yitzhak et al., 2017; Calvo et al., 2018). There is also growing evidence that FER provides valid measures for most emotion categories if naive participants are instructed to pose intense emotional facial expressions in a typical lab setting with frontal face recording and good lighting conditions (Stöckli et al., 2018; Beringer et al., 2019; Sato et al., 2019; Kulke et al., 2020). However, all of these studies presented their participants with prototypical facial expressions and instructed them to mimic these visual cues. This might result in an overestimation of FER performance in comparison to non-standardized, and moreover truly naturalistic, emotional facial expressions.

Previous research also documents systematic misclassification across different FER systems and emotion categories. For fear, studies find a consistently lower accuracy compared to other emotion categories (Lewinski et al., 2014; Stöckli et al., 2018; Skiendziel et al., 2019). Some studies also report a substantial decrease in accuracy for anger (Lewinski et al., 2014; Stöckli et al., 2018; Dupré et al., 2020), whereas Skiendziel et al. (2019) report an improvement of this measurement in their study. Less consistently, sadness (Lewinski et al., 2014; Skiendziel et al., 2019) and disgust are also found to be error prone (Skiendziel et al., 2019). In contrast, the facial expression of joy is systematically classified with the highest accuracy (Stöckli et al., 2018; Skiendziel et al., 2019; Dupré et al., 2020). When looking at confusion between emotions in prior studies, FaceReader shows a tendency toward increased neutral measures for all other emotions (Lewinski et al., 2014) and a tendency to misclassify fearful faces as surprise (Stöckli et al., 2018; Skiendziel et al., 2019). Studies that compared different FER systems consistently find a large variation in performance between systems (Stöckli et al., 2018; Dupré et al., 2020), which underlines the need for comparative studies.

Besides a general lack of studies that directly compare different FER systems, empirical validation of FER for recognizing emotional facial expressions is limited to intensely posed expressions. In contrast to such images, naturalistic or spontaneous facial expressions show stronger variation and are often less intense than standardized facial expressions (Calvo and Nummenmaa, 2016; Barrett et al., 2019). For example, Sato et al. (2019) find a strong decrease in FER performance if participants respond spontaneously to imagined emotional episodes. Höfling et al. (2020) report strong correlations between FER parameters and the emotion ratings of participants who spontaneously respond to pleasant emotional scenes, but find no evidence for valid FER detection of spontaneous unpleasant facial reactions. Other studies report a decrease in FER emotion recognition for more subtle and naturalistic facial expressions (Höfling et al., 2021) and find a superiority of humans in decoding such emotional facial responses (Yitzhak et al., 2017; Dupré et al., 2020). However, the data sets applied are still comprised of images collected in a controlled lab setting, with little variation in lighting, camera angle, or age of the subjects, which might further decrease FER performance under less restricted recording conditions.

1.2. Aims, Overview, and Expectations

In summary, FER offers several advantages in terms of efficiency and we already know that it performs well on standardized, prototypical emotional facial expressions. Despite many advantages of FER application and their validity to decode prototypical facial expression, the quality of the expression measurement and its generalizability to less standardized facial expressions is uncertain. Because the underlying algorithms remain unclear to the research community, including the applied machine-learning and its specific training procedure, empirical performance evaluation is urgently needed. Hence, this paper has two main aims: First, we provide an evaluation and a comparison of three widely used systems that are trained to recognize emotional facial expressions (FaceReader, Face++, and the Azure Face API) and compare them with human emotion recognition data as a benchmark. Second, we evaluate the systems on acted standardized and non-standardized emotional facial expressions: The standardized facial expressions are a collection of four facial expression inventories created in a lab setting displaying intense prototypical facial expressions [The Karolinska Directed Emotional Faces ( Lundqvist et al., 1998 ), the Radboud Faces Database ( Langer et al., 2010 ), the Amsterdam Dynamic Facial Expression Set ( Van der Schalk et al., 2011 ), and the Warsaw Set of Emotional Facial Expression ( Olszanowski et al., 2015 )]. To approximate more naturalistic emotional expressions, we use a data set of non-standardized facial expressions: The Static Facial Expressions in the Wild data set ( Dhall et al., 2018 ), which is built from movie scenes and covers a larger variety of facial expressions, lighting, camera position, and actor ages.

FER systems provide estimations of the intensity of specific emotional facial expressions through two subsequent steps: the first step is face detection, including facial feature detection, and the second step is face classification into an emotion category. For face detection, we expect that different camera angles, as well as characteristics of the face such as glasses or beards, will increase FER face detection failures, resulting in higher rates of drop out. We expect the standardized expressions to result in less drop out due to failures in face detection, since the camera angle is constantly frontal and no other objects such as glasses obstruct the faces. Correspondingly, we expect more drop out in the non-standardized data set, which means more images where faces are not detected, since the variability of the facial expressions is higher. For the second step (i.e., emotion classification), we expect strong variation between emotion categories (e.g., increased performance for joy faces, decreased performance on fear faces). We further expect a tendency toward the neutral category and a misclassification of fear as surprise. As explained for the drop outs, we assume the non-standardized images to be more variable and therefore more difficult to classify. The overall performance on the non-standardized data is therefore expected to be lower. This research provides important information about the generalizability of FER to more naturalistic, non-standardized emotional facial expressions, and moreover a performance comparison of specific FER systems.

2. Materials and Methods

We use three different facial expression recognition tools and human emotion recognition data to analyze emotional facial expressions in more or less standardized facial expressions. As an approximation of standardized and non-standardized facial expressions, we analyze static image inventories of actors who were instructed to display prototypical emotional expressions and, in addition, an inventory of actors displaying more naturalistic emotional facial expressions in movie stills. We extract probability parameters for facial expressions corresponding to the six basic emotions (joy, anger, sadness, disgust, fear, and surprise) plus neutral from all tools. As a benchmark, we collect data from human raters who rated subsets of the same images.

2.1. Images of Facial Expressions

We test the different FER tools as well as human facial recognition data on standardized and non-standardized emotional facial expressions displayed in still images. All selected inventories are publicly available for research and contain emotional facial expression images of the basic emotion categories. Table 1 displays the emotion categories and image distributions for both data sets (i.e., standardized and non-standardized) including drop out rates specifically for the three FER tools.


Table 1 . Category distributions of test data and drop outs of Azure, Face++, and FaceReader.

Standardized facial expressions are a collection of images created in the lab under controlled conditions (i.e., good lighting, frontal head positions, directed view) displaying prototypical expressions of clearly defined emotions. In order to maximize image quantity and introduce more variability, the prototypical images consist of four databases: (1) The Karolinska Directed Emotional Faces contains images of 35 males and 35 females between 20 and 30 years old (Lundqvist et al., 1998). The present study uses all frontal images (resolution: 562 × 762). (2) The Radboud Faces Database contains images of facial expressions of 20 male and 19 female Caucasian Dutch adults (Langer et al., 2010). We used the subset of adult models looking straight into the camera with images taken frontally (resolution: 681 × 1,024). (3) The Amsterdam Dynamic Facial Expression Set, from which we used the still image set (resolution: 720 × 576). The models are distinguished as Northern-European (12 models, 5 females) and Mediterranean (10 models, 5 of them female; Van der Schalk et al., 2011). (4) The Warsaw Set of Emotional Facial Expression offers images of 40 models (16 females, 14 males) displaying emotional facial expressions (Olszanowski et al., 2015). Images are taken frontally, and the complete set is used in this study (resolution: 1,725 × 1,168). This results in a total of 1,246 images evenly distributed over the relevant emotion categories.

Non-standardized facial expressions stem from a data set that was developed as a benchmark test for computer vision research for more naturalistic settings. The Static Facial Expressions in the Wild (SFEW) data set consists of stills from movie scenes that display emotions in the actors' faces. Examples of movies are “Harry Potter” or “Hangover” ( Dhall et al., 2018 ). This study uses the updated version ( Dhall et al., 2018 ). The data set was compiled using the subtitles for the deaf and hearing impaired and closed caption subtitles. These subtitles contain not only the spoken text, but additional information about surrounding sounds, such as laughter. The subtitles were automatically searched for words suggesting emotional content. Scenes resulting from this search were then suggested to trained human coders, who classified and validated the final selection of emotional facial expressions for this inventory ( Dhall et al., 2012 ). We use these images to rigorously test how well the systems perform on images that are not prototypical and not taken under standardized conditions (variable lighting and head positions). The inventory consists of 1,387 images (resolution: 720 × 576) which are unevenly distributed across emotion categories (minimum of 88 images for disgust and a maximum of 270 images for joy).

2.2. Facial Expression Recognition Tools

We test three FER tools: the Microsoft Azure Face API (Version 1.0, Microsoft), Face++ (Version 3.0, Megvii Technology), and FaceReader (Version 8.0, Noldus Information Technology). The first two are easily accessible APIs that also offer a free subscription; FaceReader is software installed locally on a computer and is well-established in the research community. Each of the systems allows faces in images to be analysed, with functions such as face detection, face verification, and emotion recognition. They all provide probability scores for neutral, joy, sadness, anger, disgust, fear, and surprise. While the scores of Azure and FaceReader range between 0 and 1, Face++ uses a scale from 1 to 100; we thus rescale the Face++ scores to 0 to 1. FaceReader additionally provides a quality parameter, and it is suggested to remove images if the quality of face detection is too low. Therefore, we remove all images with a quality parameter below 70%.
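A small pandas sketch of this harmonization step (rescaling Face++ and applying the FaceReader quality cut-off) is shown below; the column names are assumptions for illustration, not the tools' actual output schema.

```python
import pandas as pd

EMOTIONS = ["neutral", "joy", "sadness", "anger", "disgust", "fear", "surprise"]


def harmonize_scores(facepp: pd.DataFrame, facereader: pd.DataFrame):
    """Put all tools on a common 0-1 scale as described in Sect. 2.2.

    Assumed layout: `facepp` holds Face++ scores on a 0-100 scale in the
    EMOTIONS columns; `facereader` holds 0-1 scores plus a `quality` column.
    Images with face-detection quality below 0.70 are dropped.
    """
    facepp = facepp.copy()
    facepp[EMOTIONS] = facepp[EMOTIONS] / 100.0
    facereader = facereader.loc[facereader["quality"] >= 0.70].copy()
    return facepp, facereader
```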

2.3. Human Emotion Recognition

As a benchmark for the FER results, we collected emotion recognition data from humans, who each rated a random subsample of up to 127 of the 2,633 images in an online study. Participants who rated fewer than 20 images were excluded from further analyses (17 participants rated between 20 and 126 pictures). This results in 101 participants (58 female, 42 male, 1 diverse, M age = 29.2, SD age = 9.1) who rated on average 116.1 (SD = 28.1) images. Twenty-five images were randomly not rated by any participant (<1%). Participants were instructed to classify each facial expression as neutral, joy, sadness, anger, disgust, fear, surprise, or another emotion; multiple choices were possible. In addition, the perceived genuineness of the expressed emotion was rated on a 7-point Likert scale (1 = very in-genuine, 7 = very genuine). All ratings are averaged per image to improve comparability to the metrics provided by the FER tools. This results in percentages of emotion ratings and average genuineness values per image.

2.4. Analyses

First, we analyze the human raters' scores for perceived genuineness and emotion classification as a manipulation check for the two data sets of facial expressions. Differences between the genuineness of non-standardized vs. standardized facial expressions are tested statistically for all images as well as separately for all emotion categories utilizing independent t-tests. Correspondingly, we analyze the human emotion recognition data to provide a benchmark for the FER comparison. Again we statistically test for differences between non-standardized vs. standardized facial expressions for all emotion categories utilizing independent t-tests. In addition, we calculate one-sample t-tests against zero to estimate patterns of misclassification within human emotion recognition. Cohen's d is reported for all t-tests.
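A minimal SciPy sketch of the independent t-test with a pooled-SD Cohen's d, as used throughout these comparisons, is given below (the published analyses were run in R; this Python version is only illustrative).

```python
import numpy as np
from scipy import stats


def independent_t_with_d(group_a, group_b):
    """Independent-samples t-test plus Cohen's d (pooled SD), mirroring the
    standardized vs. non-standardized comparisons described in Sect. 2.4."""
    a, b = np.asarray(group_a, float), np.asarray(group_b, float)
    t, p = stats.ttest_ind(a, b)
    pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1) +
                         (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2))
    d = (a.mean() - b.mean()) / pooled_sd
    return t, p, d
```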

Second, we test the performance of face detection. As described above, FER is a two step process of first face detection and second emotion classification. To test performances on face detection, we check for how many images a specific tool gives no result (drop out rate).

Third, we calculate several indices of emotion classification (i.e., accuracy, sensitivity, and precision) for the three FER tools to report performance differences descriptively. In order to evaluate emotion classification, each algorithm's output is compared to the original coding of the intended emotional facial expression category (i.e., ground truth). The different tools return values for each emotion category; we define the category with the highest certainty as the chosen one, corresponding to a winner-takes-all principle. A general indication of FER performance is the accuracy, which is the share of correctly identified images out of all images where a face is processed (thus excluding drop out). Further measures to evaluate emotion classification are category-specific sensitivity and precision. Sensitivity describes the share of correctly predicted images out of all images truly in the respective category; it is a measure of how well the tool does in detecting a certain category. Precision is the share of correctly predicted images out of all images predicted as one category; in other words, precision is a measure of how much we can trust the categorization of the tool. In order to identify patterns of classification, we additionally build confusion matrices between the FER measurement and the true categories.
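The winner-takes-all scoring and the per-category indices can be expressed compactly with scikit-learn, as in the sketch below; the original analyses were run in R (with caret), so this Python version only illustrates the metrics, not the authors' code.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

EMOTIONS = ["neutral", "joy", "sadness", "anger", "disgust", "fear", "surprise"]


def classification_report_wta(true_labels, probability_matrix):
    """Winner-takes-all evaluation as described in Sect. 2.4.

    `probability_matrix` is an (n_images x 7) array of per-category scores
    from one FER tool (drop-out images already removed); `true_labels` are
    integer indices into EMOTIONS.
    """
    labels = list(range(len(EMOTIONS)))
    predicted = np.argmax(probability_matrix, axis=1)   # winner takes all
    acc = accuracy_score(true_labels, predicted)
    sens = recall_score(true_labels, predicted, labels=labels,
                        average=None, zero_division=0)
    prec = precision_score(true_labels, predicted, labels=labels,
                           average=None, zero_division=0)
    cm = confusion_matrix(true_labels, predicted, labels=labels)
    # Row-normalize to percentages of the true category, as in Figure 2.
    cm_pct = 100 * cm / cm.sum(axis=1, keepdims=True)
    return acc, dict(zip(EMOTIONS, sens)), dict(zip(EMOTIONS, prec)), cm_pct
```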

Fourth, we report differences in emotion recognition performance between the three systems and the human data with Receiver Operating Characteristic (ROC) analysis and statistical testing of the corresponding Area Under the Curve (AUC). ROC analysis is originally a two-class classification strategy. In order to apply the ROC rationale to a multi-class classification, we consider each probability given to a category as one observation; in other words, each image contributes seven observations for each tool. The ROC curve plots a true positive share against a false positive share for varying probability thresholds above which a category is considered correct. A good classifier gives low probabilities to wrong classifications and high probabilities to correct classifications. This is measured by the AUC; better classifiers give larger AUCs. We compare the AUCs of the different algorithms pairwise, using a bootstrapping method with 2,000 draws ( Robin et al., 2011 ).
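A sketch of this flattened one-vs-rest ROC computation is shown below; it omits the pairwise bootstrap comparison of AUCs, which the paper performs with the pROC package in R.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.preprocessing import label_binarize


def flattened_roc_auc(true_labels, probability_matrix, n_classes=7):
    """Multi-class ROC handling as described in Sect. 2.4: every category
    probability of every image is treated as one binary observation
    (seven observations per image), then a single ROC curve and AUC are
    computed over the flattened data."""
    y_bin = label_binarize(true_labels, classes=list(range(n_classes)))
    y_flat = y_bin.ravel()
    p_flat = np.asarray(probability_matrix).ravel()
    fpr, tpr, _ = roc_curve(y_flat, p_flat)
    auc = roc_auc_score(y_flat, p_flat)
    return fpr, tpr, auc
```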

Analyses are conducted in R ( R Core Team, 2019 ), using the following packages (alphabetical order): caret ( Kuhn, 2020 ), data.table ( Dowle and Srinivasan, 2020 ), dplyr ( Wickham et al., 2020 ), extrafont ( Chang, 2014 ), ggplot2 ( Wickham, 2016 ), httr ( Wickham, 2020 ), jsonlite ( Ooms, 2014 ), patchwork ( Pedersen, 2020 ), plotROC ( Sachs, 2017 ), pROC ( Robin et al., 2011 ), purrr ( Henry and Wickham, 2020 ), RColorBrewer ( Neuwirth, 2014 ), stringr ( Wickham, 2019 ), tidyverse ( Wickham et al., 2019 ).

3.1. Human Raters: Genuineness of Facial Expressions

We test for differences between standardized and non-standardized facial expression inventories regarding their perceived genuineness (see Figure 1A ). Analysis shows that the non-standardized facial expressions are perceived as much more genuine compared to the standardized facial expressions [standardized inventories: M = 4.00, SD = 1.43; non-standardized inventory: M = 5.64, SD = 0.79; t (2606) = 36.58, p < 0.001, d = 1.44]. In particular, non-standardized facial expressions are rated as more genuine for anger, t (426) = 27.97, p < 0.001, d = 2.75, sadness, t (418) = 25.55, p < 0.001, d = 2.43, fear, t (317) = 21.10, p < 0.001, d = 2.38, disgust, t (263) = 18.10, p < 0.001, d = 2.36, surprise, t (322) = 16.02, p < 0.001, d = 1.79, and joy, t (441) = 5.58, p < 0.001, d = 0.54, whereas among the standardized inventories neutral facial expressions are rated more genuine, t (407) = 2.36, p = 0.019, d = 0.24. These results support the validity of the selection of image test data—the standardized facial expressions are perceived less genuine compared to the non-standardized facial expressions.


Figure 1 . Averaged human ratings separately for basic emotion categories for standardized (black bars) and non-standardized facial expressions (gray bars). (A) Depicts mean genuineness ratings ranging from 1 (very in-genuine) to 7 (very genuine). (B–H) Depict mean emotion ratings (percent) for (B) neutral, (C) joy, (D) anger, (E) disgust, (F) sadness, (G) fear, and (H) surprise expressions. Error bars are 95% confidence intervals.

3.2. Human Raters: Emotion Recognition

Next, we analyze the human emotion ratings (see Figures 1B–H). Comparisons against zero show that for most emotion categories, classifications are highest for the correct category. The only exception is non-standardized disgust faces, which are more often categorized as angry, t(87) = 7.99, p < 0.001, d = 0.85, than disgusted, t(87) = 4.40, p < 0.001, d = 0.47. In addition, fearful faces are also misclassified (or at least co-classified) as surprise for standardized, t(175) = 18.22, p < 0.001, d = 1.37, and non-standardized facial expressions, t(142) = 10.69, p < 0.001, d = 0.89. A comparison between standardized and non-standardized data reveals a strong increase in neutral ratings for non-standardized emotion categories [disgust: t(263) = 15.03, p < 0.001, d = 1.96; surprise: t(322) = 14.33, p < 0.001, d = 1.60; fear: t(317) = 9.54, p < 0.001, d = 1.07; sadness: t(418) = 9.01, p < 0.001, d = 0.89; anger: t(426) = 7.96, p < 0.001, d = 0.78; joy: t(441) = 4.26, p < 0.001, d = 0.41]. Correspondingly, non-standardized facial expressions show a strong decrease in the correct emotion category compared to standardized facial expressions for some categories [disgust: t(263) = 24.63, p < 0.001, d = 3.21; surprise: t(322) = 14.35, p < 0.001, d = 1.60; sadness: t(418) = 10.28, p < 0.001, d = 1.02; neutral: t(407) = 8.99, p < 0.001, d = 0.90; anger: t(426) = 8.03, p < 0.001, d = 0.79; joy: t(441) = 5.83, p < 0.001, d = 0.57; fear: t(317) = 3.79, p < 0.001, d = 0.43]. Taken together, non-standardized compared to standardized facial expressions are perceived more often as neutral and as less emotionally intense on average.

3.3. FER Systems: Drop Out

To evaluate the step of face detection, we report drop out rates separately for each FER tool in Table 1. Drop out for the standardized data is nearly non-existent; however, strong differences can be reported for the non-standardized data set. Azure returns no face detection for around 20% of the images. For FaceReader, the drop out is even higher, at 74%. This result partially confirms our expectations, as for Azure and FaceReader the drop out in the non-standardized data is much higher than among the standardized data. In contrast, Face++ shows superior face detection with nearly no drop out for the non-standardized data. See Supplementary Table 1 for a statistical comparison of the drop out rates.

3.4. FER Systems: Emotion Recognition

To descriptively compare classification performance, we report accuracies for each tool on each data set, along with category-specific sensitivity and precision (Table 2). Details on the statistical comparisons can be found in Supplementary Table 2. As expected, accuracy is better for all tools on the standardized data. FaceReader performs best, with 97% of the images classified correctly. The difference to both Azure and Face++ is significant (p < 0.001). Azure and Face++ perform similarly, p = 0.148; both put around 80% of the images in the correct category. For the non-standardized data, accuracy is much lower. Azure performs best, still correctly classifying 56% of the images. FaceReader and Face++ both correctly classify only about one third of the non-standardized images, which constitutes a significant decrease in accuracy compared to Azure (p < 0.001).


Table 2 . Sensitivity, precision, and accuracy of Azure, Face++, and FaceReader separately for emotion categories.

Looking at the specific emotion categories and their performance indices, joy expressions are classified best. For the standardized data, sensitivity and precision are all at or near 1. For the non-standardized data, the joy category is also classified best; however, Azure is the only software with overall acceptable performance. In the standardized angry category, all tools show high precision; however, Azure and Face++ lack sensitivity. For the non-standardized angry category, only Azure's precision is acceptable; Face++ and FaceReader do not perform reliably. Performance on the other categories of the standardized data follows a similar pattern: FaceReader clearly outperforms the other tools. In contrast, for the non-standardized facial expressions, Azure performs best, although the values are substantially decreased in comparison to standardized facial expressions.
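To make the reported performance indices concrete, the short sketch below computes per-category sensitivity (recall) and precision from predicted and true emotion labels. It is only an illustration in Python/scikit-learn with randomly generated placeholder labels, not the analysis pipeline behind Table 2.

```python
import numpy as np
from sklearn.metrics import classification_report

# Placeholder labels so the snippet runs stand-alone; in practice y_true holds the
# intended emotion categories and y_pred the highest-rated category per image.
rng = np.random.default_rng(0)
categories = ["neutral", "joy", "anger", "disgust", "sadness", "fear", "surprise"]
y_true = rng.integers(0, len(categories), size=400)
y_pred = rng.integers(0, len(categories), size=400)

# "recall" in the report corresponds to sensitivity; precision is reported per category.
print(classification_report(y_true, y_pred,
                            labels=list(range(len(categories))),
                            target_names=categories, zero_division=0))
```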

To study confusion rates between categories, Figure 2 depicts confusion matrices between the true labels and the highest rated emotion by each software. In the standardized data, all three tools show the pattern of classifying fearful expressions as surprise or sadness. The confusion between fear and surprise is expected, whereas the confusion of fear with sadness is new. Additionally, Azure and Face++ show a tendency to misclassify anger, sadness and fear as neutral. For FaceReader, this tendency is observable to a smaller extent. This partially reflects the expected tendency toward a neutral expression. In the non-standardized data set, all applications show a pronounced tendency toward the neutral category. Additionally, Face++ shows a trend toward surprise, sadness and fear. To a smaller extent, the misclassification as surprise and sadness is also problematic in Azure and FaceReader.


Figure 2 . Confusion matrices indicating classification performance on standardized (left panels) and non-standardized data (right panels): (A) standardized data by Azure, (B) non-standardized data by Azure, (C) standardized data by Face++, (D) non-standardized data by Face++, (E) standardized data by FaceReader and (F) non-standardized data by FaceReader. Numbers indicate percentages to the base of the true category. Reading example: From the standardized data, Azure classifies 4.5% of the truly fearful expressions as neutral, and 45.5% of the fearful images are classified correctly.

3.5. Humans vs. FER: Comparison of Emotion Recognition

To directly compare all sources of emotion recognition, we calculate ROC curves and report them in Figure 3 along with the corresponding AUCs. ROC curves for specific emotion categories are shown in Supplementary Figure 2 and corresponding statistical comparisons are reported in Supplementary Table 5 5 .
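As an illustration of how such overall and category-specific ROC/AUC values can be obtained, the sketch below computes one-vs-rest AUCs over multi-class scores. It is a generic Python/scikit-learn example with placeholder data and is not meant to reproduce the exact procedure behind Figure 3.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

# Placeholder classifier scores for seven emotion categories.
rng = np.random.default_rng(1)
n_classes = 7
y_true = rng.integers(0, n_classes, size=500)            # true emotion indices
y_score = rng.random((500, n_classes))
y_score /= y_score.sum(axis=1, keepdims=True)            # pseudo-probabilities per category

overall_auc = roc_auc_score(y_true, y_score, multi_class="ovr", average="macro")
per_class_auc = roc_auc_score(
    label_binarize(y_true, classes=list(range(n_classes))), y_score, average=None)
print(round(overall_auc, 3), np.round(per_class_auc, 3))
```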


Figure 3 . Classification performance depicted as Receiver Operating Characteristic (ROC) curves and corresponding Area under the Curve (AUC) for overall emotion recognition performance for the three FER systems (Azure, Face++, and FaceReader) and human raters, separately for (A) standardized facial expressions and (B) non-standardized facial expressions. The white diagonal line indicates classification performance by chance.

For the standardized facial expressions (see Figure 3A ), humans, overall, recognize them significantly better than Azure, p = 0.035, and Face++, p < 0.001. However, FaceReader performs significantly better than humans on such facial expressions, p < 0.001. While the same pattern holds true for fear faces (Azure: p = 0.003; FaceReader: p < 0.001, Face++: p < 0.001), all algorithms perform significantly better than humans for neutral (Azure: p < 0.001; FaceReader: p < 0.001, Face++: p < 0.001), joy (Azure: p = 0.023; FaceReader: p = 0.024, Face++: p = 0.027), and surprise expressions (Azure: p = 0.012; FaceReader: p = 0.012, Face++: p = 0.013). Also, for standardized facial expressions of disgust, FaceReader, p = 0.002, and Face++, p = 0.023, perform better compared to humans while Azure is comparable to humans, p = 0.450. Regarding anger, FaceReader, and humans perform comparably, p = 0.353, and both outperform Azure and Face++, p < 0.001. Finally, FaceReader shows better classification of sad faces compared to Azure, p = 0.078, Face++, p < 0.001, and humans, p = 0.021.

For the non-standardized facial expressions (see Figure 3B ), humans overall show similar performance to Azure, p = 0.058, and both perform better than FaceReader, p < 0.001, and Face++, p < 0.001. While this pattern is the same for joy (Azure: p = 0.554; FaceReader: p < 0.001, Face++: p < 0.001) and sadness (Azure: p = 0.448; FaceReader: p < 0.001, Face++: p < 0.001), humans outperform all algorithms in the detection of anger (Azure: p < 0.001; FaceReader: p < 0.001, Face++: p < 0.001) and fear facial expressions (Azure: p < 0.001; FaceReader: p < 0.001, Face++: p < 0.001). In contrast, Azure performs better than humans regarding neutral, p < 0.001, and disgust faces, p < 0.001, while FaceReader (neutral: p < 0.001; disgust: p = 0.002) and Face++ (neutral: p = 0.001; disgust: p = 0.023) show equal or worse performance compared to humans. Finally, Azure, p = 0.006, and Face++, p < 0.001, perform better than humans in the detection of non-standardized surprise facial expressions, where FaceReader performs similarly to humans, p = 0.535.

Taken together, for most emotion categories there is at least one FER system that performs equally well or better compared to humans. The only exceptions are non-standardized expressions of fear and anger, where humans clearly outperform all FER systems. FaceReader shows particularly good performance for standardized facial expressions and Azure performs better on non-standardized facial expressions.

4. Discussion

In this paper, we evaluate and compare three widely used FER systems, namely Azure, Face++ and FaceReader, and human emotion recognition data. For the performance comparison, we use two different kinds of emotional facial expression data sets: First, a standardized data set composed of lab-generated images displaying intense, prototypical facial expressions of emotions under very good recording conditions (i.e., lighting, camera angle). Second, we test a non-standardized set, which contains facial expressions from movie scenes depicting emotional faces as an approximation for more naturalistic, spontaneous facial expressions ( Dhall et al., 2018 ). The non-standardized facial expressions constitute an especially difficult test case, since they contain large variation in the expressions themselves, the surrounding circumstances and the displayed persons' characteristics.

Overall, the three classifiers as well as humans perform well on standardized facial expressions. However, we observe large variation and a general decrease in performance for the non-standardized data, in line with previous work ( Yitzhak et al., 2017 ; Dupré et al., 2020 ). Although emotion recognition performance is generally lower for such facial expressions, FER tools perform similarly or better than humans for most emotion categories of non-standardized (except for anger and fear) and standardized facial expressions. Facial expressions of joy are detected best among the emotion categories in both standardized and non-standardized facial expressions, which also replicates existing findings ( Stöckli et al., 2018 ; Höfling et al., 2021 ). However, FER performance varies strongly between systems and emotion categories. Depending on the data and on which emotions one aims to classify, one algorithm might be better suited than another: Face++ shows almost no drop out in face detection even under the non-standardized condition, FaceReader shows excellent performance for standardized prototypical facial expressions and outperforms humans, and Azure shows the best overall performance on non-standardized facial expressions among all FER tools.

4.1. Implications for Application

From our data, we can derive three broad implications. First, all FER tools perform much better on the standardized, prototypical data than on the non-standardized, more naturalistic data. This might indicate overfitting to standardized data. Second, FER systems and human coders can detect some emotion categories better than others, resulting in asymmetries in classification performance between emotion categories. This indicates that the detection of certain emotional facial expressions is generally more error-prone than that of others. Third, we can identify performance problems that are specific to FER tools.

First, as expected, all FER systems perform better on the standardized compared to the non-standardized, more naturalistic facial expressions. This is the case for both face detection and emotion classification. Within the standardized data, face detection is near-perfect for all systems and shows almost no drop out based on face detection failures. Regarding the emotion classification, FaceReader outperforms Face++, Azure, and even human coders. Within the non-standardized data, face detection is problematic for Azure and FaceReader. Judging the classification performance on the non-standardized data set, all three classifiers show a large overall decrease in accuracy, with Azure being the most accurate compared to Face++ and FaceReader. In particular, all FER systems, and to a lesser degree human coders, misclassify emotional facial expressions as neutral facial expressions in the non-standardized data. This is an important observation not shown by Dupré et al. (2020 ), since they did not report confusions with the neutral category. We suspect the neutral classifications arise because the expressions in acted films are less intense than in standardized, lab-generated data. Hence, the vastly better performance on standardized, prototypical facial expressions, which were generated under controlled conditions, may indicate limitations of FER systems when faced with more naturalistic and more subtle emotional facial expressions.

Second, we observe that FER and human performance reflect varying underlying difficulties in the classification of different emotions. In other words, certain emotions are harder to detect than others, for example because of more subtle expressions or less distinct patterns. This follows from shared classification error patterns across the three algorithms, which correspond to prior research on other algorithms and on human recognition performance. In our data, joy is recognized best and fear is among the most difficult to classify, which is in line with prior FER ( Stöckli et al., 2018 ; Skiendziel et al., 2019 ; Dupré et al., 2020 ) and human emotion recognition research ( Nummenmaa and Calvo, 2015 ; Calvo and Nummenmaa, 2016 ). Anger has been found to be difficult to classify in some studies ( Stöckli et al., 2018 ; Dupré et al., 2020 ), but not in others ( Skiendziel et al., 2019 ). With regard to our findings, angry faces are classified with low sensitivity but high precision. Sadness and disgust are reported to be difficult to detect in other studies ( Lewinski et al., 2014 ; Skiendziel et al., 2019 ). Fear is regularly misclassified as surprise, as found in other studies with FER ( Stöckli et al., 2018 ; Skiendziel et al., 2019 ) and humans alike ( Palermo and Coltheart, 2004 ; Calvo and Lundqvist, 2008 ; Tottenham et al., 2009 ; Calvo et al., 2018 ). For the non-standardized data, FER performance on disgust is among the lowest for all classifiers, which corresponds to the human recognition data in the present study. In line with previous research, the pronounced performance drop for many non-standardized images compared to standardized emotion categories ( Yitzhak et al., 2017 ; Dupré et al., 2020 ) might indicate that the FER systems are not trained to detect the full variability of emotional facial expressions. Importantly, these results reflect that FER simulates human perception and shows similar classification errors.

Third, we make a series of observations that specific FER systems misclassify certain emotion categories in ways not shared by human coders. In our data, fear is also misclassified as sadness by Azure in standardized and non-standardized facial expressions. For the non-standardized data, we also report a general tendency to misclassify surprise expressions, which is not evident in other studies. Especially the misclassification toward surprise in the non-standardized data might be explained by mouths opened for speaking in movies, for which the applications do not account. In addition, Face++ misclassifies any emotion in the non-standardized data as fear and, to a lesser extent, as sadness. Regarding FaceReader, we observe a pronounced misclassification of naturalistic facial expressions as neutral. These findings indicate misclassification patterns specific to the three investigated FER systems, which possibly reflect differences in their machine-learning architectures, training material and validation procedures.

4.2. Limitations and Outlook

This study has some limitations. Most obviously, we compare three representative software systems rather than all systems available on the market. While we choose software that is widely used, other algorithms will need to be examined in a similar fashion. For example, Beringer et al. (2019) find that FACET shows a certain resilience to changes in lighting and camera angle on lab-generated data. Our study cannot tell whether this resilience transfers to an even harder task.

To approximate more naturalistic facial expressions, we utilize images from movie stills as the non-standardized data set. While this is convenient and the emotional expressions are already classified and evaluated, these images are of course also posed by actors. However, good acting is generally thought of as a realistic portrayal of true affect. Our ratings of genuineness appear to support our distinction between standardized and non-standardized facial expressions. In addition, our human recognition data provide further validation of the emotion categorization of this particular facial expression inventory. Even though acted portrayals of emotional facial expressions differ between prototypical inventories and movies, which is in line with previous research ( Carroll and Russell, 1997 ), these acted facial expressions are only approximations of true emotional expressions. Moreover, movie stimuli may be rated as more authentic compared to the prototypical data for many reasons, such as the variation in head orientations, lighting, backgrounds, and familiarity with the actors or movie plot. Hence, facial expressions of true emotion require an additional criterion of emotional responding, such as ratings of currently elicited emotions.

Furthermore, we argue that FER would be most useful in categorizing spontaneous and naturalistic facial expressions in different contexts. The SFEW data set serves as an approximation for this. However, it is unclear whether the displayed emotional facial expressions are grounded in emotional processing or merely simulated. For example, Höfling et al. (2020 ) elicited spontaneous emotional responses by presenting emotional scenes to their participants and found that FER detects changes in facial expressions only for pleasant emotional material. Hence, more data sets are needed to test different naturalistic settings and foster development in this area.

Beyond the bias in FER toward prototypical expressions under good conditions, there are other sources of systemic error that we did not address, such as biases against race, gender, age, or culture ( Zou and Schiebinger, 2018 ; Aggarwal et al., 2019 ; Wellner and Rothman, 2020 ). For example, it has been shown that automated facial analysis to classify gender works less well for people with a darker skin tone ( Buolamwini and Gebru, 2018 ). Many training data sets are concentrated on Northern America and Europe ( Shankar et al., 2017 ), which partially causes the biases and at the same time makes it difficult to detect them. Future research should take these variables into account to evaluate measurement fairness independent of specific person characteristics.

5. Conclusion

This study contributes to the literature by comparing the accuracy of three state-of-the-art FER systems to classify emotional facial expressions (i.e., FaceReader, Azure, Face++). We show that all systems and human coders perform well for standardized, prototypical facial expressions. When challenged with non-standardized images, used to approximate more naturalistic expressions collected outside of the lab, performance of all systems as well as human coders drops considerably. Reasons for this are substantial drop out rates and a decrease in classification accuracy specific to FER systems and emotion categories. With only a short history, FER is already a valid research tool for intense and prototypical emotional facial expressions. However, limitations are apparent in the detection of non-standardized facial expressions as they may be displayed in more naturalistic scenarios. Hence, further research is urgently needed to increase the potential of FER as a research tool for the classification of non-prototypical and more subtle facial expressions. While the technology is, thus, a promising candidate to assess emotional facial expressions on a non-contact basis, researchers are advised to interpret data from non-prototypical expressions in non-restrictive settings (e.g., strong head movement) carefully.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Ethics Statement

Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. The patients/participants provided their written informed consent to participate in this study.

Author Contributions

TK conceived and designed the study, contributed Face++ and Azure data, conducted the analysis and interpretation of the results, and also drafted the work. TH contributed FaceReader data and collected data from human raters. TH and GA contributed to the interpretation of the results and writing of the manuscript. All authors contributed to the article and approved the submitted version.

Funding

This publication was funded by the open access publication fund of the University of Konstanz.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

We would like to express our gratitude to those who commented on earlier versions of this paper, especially Susumu Shikano, Karsten Donnay, Sandra Morgenstern, Kiela Crabtree, Sarah Shugars, and the participants of the Image Processing for Political Research workshop at APSA 2020. We also would like to thank Timo Kienzler for technical support. We thank the two reviewers for their constructive suggestions, which further improved the paper.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyg.2021.627561/full#supplementary-material

1. ^ We also test different thresholds, but there is no reasonable performance improvement to be gained (see Supplementary Figure 1 ).

2. ^ Since this procedure leads to different samples for each algorithm, especially among the non-standardized data, we also compute the analysis for the subsample of non-standard images, which are recognized by all algorithms. The results are reported in Supplementary Table 3 . Differences are minor, qualitatively the results remain the same.

3. ^ Twenty percent of the images have a quality that is too low for FaceReader to reliably detect emotions, and we therefore exclude these from the analysis; in a further 54%, no face is found by FaceReader.

4. ^ Since drop out rates differ strongly between the algorithms, especially among the naturalistic data, we also compute the analysis for the subset of naturalistic images, which are recognized by all algorithms. Differences are minor with corresponding patterns (see also Supplementary Table 4 ). Additionally, we report the shares of correctly identified images based on all images in Supplementary Table 3 .

5. ^ AUC analysis for the subset of non-standardized data passed by all algorithms yields the same results.

Abdullah, S., Murnane, E. L., Costa, J. M. R., and Choudhury, T. (2015). “Collective smile: measuring societal happiness from geolocated images,” in CSCW '15: Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, Tallinn, Estonia. 361–374. doi: 10.1145/2675133.2675186


Aggarwal, A., Lohia, P., Nagar, S., Dey, K., and Saha, D. (2019). “Black box fairness testing of machine learning models,” in ESEC/FSE 2019: Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering , Tallinn, Estonia. 625–635. doi: 10.1145/3338906.3338937

Arriaga, O., Valdenegro-Toro, M., and Plöger, P. (2017). Real-time convolutional neural networks for emotion and gender classification. CoRR, abs/1710.07557 .


Barrett, L. F., Adolphs, R., Marsella, S., Martinez, A. M., and Pollak, S. D. (2019). Emotional expressions reconsidered: challenges to inferring emotion from human facial movements. Psychol. Sci. Publ. Interest 20, 1–68. doi: 10.1177/1529100619832930


Bartkiene, E., Steibliene, V., Adomaitiene, V., Juodeikiene, G., Cernauskas, D., Lele, V., et al. (2019). Factors affecting consumer food preferences: food taste and depression-based evoked emotional expressions with the use of face reading technology. BioMed Res. Int . 2019:2097415. doi: 10.1155/2019/2097415

Bartlett, M., Hager, J. C., Ekman, P., and Sejnowski, T. J. (1999). Measuring facial expressions by computer image analysis. Psychophysiology 36, 253–263. doi: 10.1017/S0048577299971664

Beringer, M., Spohn, F., Hildebrandt, A., Wacker, J., and Recio, G. (2019). Reliability and validity of machine vision for the assessment of facial expressions. Cogn. Syst. Res . 56, 119–132. doi: 10.1016/j.cogsys.2019.03.009

Boxell, L. (2018). Slanted Images: Measuring Nonverbal Media Bias. Munich Personal RePEc Archive Paper No. 89047 . Munich: Ludwig Maximilian University of Munich.

Brader, T. (2005). Striking a responsive chord: how political ads motivate and persuade voters by appealing to emotions. Am. J. Polit. Sci . 49, 388–405. doi: 10.1111/j.0092-5853.2005.00130.x

Buolamwini, J., and Gebru, T. (2018). Gender shades: intersectional accuracy disparities in commercial gender classification. Proc. Mach. Learn. Res . 81, 1–15.

Calvo, M., and Lundqvist, D. (2008). Facial expressions of emotion (kdef): identification under different display-duration conditions. Behav. Res. Methods 40, 109–115. doi: 10.3758/BRM.40.1.109

Calvo, M. G., Fernández-Martín, A., Recio, G., and Lundqvist, D. (2018). Human observers and automated assessment of dynamic emotional facial expressions: Kdef-dyn database validation. Front. Psychol . 9:2052. doi: 10.3389/fpsyg.2018.02052

Calvo, M. G., and Nummenmaa, L. (2016). Perceptual and affective mechanisms in facial expression recognition: an integrative review. Cogn. Emot . 30, 1081–1106. doi: 10.1080/02699931.2015.1049124

Carroll, J. M., and Russell, J. A. (1997). Facial expressions in hollywood's portrayal of emotion. J. Pers. Soc. Psychol . 72, 164–176. doi: 10.1037/0022-3514.72.1.164

Chang, W. (2014). extrafont: Tools for Using Fonts . R package version 0.17.

Clore, G. L., Gasper, K., and Garvin, E. (2001). “Affect as information,” in Handbook of Affect and Social Cognition , ed J. Forgas (Mahwah, New Yersey:Psychology Press), 121–144.

Cohn, J. F., Ambadar, Z., and Ekman, P. (2007). “Observer-based measurement of facial expression with the facial action coding system,” in Handbook of Emotion Elicitation and Assessment , eds J. Coan and J. Allen (Oxford:Oxford University Press), 222–238.

Dhall, A., Goecke, R., Lucey, S., and Gedeon, T. (2012). Collecting large, richly annotated facial-expression database from movies. IEEE MultiMed . 19, 34–41. doi: 10.1109/MMUL.2012.26

Dhall, A., Kaur, A., Goecke, R., and Gedeon, T. (2018). “Emotiw 2018: audio-video, student engagement and group-level affect prediction,” in ICMI' 18 Boulder, CO, 653–656. doi: 10.1145/3242969.3264993

Dowle, M., and Srinivasan, A. (2020). data.table: Extension of 'data.frame' . R Package Version 1.13.2.

Dupré, D., Krumhuber, E. G., Küster, D., and McKeown, G. J. (2020). A performance comparison of eight commercially available automatic classifiers for facial affect recognition. PLoS ONE 15:e231968. doi: 10.1371/journal.pone.0231968

Ekman, P., and Rosenberg, E. L. (1997). What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (FACS) . New York, NY:Oxford University Press.

Fraser, K., Ma, I., Teteris, E., Baxter, H., Wright, B., and McLaughlin, K. (2012). Emotion, cognitive load and learning outcomes during simulation training. Med. Educ . 46, 1055–1026. doi: 10.1111/j.1365-2923.2012.04355.x

Goodfellow, I. J., Erhan, D., Carrier, P., Courville, A., Mirza, M., Hamner, B., et al. (2015). Challenges in representation learning: a report on three machine learning contests. Neural Netw . 64, 59–63. doi: 10.1016/j.neunet.2014.09.005

Haim, M., and Jungblut, M. (2020). Politicians' self-depiction and their news portrayal: Evidence from 28 countries using visual computational analysis. Polit. Commun . doi: 10.1080/10584609.2020.1753869

Henry, L., and Wickham, H. (2020). purrr: Functional Programming Tools . R package version 0.3.4.

Höfling, T. T. A., Alpers, G. W, Gerdes, A. B. M., and Föhl, U. (2021). Automatic facial coding versus electromyography of mimicked, passive, and inhibited facial response to emotional faces. Cogn. Emot . doi: 10.1080/02699931.2021.1902786

Höfling, T. T. A., Gerdes, A. B. M., Föhl, U., and Alpers, G. W. (2020). Read my face: automatic facial coding versus psychophysiological indicators of emotional valence and arousal. Front. Psychol . doi: 10.3389/fpsyg.2020.01388

Jin, B., Qu, Y., Zhang, L., and Gao, Z. (2020). Diagnosing Parkinson disease through facial expression recognition: video analysis. J. Med. Intern. Res . 22:e18697. doi: 10.2196/18697

Keltner, D., and Cordaro, D. T. (2017). “Understanding multimodal emotional expressions,” in The Science of Facial Expression , eds J. Russel and J. Fernandez Dols (New York, NY:Oxford University Press), 57–76. doi: 10.1093/acprof:oso/9780190613501.003.0004

Kuhn, M. (2020). caret: Classification and Regression Training . R package version 6.0-86.

Kulke, L., Feyerabend, D., and Schacht, A. (2020). A comparison of the affectiva imotions facial expression analysis software with EMG for identifying facial expressions of emotion. Front. Psychol . 11:329. doi: 10.3389/fpsyg.2020.00329

Langer, O., Dotsch, R., Bijlstra, G., Wigboldus, D. H. J., Hawk, S. T., and van Knippenberg, A. (2010). Presentation and validation of the radboud faces database. Cogn. Emot . 24, 1377–1388. doi: 10.1080/02699930903485076

Lerner, J. S., and Keltner, D. (2000). Beyond valence: toward a model of emotion-specific influences on judgement and choice. Cogn. Emot . 14, 473–493. doi: 10.1080/026999300402763

Lewinski, P. (2015). Automated facial coding software outperforms people in recognizing neutral faces as neutral from standardized datasets. Front. Psychol . 6:1386. doi: 10.3389/fpsyg.2015.01386

Lewinski, P., den Uyl, T., and Butler, C. (2014). Automated facial coding: validation of basic emotions and facs aus in facereader. J. Neurosci. Psychol. Econ . 7, 227–236. doi: 10.1037/npe0000028

Lundqvist, D., Flykt, A., and Öhman, A. (1998). The Karolinska Directed Emotional Faces - KDEF. CD ROM from Department of Clinical Neuroscience, Psychology section, Karolinska Institutet. ISBN 91-630-7164-9. doi: 10.1037/t27732-000

Marcus, G. E. (2000). Emotions in politics. Annu. Rev. Polit. Sci . 3, 221–250. doi: 10.1146/annurev.polisci.3.1.221

Marcus, G. E., Neuman, W. R., and MacKuen, M. (2000). Affective Intelligence and Political Judgement . The University of Chicago Press.

Mavadati, M. S., Mahoor, M. H., Bartlett, K., Trinh, P., and Cohn, J. F. (2013). Disfa: A spontaneous facial action intensity database. IEEE Trans. Affect. Comput . 4, 151–160. doi: 10.1109/T-AFFC.2013.4

Meffert, M. F., Chung, S., Joiner, A. J., Waks, L., and Garst, J. (2006). The effects of negativity and motivated information processing during a political campaign. J. Commun . 56, 27–51. doi: 10.1111/j.1460-2466.2006.00003.x

Neuwirth, E. (2014). RColorBrewer: ColorBrewer Palettes . Long Beach, CA:R package version 1.1-2.

Nummenmaa, L., and Calvo, M. G. (2015). Dissociation between recognition and detection advantage for facial expressions: a meta-analysis. Emotion 15, 243–256. doi: 10.1037/emo0000042

Olszanowski, M., Pochwatko, G., Kuklinski, K., Scibor-Rylski, M., Lewinski, P., and Ohme, R. (2015). Warsaw set of emotional facial expression pictures: a validation study of facial display photographs. Front. Psychol . 5:1516. doi: 10.3389/fpsyg.2014.01516

Ooms, J. (2014). The jsonlite package: a practical and consistent mapping between JSON data and R objects. arXIv [Preprint] arXiv: 1403.2805 [stat.CO].

Palermo, R., and Coltheart, M. (2004). Photographs of facial expression: accuracy, response times, and ratings of intensity. Behav. Res. Methods Instrum. Comput . 36, 634–638. doi: 10.3758/BF03206544

Pedersen, T. (2020). patchwork: The Composer of Plots . R package version 1.1.0.

Peng, Y. (2018). Same candidates, different faces: uncovering media bias in visual portrayals of presidential candidates with computer vision. J. Commun . 65, 920–941. doi: 10.1093/joc/jqy041

Pittig, A., Schulz, A. R., Craske, M. G., and Alpers, G. W. (2014). Acquisition of behavioral avoidance: task-irrelevant conditioned stimuli trigger costly decisions. J. Abnorm. Psychol . 123, 314–329. doi: 10.1037/a0036136

Quinn, M. A., Sivesind, G., and Reis, G. (2017). Real-time Emotion Recognition From Facial Expressions . Available online at: http://cs229.stanford.edu/proj2017/final-reports/5243420.pdf

R Core Team (2019). R: A Language and Environment for Statistical Computing . Vienna: R Foundation for Statistical Computing.

Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J., et al. (2011). proc: an open-source package for r and s+ to analyze and compare roc curves. BMC Bioinformatics 12:77. doi: 10.1186/1471-2105-12-77

Sachs, M. C. (2017). plotROC: A tool for plotting roc curves. J. Stat. Softw. Code Snipp . 79, 1–19. doi: 10.18637/jss.v079.c02

Sato, W., Hyniewska, S., Minemoto, K., and Yoshikawa, S. (2019). Facial expressions of basic emotions in Japanese laypeople. Front. Psychol . 10:259. doi: 10.3389/fpsyg.2019.00259

Scherer, K., and Ellgring, H. (2007). Multimodal expression of emotion: Affect programs or componential appraisal patterns? Emotion 7, 158–171. doi: 10.1037/1528-3542.7.1.158

Shankar, S., Halpern, Y., Breck, E., Atwood, J., Wilson, J., and Sculley, D. (2017). “No classification without representation: assessing geodiversity issues in open data sets for the developing world,” in 31st Conference on Neural Information Processing Systems (NIPS 2017) .

Skiendziel, T., Rösch, A. G., and Schultheiss, O. C. (2019). Assessing the convergent validity between the automated emotion recognition software noldus facereader 7 and facial action coding system scoring. PLoS ONE . 14:e0223905. doi: 10.1371/journal.pone.0223905

Slovic, P., Finucane, M., Peters, E., and MacGregor, D. G. (2007). The affect heuristic. Eur. J. Oper. Res . 177, 1333–1352. doi: 10.1016/j.ejor.2005.04.006

Soroka, S., and McAdams, S. (2015). News, politics and negativity. Polit. Commun . 32, 1–22. doi: 10.1080/10584609.2014.881942

Stöckli, S., Schulte-Mecklenbeck, M., Borer, S., and Samson, A. C. (2018). Facial expression analysis with affdex and facet: a validation study. Behav. Res. Methods 50, 1446–1460. doi: 10.3758/s13428-017-0996-1

Sullivan, D. G., and Masters, R. D. (1988). “happy warriors”: Leaders' facial displays, viewers' emotions, and political support. Am. J. Polit. Sci . 32, 345–368. doi: 10.2307/2111127

Teixeira, T., Picard, R., and el Kaliouby, R. (2014). Why, when, and how much to entertain consumers in advertisements? A web-based facial tracking field study. Market. Sci . 33, 809–827. doi: 10.1287/mksc.2014.0854

Tian, Y.-l, Kanade, T., and Cohn, J. F. (2001). Recognizing action units for facial expression analysis. IEEE Trans. Pattern Anal. Mach. Intell . 23, 97–115. doi: 10.1109/34.908962

Tottenham, N., Tanaka, J., Leon, A., McCarry, T., Nurse, M., Hare, T., et al. (2009). The nimstim set of facial expressions: judgments from untrained research participants. Psychiatry Res . 168, 242–249. doi: 10.1016/j.psychres.2008.05.006

Van der Schalk, J., Hawk, S., Fischer, A., and Doosja, B. (2011). Moving faces, looking places: validation of the Amsterdam dynamic facial expression set (ADFES). Emotion 11, 907–920. doi: 10.1037/a0023853

Wellner, G., and Rothman, T. (2020). Feminist AI: can we expect our ai systems to become feminist? Philos. Technol . 33, 191–205. doi: 10.1007/s13347-019-00352-z

Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis . New York, NY: Springer-Verlag. doi: 10.1007/978-3-319-24277-4_9

Wickham, H. (2019). stringr: Simple, Consistent Wrappers for Common String Operations . R package version 1.4.0.

Wickham, H. (2020). httr: Tools for Working With URLs and HTTP . R package version 1.4.2.

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., et al. (2019). Welcome to the tidyverse. J. Open Source Softw . 4:1686. doi: 10.21105/joss.01686

Wickham, H., François, R., Lionel, H., and Müller, K. (2020). dplyr: A Grammar of Data Manipulation . R package version 1.0.2.

Wu, F., Lin, S., Cao, X., Zhong, H., and Zhang, J. (2019). “Head design and optimization of an emotionally interactive robot for the treatment of autism,” in CACRE2019: Proceedings of the 2019 4th International Conference on Automation, Control and Robotics Engineering , 1–10. doi: 10.1145/3351917.3351992

Yazdavar, A., Mahdavinejad, M., Bajaj, G., Romine, W., Sheth, A., Monadjemi, A., et al. (2020). Multimodal mental health analysis in social media. PLoS ONE 15:e0226248. doi: 10.1371/journal.pone.0226248

Yitzhak, N., Giladi, N., Gurevich, T., Messinger, D. S., Prince, E. B., Martin, K., et al. (2017). Gently does it: humans outperform a software classifier in recognizing subtle, nonstereotypical facial expressions. Emotion 17, 1187–1198. doi: 10.1037/emo0000287

Zhang, X., Yin, L., Cohn, J. F., Canavan, S., Reale, M., Horowitz, A., et al. (2014). BP4d-spontaneous: a high-resolution spontaneous 3D dynamic facial expression database. Image Vis. Comput . 32, 692–706. doi: 10.1016/j.imavis.2014.06.002

Zou, J., and Schiebinger, L. (2018). Ai can be sexist and racist-it's time to make it fair. Nature 559, 324–326. doi: 10.1038/d41586-018-05707-8

Keywords: recognition of emotional facial expressions, software evaluation, human emotion recognition, standardized inventories, naturalistic expressions, automatic facial coding, facial expression recognition, specific emotions

Citation: Küntzler T, Höfling TTA and Alpers GW (2021) Automatic Facial Expression Recognition in Standardized and Non-standardized Emotional Expressions. Front. Psychol. 12:627561. doi: 10.3389/fpsyg.2021.627561

Received: 11 November 2020; Accepted: 11 March 2021; Published: 05 May 2021.

Copyright © 2021 Küntzler, Höfling and Alpers. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Theresa Küntzler, theresa.kuentzler@uni-konstanz.de

† These authors share first authorship

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.


  • Open access
  • Published: 24 May 2023

A study on computer vision for facial emotion recognition

  • Zi-Yu Huang 1 ,
  • Chia-Chin Chiang 1 ,
  • Jian-Hao Chen 2 ,
  • Yi-Chian Chen 3 ,
  • Hsin-Lung Chung 1 ,
  • Yu-Ping Cai 4 &
  • Hsiu-Chuan Hsu 2 , 5  

Scientific Reports volume  13 , Article number:  8425 ( 2023 ) Cite this article

12k Accesses

9 Citations

2 Altmetric

Metrics details

  • Health care
  • Health occupations

Artificial intelligence has been successfully applied in various fields, one of which is computer vision. In this study, a deep neural network (DNN) was adopted for facial emotion recognition (FER). One of the objectives of this study is to identify the critical facial features on which the DNN model focuses for FER. In particular, we utilized a convolutional neural network (CNN), a combination of the squeeze-and-excitation network and the residual neural network, for the task of FER. We utilized AffectNet and the Real-World Affective Faces Database (RAF-DB) as the facial expression databases that provide learning samples for the CNN. The feature maps were extracted from the residual blocks for further analysis. Our analysis shows that the features around the nose and mouth are critical facial landmarks for the neural networks. Cross-database validations were conducted between the databases. The network model trained on AffectNet achieved 77.37% accuracy when validated on the RAF-DB, while the network model pretrained on AffectNet and then transfer-learned on the RAF-DB achieved a validation accuracy of 83.37%. The outcomes of this study would improve the understanding of neural networks and assist with improving computer vision accuracy.


Introduction

In human communications, facial expressions contain critical nonverbal information that can provide additional clues and meanings to verbal communication 1 . Some studies have suggested that 60–80% of communication is nonverbal 2 . This nonverbal information includes facial expressions, eye contact, tone of voice, hand gestures and physical distancing. In particular, facial expression analysis has become a popular research topic 3 . Facial emotion recognition (FER) has been applied in the field of human–computer interaction (HCI) in areas such as automated driving, education, medical treatment, psychological treatment 4 , surveillance and psychological analysis in computer vision 5 , 6 .

In psychology and computer vision, emotions are classified according to categorical or dimensional (valence and arousal) models 7 , 8 , 9 . In the categorical model, Ekman et al . 7 defined the basic human emotions as happiness, anger, disgust, fear, sadness, and surprise. In the dimensional model, emotion is evaluated on continuous numerical scales to determine valence and arousal. FER is an important task in computer vision that has numerous practical applications, and the number of studies on FER has increased in recent years 10 , 11 , 12 , 13 , benefiting from the advances provided by deep neural networks. In particular, convolutional neural networks (CNNs) have attained excellent results in terms of extracting features. For example, He et al . 14 proposed the residual neural network (ResNet) architecture in 2015, which added residual learning to a CNN to resolve the issues of vanishing gradients and the decreasing accuracy of deep networks.

Several authors have applied neural network models to classify emotions according to categorical models 15 , 16 , 17 , 18 , 19 , 20 , 21 , 22 , 23 and dimensional models 15 , 23 , 24 , 25 , 26 . Huang 27 applied a residual block architecture to a VGG CNN to perform emotion recognition and obtained improved accuracy. Mao et al . 28 proposed a new FER model called POSTER V2, which aims to improve the performance of the state-of-the-art technique and reduce the required computational cost by introducing a window-based cross-attention mechanism and multi-scale features of facial landmarks. To incorporate more information into the automatic emotion recognition process, some recent studies have fused several modalities, such as the temporal, audio and visual modalities 10 , 17 , 18 , 23 , 25 , into the algorithm. Moreover, attention mechanisms have been adopted in several studies 17 , 18 , 19 , 20 , 22 , 25 for FER tasks. Zhang et al . 19 applied class activation mapping to analyze the attention maps learned by their model. It was found that the model could be regularized by flipping its attention map and randomly erasing part of the input images. Wang et al. 22 introduced an attention branch to learn a face mask that highlights the discriminative parts for FER. These studies show that attention mechanisms play a critical role in FER. Several approaches for FER utilize self-attention mechanisms to capture both local and global contexts through a set of convolutional layers for feature extraction 29 , 30 , 31 . The extracted features are then used as the inputs of a relation attention module, which utilizes self-attention to capture the relationships between different patches and the context.

However, the practical deployment of facial recognition systems remains a challenging task as a result of the presence of noise, ambiguous annotations 32 , and complicated scenes in real-world settings 33 , 34 , 35 . Since attention modules have proven effective for computer vision tasks, applying attention modules to FER tasks is of great interest. Moreover, in psychology, the facial features that humans use for FER have been analyzed. The results presented by Beaudry et al . 35 suggest that the mouth is the major landmark when observing a happy emotion and that the eyes are the major landmarks when observing a sad emotion. Similarly, a DNN model extracts discriminative features for FER. It is therefore beneficial to apply class activation mapping to identify the discriminative features learned by the network at each layer. It has been shown that the class activation mapping method can be utilized for localization recognition around the eyes for movement analysis purposes 37 , 38 . The produced feature maps could provide a better understanding of the performance of the developed model.

In this study, the squeeze-and-excitation module (SENet) was used with ResNet-18 to achieve a relatively light model for FER. This model has fewer trainable parameters (approximately 11.27 million) than the approximately 23 million parameters required for ResNet-50 and the approximately 86 million parameters of the vision transformer. The effectiveness of the proposed approach was evaluated on two FER datasets, namely, AffectNet and the Real-World Affective Faces Database (RAF-DB). Both datasets contain a great quantity of facial emotion data, including those from various cultures and races. The number of images in AffectNet is about 20 times that of RAF-DB, and the images in AffectNet are more diverse and wilder than those in RAF-DB. The neural network was trained to extract emotional information from AffectNet and RAF-DB, and a cross-database validation between the AffectNet dataset and the RAF-DB was conducted. The results show that a training accuracy of 79.08% and a validation accuracy of 56.54% were achieved with AffectNet, while a training accuracy of 76.51% and a validation accuracy of 65.67% were achieved with RAF-DB. Transfer learning was applied to RAF-DB with pretrained weights obtained from AffectNet. The prediction accuracy after transfer learning increases dramatically on the RAF-DB dataset. The results suggest that transfer learning can be conducted for a smaller dataset from a particular culture, region, or social setting 36 for specific applications. Transfer learning enables the model to learn the facial emotions of a particular population from a smaller database and achieve accurate results. Moreover, the images in AffectNet and RAF-DB with softmax scores exceeding 90% were selected to identify the important facial landmarks captured by the network. It is found that in the shallow layers, the extracted dominant features are fine lines, whereas in the deep layers, the regions near the mouth and nose are more important.

Database and model

The AffectNet database contains 456,349 images of facial emotions obtained from three search engines, Google, Bing and Yahoo, in six different languages. The images were labeled with the following 11 emotions: neutrality, happiness, sadness, surprise, fear, disgust, anger, contempt, none, uncertain, and nonface. Among these emotions, “uncertain” means that the given image cannot be classified into one of the other categories, and “nonface” means that the image contains exaggerated expressions, animations, drawings, or watermarks. Mollahosseini et al . 15 hired annotators to manually classify emotions defined in AffectNet. In addition, AffectNet is heavily imbalanced in terms of the number of images of each emotion category. For example, the number of images representing “happy” is almost 30 times higher than the number of images representing “disgust”. The number of images for each category is shown in Table 1 . Figure  1 shows sample images for the 11 emotions contained in AffectNet. In this study, we use seven categories, surprise, fear, disgust, anger, sadness, happiness and neutrality, in AffectNet.
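As a rough sketch of this selection step, the snippet below filters an AffectNet-style annotation table down to the seven categories used here and prints the class counts. The column names, label order, and toy table are assumptions for illustration; the real annotation files may be organized differently.

```python
import pandas as pd

# Assumed AffectNet-style label order; only the first seven categories are kept.
AFFECTNET_LABELS = ["neutral", "happy", "sad", "surprise", "fear", "disgust",
                    "anger", "contempt", "none", "uncertain", "nonface"]
KEEP = {"neutral", "happy", "sad", "surprise", "fear", "disgust", "anger"}

# Toy annotation table standing in for the real annotation files.
annotations = pd.DataFrame({
    "image_path": [f"img_{i}.jpg" for i in range(8)],
    "expression": [0, 1, 1, 5, 7, 9, 3, 10],          # integer label per image
})
annotations["label_name"] = annotations["expression"].map(lambda i: AFFECTNET_LABELS[i])
subset = annotations[annotations["label_name"].isin(KEEP)]   # 7-class subset
print(subset["label_name"].value_counts())                   # exposes the class imbalance
```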

figure 1

Image categories of the faces contained in the AffectNet database 12 .

The RAF-DB is provided by the Pattern Recognition and Intelligent System Laboratory (PRIS Lab) of the Beijing University of Posts and Telecommunications 39 . The database consists of more than 300,000 facial images sourced from the internet, which are classified into seven categories: surprise, fear, disgust, anger, sadness, happiness and neutrality. Each image contains 5 accurate landmark locations and 37 automatic landmark locations. The RAF-DB also contains a wide variety of information in terms of age, race, head pose, light exposure and occlusion. The training set contains five times as many images as the test set. Figure  2 shows sample images for the seven emotions contained in the RAF-DB. Table 1 shows the number of images used in this article for each emotion from each database.

figure 2

Image categories of the faces contained in the RAF-DB database 37 .

SENet is a new image recognition architecture developed in 2017 40 . The network reinforces critical features by comparing the correlations among feature channels to achieve increased classification accuracy. Figure  3 shows the SENet architecture, which contains three major operations. The squeeze operation extracts global feature information from the previous convolution layer and conducts global average pooling on the feature map to obtain a feature tensor \(\mathbf{z}\) of size 1 × 1 × \(C\) (number of channels), in which the \(c\)-th element is calculated by:

$$z_{c} = F_{sq}\left(u_{c}\right) = \frac{1}{W \times H}\sum_{i=1}^{W}\sum_{j=1}^{H} u_{c}(i,j), \quad (1)$$

where \(F_{sq}\) is the global average pooling operation, \(u_{c}\) is the \(c\)-th 2-dimensional matrix, W × H represents the dimensions of each channel, and C is the number of channels.

figure 3

The schema of the SENet inception module.

Equation ( 1 ) is followed by two fully connected layers. The first layer reduces the number of channels from \(C\) to \(C/r\) to reduce the required number of computations (r is the compression rate), and the second layer increases the number of channels back to \(C\). The excitation operation is defined as follows:

$$s = F_{ex}(z, W) = \sigma\left(W_{2}\,\delta\left(W_{1} z\right)\right), \quad (2)$$

where \(\sigma\) is the sigmoid activation function, \(\delta\) is the rectified linear unit (ReLU) activation function, and \(W_{1}\) and \(W_{2}\) are the weights for reducing and increasing the dimensionality, respectively.

The scale operation multiplies the feature tensor by the excitation output. This operation captures the significance of each channel via feature learning: each channel is multiplied by its learned weight so that the network can distinguish major from minor information 38 . The formula for the scale operation, which is used to obtain the final output of the block, is as follows:

$$\tilde{x}_{c} = F_{scale}\left(u_{c}, s_{c}\right) = s_{c} \cdot u_{c}, \quad (3)$$

where the dot is the channelwise multiplication operation and \(s_{c}\) is the \(c\)-th element of the excitation output.
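A minimal PyTorch sketch of an SE block implementing Eqs. (1)–(3) is given below. The reduction ratio r and other implementation details are assumptions for illustration; the exact implementation used in this work may differ.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation block: squeeze (Eq. 1), excitation (Eq. 2), scale (Eq. 3)."""
    def __init__(self, channels: int, r: int = 16):   # r: compression rate (assumed value)
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)         # global average pooling
        self.excitation = nn.Sequential(
            nn.Linear(channels, channels // r, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = u.shape
        z = self.squeeze(u).view(b, c)                 # (B, C) channel descriptor
        s = self.excitation(z).view(b, c, 1, 1)        # channel weights in [0, 1]
        return u * s                                   # channel-wise rescaling
```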

ResNet was proposed by He et al . 11 to solve the vanishing gradient problem in deep networks. ResNet introduces a residual block to a conventional CNN. Figure  4 shows the residual block in the ResNet architecture. The concept of a residual block is to add the input of a block to its output through a skip connection, so that the output from the previous convolutional layer is combined with that of the next convolutional layer in the ResNet. It has been shown in several studies that residual blocks relieve the vanishing gradient issue encountered by deeper networks. Therefore, residual blocks have been adopted in several architectures 37 , 38 .

figure 4

Residual block of the ResNet architecture.

SE-ResNet combines the SENet and ResNet architectures presented above and adds the SE block from SENet to ResNet. The SE block is used to capture the significance of each channel to determine whether it contains major or minor information. The feature information from the previous convolutional layer is then combined with the next layer by the residual block. This method can mitigate the decreasing accuracy caused by the vanishing gradient problem that occurs as the number of network layers increases. Figure  5 shows the network architecture of SE-ResNet.

figure 5

The schema of the SE-Resnet module.
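To make the combination concrete, the sketch below builds on the previous snippet and adds the SE block to a ResNet-18-style basic block, reweighting the channels before the residual addition. It is an illustrative reconstruction under the stated assumptions, not the authors' exact code.

```python
class SEBasicBlock(nn.Module):
    """ResNet-18-style basic block with an SE block applied before the skip connection."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1, r: int = 16):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.se = SEBlock(out_ch, r)
        self.relu = nn.ReLU(inplace=True)
        # 1x1 projection when the spatial size or channel count changes.
        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.se(out)                        # reweight channels (SE block)
        return self.relu(out + self.shortcut(x))  # residual addition
```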

Experimental method

In this study, we extracted seven categories from AffectNet to ensure that AffectNet and the RAF-DB were validated with identical categories. The SE-ResNet architecture was adopted as the neural network model for training and testing. A comparison and cross-database validation were conducted between RAF-DB and AffectNet. To achieve better performance, the transfer learning technique was used. The model trained on AffectNet was applied as the pretrained model to train RAF-DB.

The feature maps derived from each SE block were extracted and visualized to determine which facial landmarks carry major information for the network. Only facial emotion images with softmax scores exceeding 90% were adopted to ensure objectivity and accuracy. Examples of the feature maps extracted from AffectNet are shown in Fig.  6 , and the feature maps extracted from the RAF-DB are shown in Fig.  7 .
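One possible way to obtain such feature maps is to register forward hooks on the SE blocks and keep only images whose softmax score exceeds 0.9, as sketched below. The tiny stand-in network reuses the SEBlock/SEBasicBlock sketches above; the full SE-ResNet-18 and the visualization code are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feature_maps = {}

def make_hook(name):
    def hook(module, inputs, output):
        feature_maps[name] = output.detach().cpu()   # store the SE block output
    return hook

# Tiny stand-in network built from the earlier sketches.
model = nn.Sequential(
    SEBasicBlock(3, 64),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 7),                                # 7 emotion categories
)
for name, module in model.named_modules():
    if isinstance(module, SEBlock):
        module.register_forward_hook(make_hook(name))

model.eval()
with torch.no_grad():
    images = torch.randn(8, 3, 224, 224)             # placeholder image batch
    probs = F.softmax(model(images), dim=1)
    scores, preds = probs.max(dim=1)
    keep = scores > 0.9                              # only high-confidence images are visualized
```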

figure 6

Feature maps of different SE block layers (AffectNet).

figure 7

Feature maps of different SE block layers (RAF-DB).

In this experiment, the training hardware was an NVIDIA TITAN RTX 24-GB GPU. The input image size was 256 × 256 pixels, and data augmentation was applied. For the training process, the color tones of the input images were randomly altered, the images were randomly rotated between +/− 30 degrees, and they were cropped according to the four corners and the center into five images of size 224 × 224 pixels. For validation purposes, the input images were cropped from the center to a final size of 224 × 224 pixels. The optimization algorithm was stochastic gradient descent, and the loss function was the cross-entropy loss. Twenty epochs were used, and the initial learning rate was set to 0.01. The momentum was 0.9, and the batch size for training was 100.
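The augmentation and optimizer settings described above can be written down roughly as follows. A torchvision ResNet-18 stands in for the SE-ResNet model, the color-jitter strengths are assumptions, and a single random crop replaces the five-crop scheme for brevity (torchvision's FiveCrop could be used instead); the training loop itself is omitted.

```python
import torch
from torchvision import models, transforms

train_tf = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),  # tone changes (values assumed)
    transforms.RandomRotation(30),              # +/- 30 degrees
    transforms.RandomCrop(224),                 # simplified stand-in for the five-crop scheme
    transforms.ToTensor(),
])
val_tf = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

model = models.resnet18(num_classes=7)          # stand-in for SE-ResNet-18
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()
# Training: 20 epochs, batch size 100 (loop omitted).
```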

Results and discussion

Cross-database validation

The AffectNet dataset and the RAF-DB were cross-database validated in this study. The model trained on AffectNet was used to predict the RAF-DB, and the model trained on the RAF-DB was used to predict AffectNet. The results are shown in Table 2 . Because AffectNet exhibits more diversity in terms of facial emotion data and contains more images, the model trained on AffectNet achieved an accuracy of 77.37% when predicting the RAF-DB, which was significantly higher than the accuracy achieved by training directly on the RAF-DB (65.67%). In contrast, low accuracy (42.6%) was obtained for AffectNet predicted by the model trained on the RAF-DB. The difference can be explained by the fact that the images in AffectNet are greater in quantity and more complex.

The accuracies achieved on AffectNet and RAF-DB by SE-ResNet were compared in this study. RAF-DB results in a higher accuracy than AffectNet, as shown in Table 3 . However, this was expected since the RAF-DB dataset exhibits more constrained images. The accuracy of the proposed model on AffectNet is 56%, which is slightly lower than the 58% accuracy obtained in the original paper 19 that proposed AffectNet. However, as mentioned in the original paper 15 , the agreement between two human annotators was 60% over 36,000 images. Our result is comparable to this agreement rate.

Additionally, we performed transfer learning by pretraining the model on AffectNet, followed by training on the RAF-DB. As shown in Table 4 , the validation accuracy on the RAF-DB increased by 26.95% relative to the model trained directly on the RAF-DB ((83.37 - 65.67) / 65.67 × 100%). Compared to the accuracy of 76.73% obtained in 21 by a multi-region ensemble CNN, transfer learning with a single network performs better than an ensemble CNN that utilizes global and local features. This result indicates that AffectNet provides useful pretrained weights because of the wide diversity of the dataset. The diverse cultural and racial backgrounds of the images in the AffectNet dataset provide a more representative and inclusive training set, leading to a more robust and accurate recognition system. The result highlights the significance of considering data diversity and transfer learning in the development and deployment of FER algorithms.
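A bare-bones sketch of this transfer-learning step is shown below, again with a plain ResNet-18 standing in for SE-ResNet: the weights learned on AffectNet initialize the model that is then fine-tuned on RAF-DB. The relative improvement quoted above is also recomputed.

```python
import torch
from torchvision import models

affectnet_model = models.resnet18(num_classes=7)           # assume this was trained on AffectNet
rafdb_model = models.resnet18(num_classes=7)
rafdb_model.load_state_dict(affectnet_model.state_dict())  # transfer the pretrained weights

# Fine-tune on RAF-DB with the same optimizer settings as before.
optimizer = torch.optim.SGD(rafdb_model.parameters(), lr=0.01, momentum=0.9)

relative_gain = (83.37 - 65.67) / 65.67 * 100              # ~26.95% relative improvement
print(f"{relative_gain:.2f}%")
```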

The normalized confusion matrices predicted by the model trained on AffectNet for AffectNet and RAF-DB are shown in Fig.  8 a and b, respectively. The normalized confusion matrix predicted by the model after transfer learning on RAF-DB is given in Fig.  8 c. Figure  8 a and b show that the model tends to falsely classify images as “neutral”, which suggests that the discriminative features learned from AffectNet for “neutral” are similar to those of the other categories. Moreover, the comparison between Fig.  8 b and c shows that after transfer learning, the model classifies the emotions in the RAF-DB more accurately and more evenly.

figure 8

Normalized confusion matrix for AffectNet and RAF-DB ( a ) AffectNet, ( b ) RAF-DB and ( c ) RAF-DB with pretrained model.

It can be seen from the normalized confusion matrices that the classification accuracy is positively correlated with the number of images in the dataset, as given in Table 1 . In Fig.  8 a, the AffectNet dataset contains the fewest “disgust” images, which results in the lowest accuracy in the normalized confusion matrix. In contrast, the “happy” category has the most images in AffectNet and therefore yields the highest accuracy in the normalized confusion matrix. The same conclusion can be drawn from Fig.  8 b and c for RAF-DB.
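For reference, a row-normalized confusion matrix of the kind shown in Fig. 8 can be computed as in the sketch below, where each row is divided by the number of images of the true category. The labels and data are placeholders.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["surprise", "fear", "disgust", "anger", "sadness", "happiness", "neutrality"]
rng = np.random.default_rng(0)
y_true = rng.integers(0, len(labels), size=1000)     # placeholder true labels
y_pred = rng.integers(0, len(labels), size=1000)     # placeholder predictions

cm = confusion_matrix(y_true, y_pred, labels=list(range(len(labels))))
cm_normalized = cm.astype(float) / cm.sum(axis=1, keepdims=True)  # each row sums to 1
print(np.round(cm_normalized, 3))
```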

Feature maps

This study examines the important features that the network learns to classify facial emotions. The feature maps in AffectNet with softmax scores (P) exceeding 90% are visualized in Fig.  9 . They show that the mouth, nose, and other facial lines carry the major information, while the eyes and ears carry minor information. This is similar to the finding of Beaudry et al . 35 that the mouth is the major landmark when the neural network predicts a happy emotion. The feature maps of misclassified images are also visualized in Fig.  10 for comparison with those that were correctly classified. The important features in the misclassified images are similar to those in the correctly classified images. It can be observed from Figs. 9 and 10 that the network tends to detect edges and lines in shallow layers and focuses more on local features, such as the mouth and nose, in deeper layers.
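One common way to obtain feature maps of this kind is to register forward hooks on intermediate convolutional stages and inspect the captured activations. The sketch below assumes a torchvision ResNet-18 backbone; the chosen layers and the channel-averaging step are illustrative, not the paper's exact visualization method.

```python
# Sketch: capturing intermediate feature maps with forward hooks (PyTorch).
import torch
from torchvision import models

model = models.resnet18(weights=None)
model.eval()

feature_maps = {}

def save_map(name):
    def hook(module, inputs, output):
        feature_maps[name] = output.detach()
    return hook

model.layer1.register_forward_hook(save_map("shallow"))  # tends to respond to edges and lines
model.layer4.register_forward_hook(save_map("deep"))     # tends to respond to local parts (mouth, nose)

x = torch.randn(1, 3, 224, 224)          # placeholder for a preprocessed face image
with torch.no_grad():
    probs = torch.softmax(model(x), dim=1)

for name, fmap in feature_maps.items():
    # Average over channels to get a single 2-D saliency-like map per stage.
    print(name, fmap.mean(dim=1).shape)
```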

Figure 9. Feature maps with a softmax score greater than 90% (AffectNet).

Figure 10. Misclassified feature maps (AffectNet).

Asian facial emotion

The Asian facial emotion dataset 41 consists of images of 29 actors aged from 19 to 67 years. The images were taken from frontal, 3/4 sideways, and sideways angles. Figure  11 shows some example images from the Asian facial emotion dataset, and the number of images in each class is given in Table 5 . Only six labeled categories are provided; the "neutrality" category is absent. Therefore, in the output layer of the model, which was trained to predict the probabilities of seven categories, the probability for "neutrality" was specified as zero.
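The way the missing "neutrality" output can be handled is sketched below. This is a hypothetical illustration: the class index and the optional renormalization step are assumptions, not details taken from the paper.

```python
# Sketch: masking the "neutral" class when scoring a 6-class dataset
# with a 7-class model. NEUTRAL_IDX is an assumed class index.
import torch

NEUTRAL_IDX = 6

def masked_prediction(logits: torch.Tensor) -> torch.Tensor:
    probs = torch.softmax(logits, dim=1)
    probs[:, NEUTRAL_IDX] = 0.0                      # "neutrality" is not in the dataset
    probs = probs / probs.sum(dim=1, keepdim=True)   # optional: renormalize remaining classes
    return probs.argmax(dim=1)

logits = torch.randn(4, 7)          # placeholder batch of model outputs
print(masked_prediction(logits))
```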

Figure 11. Example images from the Asian facial emotion dataset 41 .

The Asian facial emotion dataset was tested with the model trained on AffectNet. The images were resized to 256 × 256 pixels and then cropped to 224 × 224 pixels with their faces centered. The derived average accuracy was 61.99%, which was slightly higher than that of AffectNet. Similar to the validation results of AffectNet, the "happy" category yielded the highest score, while "fear" and "disgust" had the lowest scores. The normalized confusion matrix is shown in Fig.  12 , and the feature maps are shown in Fig.  13 . In contrast with the feature maps of AffectNet, the discriminative locations were not centered around the mouth and nose but were located more on the right half of the face. This indicates that the model lacked generalizability for Asian faces in the laboratory setting, and that the model trained on AffectNet has limited prediction performance on other datasets.
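The resize-and-crop preprocessing described above can be expressed with standard torchvision transforms, as in the sketch below; the synthetic image is a placeholder, and the ToTensor step is an assumption about the rest of the pipeline.

```python
# Sketch of the preprocessing described above: resize to 256 x 256,
# then center-crop to 224 x 224 (torchvision).
import numpy as np
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# Placeholder for a face-centered RGB image.
img = Image.fromarray(np.random.randint(0, 255, (512, 512, 3), dtype=np.uint8))
tensor = preprocess(img)   # shape: (3, 224, 224)
print(tensor.shape)
```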

Figure 12. Normalized confusion matrix produced for the Asian facial emotion dataset tested with the model trained on AffectNet.

Figure 13. Feature maps produced for the Asian facial emotion dataset.

The process of interpreting facial expressions is also subject to cultural and individual differences that are not considered by the model during the training phase. The feature maps in Figs. 9 and 10 show that the proposed model focused more on the mouth and nose and less on the eyes. To obtain correct FER results, subtle features such as wrinkles and the eyes may also be critical; however, the proposed model does not capture features that are far from the mouth or nose. The test results obtained on the Asian facial emotion dataset show that the discriminative regions are skewed toward the right half of the face, indicating the limited generalizability of the model to Asian faces in the laboratory setting. Although AffectNet is a diverse dataset containing representations from various cultures and races, it still covers only a tiny portion of the global population. In contrast, the RAF-DB contains ethnic groups and settings similar to those of AffectNet. The validation result obtained on the RAF-DB (77.37%) is better than that on the Asian facial emotion dataset. These results show that, for datasets with similar ethnic groups, a model trained on a more diverse and wilder dataset (AffectNet) performs better on a more constrained dataset (the RAF-DB in this work).

This study addresses how the neural network model learns to identify facial emotions. The features displayed in emotion images were derived with a CNN, and these emotional features were visualized to determine the facial landmarks that contain the major information. Conclusions drawn from the findings are listed below.

A cross-database validation experiment was conducted for AffectNet and the RAF-DB. An accuracy of 77.37% was achieved when the RAF-DB was predicted by the model trained on AffectNet; this accuracy is comparable to the result in 21 . An accuracy of 42.6% was achieved when AffectNet was predicted by the model trained on the RAF-DB. These results agree with the fact that AffectNet exhibits more diversity than the RAF-DB in terms of facial emotion images. Moreover, transfer learning dramatically increases the accuracy on the RAF-DB by 26.95%. This finding highlights the significance of using transfer learning, with pretrained weights obtained on AffectNet, to improve the performance of FER algorithms.

The visualized emotion feature maps show that the mouth and nose contain the major information, while the eyes and ears contain minor information when the neural network learns to perform FER. This paradigm is similar to how humans observe emotions.

When comparing the feature maps that were correctly classified (those with softmax scores exceeding 90%) with those that were incorrectly classified, it can be seen that the network model focuses on similar features with no major differences. This result indicates that FER requires the observation of large patches near distinctive areas on a face.

Data availability

The datasets applied in this study are available with authorization from the following websites for AffectNet ( http://mohammadmahoor.com/affectnet/ ), the Real-World Affective Faces Database (RAF-DB; http://www.whdeng.cn/raf/model1.html ) and the Asian facial emotion dataset ( http://mil.psy.ntu.edu.tw/ssnredb/logging.php?action=login ). However, restrictions apply to the availability of these data, which were used under license for the current study and thus are not publicly available. The data are, however, available from the authors upon reasonable request and with permission from AffectNet, the RAF-DB and the Asian facial emotion dataset. The training and analysis processes are discussed in the research methodology.

Vo, T. H., Lee, G. S., Yang, H. J. & Kim, S. H. Pyramid with super resolution for in-the-wild facial expression recognition. IEEE Access 8 , 131988–132001 (2020).


Mehrabian, A. Nonverbal communication (Aldine Transaction, 2007).

Ekman, P. Darwin, deception, and facial expression. Ann. N. Y. Acad. Sci. 1000, 205–221 (2006).

Farzaneh, A. H. & Qi, X. Facial expression recognition in the wild via deep attentive center loss in 2021 IEEE winter conference on applications of computer vision (WACV) 2401–2410 (IEEE, 2021).

Alnuaim, A. A. et al. Human-computer interaction for recognizing speech emotions using multilayer perceptron classifier. J. Healthc. Eng. 2022 , 6005446 (2022).


Kumari, H. M. L. S. Facial expression recognition using convolutional neural network along with data augmentation and transfer learning (2022).

Ekman, P., Dalgleish, T. & Power, M. Handbook of cognition and emotion (Wiley, 1999).

Ekman, P. Are there basic emotions?. Psychol. Rev. 99 , 550–553 (1992).


Russell, J. A. A circumplex model of affect. J. Pers. Soc. Psychol. 39 , 1161–1178 (1980).

Goodfellow, I. J. et al. Challenges in representation learning: A report on three machine learning contests in Neural information processing (eds. Lee, M., Hirose, A., Hou, Z. & Kil, R) 117–124 (Springer, 2013).

Maithri, M. et al. Automated emotion recognition: Current trends and future perspectives. Comput. Method Prog. Biomed. 215 , 106646 (2022).


Li, S. & Deng, W. Deep facial expression recognition: A survey. IEEE Trans. Affect. Comput. 13 , 1195–1215 (2022).

Canal, F. Z. et al. A survey on facial emotion recognition techniques: A state-of-the-art literature review. Inf. Sci. 582 , 593–617 (2022).

He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition in 2016 IEEE conference on computer vision and pattern recognition (CVPR) 770–778 (IEEE, 2016).

Mollahosseini, A., Hasani, B. & Mahoor, M. H. AffectNet: A database for facial expression, valence, and arousal computing in the wild. IEEE Trans. Affect. Comput. 10 , 18–31 (2019).

Schoneveld, L. & Othmani, A. Towards a general deep feature extractor for facial expression recognition in 2021 IEEE international conference on image processing (ICIP) 2339–2342 (IEEE, 2021).

Rajan, V., Brutti, A. & Cavallaro, A. Is cross-attention preferable to self-attention for multi-modal emotion recognition? in ICASSP 2022–2022 IEEE international conference on acoustics, speech and signal processing (ICASSP) 4693–4697 (IEEE, 2022).

Zhuang, X., Liu, F., Hou, J., Hao, J. & Cai, X. Transformer-based interactive multi-modal attention network for video sentiment detection. Neural Process. Lett. 54 , 1943–1960 (2022).

Zhang, Y., Wang, C., Ling, X. & Deng, W. Learn from all: Erasing attention consistency for noisy label facial expression recognition in Lecture notes in computer science (eds. Avidan, S., Brostow, G., Cissé, M., Farinella, G. M. & Hassner T.) 418–434 (Springer, 2022).

Savchenko, A. V., Savchenko, L. V. & Makarov, I. Classifying emotions and engagement in online learning based on a single facial expression recognition neural network. IEEE Trans. Affect. Comput. 13 , 2132–2143 (2022).

Fan, Y., Lam, J. C. K. & Li, V. O. K. Multi-region ensemble convolutional neural network for facial expression recognition in Artificial neural networks and machine learning—ICANN 2018 (eds. Kůrková, V., Manolopoulos, Y., Hammer, B., Iliadis, L. & Maglogiannis, I.) 84–94 (Springer International Publishing, 2018).

Wang, Z., Zeng, F., Liu, S. & Zeng, B. OAENet: Oriented attention ensemble for accurate facial expression recognition. Pattern Recognit. 112 , 107694 (2021).

Schoneveld, L., Othmani, A. & Abdelkawy, H. Leveraging recent advances in deep learning for audio-Visual emotion recognition. Pattern Recognit. Lett. 146 , 1–7 (2021).


Hwooi, S. K. W., Othmani, A. & Sabri, A. Q. M. Deep learning-based approach for continuous affect prediction from facial expression images in valence-arousal space. IEEE Access 10 , 96053–96065 (2022).

Sun, L., Lian, Z., Tao, J., Liu, B. & Niu, M. Multi-modal continuous dimensional emotion recognition using recurrent neural network and self-attention mechanism in Proceedings of the 1st international on multimodal sentiment analysis in real-life media challenge and workshop 27–34 (ACM, 2020).

Allognon, S. O. C., de S. Britto, A. & Koerich, A. L. Continuous emotion recognition via deep convolutional autoencoder and support vector regressor in 2020 international joint conference on neural networks (IJCNN) 1–8 (IEEE, 2020).

Huang, C. Combining convolutional neural networks for emotion recognition in 2017 IEEE MIT undergraduate research technology conference (URTC) 1–4 (IEEE, 2017).

Mao, J. et al. POSTER V2: A simpler and stronger facial expression recognition network. arXiv preprint arXiv:2301.12149 (2023).

Le, N. et al. Uncertainty-aware label distribution learning for facial expression recognition in 2023 IEEE/CVF winter conference on applications of computer vision (WACV) 6088–6097 (IEEE, 2023).

Singh, S. & Prasad, S. V. A. V. Techniques and challenges of face recognition: A critical review. Proc. Comput. Sci. 143 , 536–543 (2018).

Kortli, Y., Jridi, M., Falou, A. A. & Atri, M. Face recognition systems: A survey. Sensors (Basel, Switzerland) 20 , 342 (2020).


Shirazi, M. S. & Bati, S. Evaluation of the off-the-shelf CNNs for facial expression recognition in Lecture notes in networks and systems (ed. Arai, K.) 466–473 (Springer, 2022).

Chen, D., Wen, G., Li, H., Chen, R. & Li, C. Multi-relations aware network for in-the-wild facial expression recognition. IEEE Trans. Circuits Syst. Video Technol. https://doi.org/10.1109/tcsvt.2023.3234312 (2023).

Heidari, N. & Iosifidis, A. Learning diversified feature representations for facial expression recognition in the wild. arXiv preprint arXiv:2210.09381 (2022).

Beaudry, O., Roy-Charland, A., Perron, M., Cormier, I. & Tapp, R. Featural processing in recognition of emotional facial expressions. Cogn. Emot. 28 , 416–432 (2013).


Bhattacharyya, A. et al. A deep learning model for classifying human facial expressions from infrared thermal images. Sci. Rep. 11 , 20696 (2021).


Alp, N. & Ozkan, H. Neural correlates of integration processes during dynamic face perception. Sci. Rep. 12 , 118 (2022).

Siddiqi, M. H. Accurate and robust facial expression recognition system using real-time YouTube-based datasets. Appl. Intell. 48 , 2912–2929 (2018).

Li, S., Deng, W. H. & Du, J. P. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild in 2017 IEEE conference on computer vision and pattern recognition (CVPR) 2584–2593 (IEEE, 2017).

Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks in 2018 IEEE/CVF conference on computer vision and pattern recognition 7132–7141 (IEEE, 2018).

Chen, C. C., Cho, S. L. & Tseng, R. Y. Taiwan corpora of Chinese emotions and relevant psychophysiological data-Behavioral evaluation norm for facial expressions of professional performer. Chin. J. Psychol. 55 , 439–454 (2013).



Acknowledgements

This work was funded in part by National Science and Technology Council (project number MOST 111-2635-E-242-001 -).

Author information

Authors and affiliations.

Department of Mechanical Engineering, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan

Zi-Yu Huang, Chia-Chin Chiang & Hsin-Lung Chung

Graduate Institute of Applied Physics, National Chengchi University, Taipei, Taiwan

Jian-Hao Chen & Hsiu-Chuan Hsu

Department of Occupational Safety and Hygiene, Fooyin University, Kaohsiung, Taiwan

Yi-Chian Chen

Department of Nursing, Hsin Sheng Junior College of Medical Care and Management, Taoyuan, Taiwan

Yu-Ping Cai

Department of Computer Science, National Chengchi University, Taipei, Taiwan

Hsiu-Chuan Hsu


Contributions

Z.-Y. Huang contributed to writing the manuscript. C.-C. Chiang contributed to overseeing and finalizing the paper. J.-H. Chen conducted all computations and contributed equally as the first author. Y.-C. Chen contributed to designing the research and editing the manuscript. H.-L. Chung contributed to editing the manuscript. Y.-P. C. assessed the emotion classification field and contributed to the literature review. H.-C. H. designed the study and provided conceptual guidance. All authors discussed and reviewed the manuscript.

Corresponding authors

Correspondence to Yi-Chian Chen or Hsiu-Chuan Hsu .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Huang, ZY., Chiang, CC., Chen, JH. et al. A study on computer vision for facial emotion recognition. Sci Rep 13 , 8425 (2023). https://doi.org/10.1038/s41598-023-35446-4


Received : 08 December 2022

Accepted : 18 May 2023

Published : 24 May 2023

DOI : https://doi.org/10.1038/s41598-023-35446-4




A Brief Review of Facial Emotion Recognition Based on Visual Information

Facial emotion recognition (FER) is an important topic in the fields of computer vision and artificial intelligence owing to its significant academic and commercial potential. Although FER can be conducted using multiple sensors, this review focuses on studies that exclusively use facial images, because visual expressions are one of the main information channels in interpersonal communication. This paper provides a brief review of research in the field of FER conducted over the past decades. First, conventional FER approaches are described along with a summary of the representative categories of FER systems and their main algorithms. Deep-learning-based FER approaches using deep networks enabling "end-to-end" learning are then presented. This review also focuses on an up-to-date hybrid deep-learning approach combining a convolutional neural network (CNN) for the spatial features of an individual frame and long short-term memory (LSTM) for the temporal features of consecutive frames. In the later part of this paper, a brief review of publicly available evaluation metrics is given, along with a comparison with benchmark results, which serve as a standard for the quantitative comparison of FER studies. This review can serve as a brief guidebook for newcomers in the field of FER, providing basic knowledge and a general understanding of the latest state-of-the-art studies, as well as for experienced researchers looking for productive directions for future work.

1. Introduction

Facial emotions are important factors in human communication that help us understand the intentions of others. In general, people infer the emotional states of other people, such as joy, sadness, and anger, using facial expressions and vocal tone. According to different surveys [ 1 , 2 ], verbal components convey one-third of human communication, and nonverbal components convey two-thirds. Among several nonverbal components, by carrying emotional meaning, facial expressions are one of the main information channels in interpersonal communication. Therefore, it is natural that research on facial emotion has been gaining a lot of attention over the past decades, with applications not only in the perceptual and cognitive sciences, but also in affective computing and computer animation [ 2 ].

Interest in automatic facial emotion recognition (FER) (Expanded form of the acronym FER is different in every paper, such as facial emotion recognition and facial expression recognition. In this paper, the term FER refers to facial emotion recognition as this study deals with the general aspects of recognition of facial emotion expression.) has also been increasing recently with the rapid development of artificial intelligent techniques, including in human-computer interaction (HCI) [ 3 , 4 ], virtual reality (VR) [ 5 ], augment reality (AR) [ 6 ], advanced driver assistant systems (ADASs) [ 7 ], and entertainment [ 8 , 9 ]. Although various sensors such as an electromyograph (EMG), electrocardiogram (ECG), electroencephalograph (EEG), and camera can be used for FER inputs, a camera is the most promising type of sensor because it provides the most informative clues for FER and does not need to be worn.

This paper first divides research on automatic FER into two groups according to whether the features are handcrafted or generated through the output of a deep neural network.

In conventional FER approaches, the FER is composed of three major steps, as shown in Figure 1 : (1) face and facial component detection, (2) feature extraction, and (3) expression classification. First, a face image is detected from an input image, and facial components (e.g., eyes and nose) or landmarks are detected from the face region. Second, various spatial and temporal features are extracted from the facial components. Third, the pre-trained FE classifiers, such as a support vector machine (SVM), AdaBoost, and random forest, produce the recognition results using the extracted features.

Figure 1. Procedure used in conventional FER approaches: From input images (a), the face region and facial landmarks are detected (b), spatial and temporal features are extracted from the face components and landmarks (c), and the facial expression is determined based on one of the facial categories using pre-trained pattern classifiers (d) (face images are taken from the CK+ dataset [ 10 ]).

In contrast to traditional approaches using handcrafted features, deep learning has emerged as a general approach to machine learning, yielding state-of-the-art results in many computer vision studies with the availability of big data [ 11 ].

Deep-learning-based FER approaches highly reduce the dependence on face-physics-based models and other pre-processing techniques by enabling "end-to-end" learning to occur in the pipeline directly from the input images [ 12 ]. Among the several deep-learning models available, the convolutional neural network (CNN), a particular type of deep learning, is the most popular network model. In CNN-based approaches, the input image is convolved through a filter collection in the convolution layers to produce a feature map. Each feature map is then combined in fully connected networks, and the facial expression is recognized as belonging to a particular class based on the output of the softmax algorithm. Figure 2 shows the procedure used by CNN-based FER approaches.

Figure 2. Procedure of CNN-based FER approaches: (a) The input images are convolved using filters in the convolution layers. (b) From the convolution results, feature maps are constructed and max-pooling (subsampling) layers lower the spatial resolution of the given feature maps. (c) CNNs apply fully connected neural-network layers behind the convolutional layers, and (d) a single facial expression is recognized based on the output of softmax (face images are taken from the CK+ dataset [ 10 ]).

FER can also be divided into two groups according to whether it uses frame or video images [ 13 ]. First, static (frame-based) FER relies solely on static facial features obtained by extracting handcrafted features from selected peak expression frames of image sequences. Second, dynamic (video-based) FER utilizes spatio-temporal features to capture the expression dynamics in facial expression sequences. Although dynamic FER is known to have a higher recognition rate than static FER because it provides additional temporal information, it does suffer from a few drawbacks. For example, the extracted dynamic features have different transition durations and different feature characteristics of the facial expression depending on the particular faces. Moreover, temporal normalization used to obtain expression sequences with a fixed number of frames may result in a loss of temporal scale information.

1.1. Terminology

Before reviewing researches related to FER, special terminology playing an important role in FER research is listed below:

  • The facial action coding system (FACS) is a system based on facial muscle changes and can characterize facial actions to express individual human emotions as defined by Ekman and Friesen [ 14 ] in 1978. FACS encodes the movements of specific facial muscles called action units (AUs), which reflect distinct momentary changes in facial appearance [ 15 ].
  • Facial landmarks (FLs) are visually salient points in facial regions such as the end of the nose, the ends of the eyebrows, and the mouth, as described in Figure 1 b. The pairwise positions of two landmark points, or the local texture of a landmark, are used as a feature vector for FER. In general, FL detection approaches can be categorized into three types: generative models such as the active shape model (ASM) and active appearance model (AAM), regression-based models combining local and global models, and CNN-based methods. FL models are trained from the appearance and shape variations observed in the data; starting from a coarse initialization, the initial shape is moved to a better position step-by-step until convergence [ 16 ].

Figure 3. Sample examples of various facial emotions and AUs: (a) basic emotions (sad, fearful, and angry) (face images are taken from the CE dataset [ 17 ]), (b) compound emotions (happily surprised, happily disgusted, and sadly fearful) (face images are taken from the CE dataset [ 17 ]), (c) spontaneous expressions (face images are taken from YouTube), and (d) AUs (upper and lower face) (face images are taken from the CK+ dataset [ 10 ]).

  • Compound emotions (CEs) are a combination of two basic emotions. Du et al. [ 17 ] introduced 22 emotions, including seven basic emotions, 12 compound emotions most typically expressed by humans, and three additional emotions (appall, hate, and awe). Figure 3 b shows some examples of CE.
  • Micro expressions (MEs) indicate more spontaneous and subtle facial movements that occur involuntarily. They tend to reveal a person’s genuine and underlying emotions within a short period of time. Figure 3 c shows some examples of MEs.

Table 1. Prototypical AUs observed in each basic and compound emotion category, adapted from [ 18 ].

Table 1 shows the prototypical AUs observed in each basic and compound emotion category.

1.2. Contributions of this Review

Despite the long history of FER research, there are no comprehensive literature reviews on the topic. Some review papers [ 19 , 20 ] have focused solely on conventional approaches without introducing deep-learning-based approaches. Recently, Ghayoumi [ 21 ] introduced a quick review of deep learning in FER; however, only a review of simple differences between conventional approaches and deep-learning-based approaches was provided. Therefore, this paper is dedicated to a brief literature review, from conventional FER to recent advanced FER. The main contributions of this review are as follows:

  • The focus is on providing a general understanding of the state-of-the-art FER approaches, and helping new researchers understand the essential components and trends in the FER field.
  • Various standard databases that include still images and video sequences for FER use are introduced, along with their purposes and characteristics.
  • Key aspects are compared between conventional FER and deep-learning-based FER in terms of accuracy and resource requirements. Although deep-learning-based FER generally produces better accuracy than conventional FER, it also requires a large amount of processing capacity, such as a graphics processing unit (GPU) in addition to the central processing unit (CPU). Therefore, many conventional FER algorithms are still used in embedded systems, including smartphones.
  • A new direction and application for future FER studies are presented.

1.3. Organization of this Review

The remainder of this paper is organized as follows. In Section 2 , conventional FER approaches are described along with a summary of the representative categories of FER systems and their main algorithms. In Section 3 , advanced FER approaches using deep-learning algorithms are presented. In Section 4 and Section 5 , a brief review of publicly available FER database and evaluation metrics with a comparison with benchmark results are provided. Finally, Section 6 offers some concluding remarks and discussion of future work.

2. Conventional FER Approaches

For automatic FER systems, various types of conventional approaches have been studied. The commonality of these approaches is detecting the face region and extracting geometric features, appearance features, or a hybrid of geometric and appearance features on the target face.

For the geometric features, the relationship between facial components is used to construct a feature vector for training [ 22 , 23 ]. Ghimire and Lee [ 23 ] used two types of geometric features based on the position and angle of 52 facial landmark points. First, the angle and Euclidean distance between each pair of landmarks within a frame are calculated, and second, the distance and angles are subtracted from the corresponding distance and angles in the first frame of the video sequence. For the classifier, two methods are presented, either using multi-class AdaBoost with dynamic time warping, or using a SVM on the boosted feature vectors.
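As an illustration of such geometric features, the sketch below computes the pairwise distances and angles between landmark points and their displacement relative to the first frame. It is a rough sketch of the idea, not the cited authors' implementation; the landmark arrays are placeholders.

```python
# Sketch: geometric features from facial landmarks (distances and angles of landmark pairs).
import numpy as np

def pairwise_geometric_features(landmarks: np.ndarray) -> np.ndarray:
    """landmarks: (N, 2) array of (x, y) landmark positions for one frame."""
    feats = []
    n = len(landmarks)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = landmarks[j] - landmarks[i]
            feats.append(np.hypot(dx, dy))     # Euclidean distance of the pair
            feats.append(np.arctan2(dy, dx))   # angle of the pair
    return np.asarray(feats)

frame0 = np.random.rand(52, 2)    # placeholder for 52 tracked landmarks (first frame)
frame_t = np.random.rand(52, 2)   # placeholder for the current frame
delta = pairwise_geometric_features(frame_t) - pairwise_geometric_features(frame0)
print(delta.shape)                # feature vector passed to a classifier such as an SVM
```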

The appearance features are usually extracted from the global face region [ 24 ] or different face regions containing different types of information [ 25 , 26 ]. As an example of using global features, Happy et al. [ 24 ] utilized a local binary pattern (LBP) histogram of different block sizes from a global face region as the feature vectors, and classified various facial expressions using a principal component analysis (PCA). Although this method is implemented in real time, the recognition accuracy tends to be degraded because it cannot reflect local variations of the facial components to the feature vector. Unlike a global-feature-based approach, different face regions have different levels of importance. For example, the eyes and mouth contain more information than the forehead and cheek. Ghimire et al. [ 27 ] extracted region-specific appearance features by dividing the entire face region into domain-specific local regions. Important local regions are determined using an incremental search approach, which results in a reduction of the feature dimensions and an improvement in the recognition accuracy.
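A block-wise LBP histogram feature of the kind described above might be computed as in the following sketch; the block grid, LBP parameters, and normalization are assumptions rather than the exact settings used in the cited work.

```python
# Sketch: block-wise LBP histogram features from a gray-scale face image (scikit-image).
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_block_histogram(gray_face: np.ndarray, blocks: int = 4, bins: int = 59) -> np.ndarray:
    lbp = local_binary_pattern(gray_face, P=8, R=1, method="nri_uniform")
    h, w = lbp.shape
    feats = []
    for by in range(blocks):
        for bx in range(blocks):
            patch = lbp[by * h // blocks:(by + 1) * h // blocks,
                        bx * w // blocks:(bx + 1) * w // blocks]
            hist, _ = np.histogram(patch, bins=bins, range=(0, bins))
            feats.append(hist / (hist.sum() + 1e-8))  # normalized per-block histogram
    return np.concatenate(feats)

face = np.random.randint(0, 256, (128, 128)).astype(np.uint8)  # placeholder gray face crop
print(lbp_block_histogram(face).shape)                         # (blocks * blocks * bins,)
```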

For hybrid features, some approaches [ 18 , 27 ] have combined geometric and appearance features to complement the weaknesses of the two approaches and provide even better results in certain cases.

In video sequences, many systems [ 18 , 22 , 23 , 28 ] measure the geometrical displacement of facial landmarks between the current frame and the previous frame as temporal features, and extract appearance features as the spatial features. The main difference between FER for still images and for video sequences is that the landmarks in the latter are tracked frame-by-frame, and the system generates new dynamic features through displacement between the previous and current frames. Similar classification algorithms are then used in the video sequences, as described in Figure 1 . To recognize micro-expressions, a high-speed camera is used to capture video sequences of the face. Polikovsky et al. [ 29 ] presented facial micro-expression recognition in video sequences captured with a 200 frames per second (fps) high-speed camera. This study divides the face into specific regions, and a 3D-gradient orientation histogram is then generated from the motion in each region for FER.

Apart from FER on 2D images, 3D and 4D (dynamic 3D) recordings are increasingly used in expression analysis research because of the problems presented in 2D images caused by inherent variations in pose and illumination. 3D facial expression recognition generally consists of feature extraction and classification. One thing to note in 3D is that dynamic and static systems are very different because of the nature of the data. Static systems extract features from statistical models such as deformable models, active shape models, analysis of 2D representations, and distance-based features. In contrast, dynamic systems utilize 3D image sequences for the analysis of facial expressions, such as 3D motion-based features. For FER, 3D images also use similar conventional classification algorithms [ 29 , 30 ]. Although 3D-based FER has shown higher performance than 2D-based FER, 3D- and 4D-based FER also have certain problems, such as a high computational cost owing to the high resolution and frame rate, as well as the amount of 3D information involved.

Some researchers [ 31 , 32 , 33 , 34 , 35 ] have tried to recognize facial emotions using infrared images instead of visible light spectrum (VIS) images, because VIS images vary with the illumination conditions. Zhao et al. [ 31 ] used near-infrared (NIR) video sequences and LBP-TOP (local binary patterns from three orthogonal planes) feature descriptors. This study uses component-based facial features to combine geometric and appearance information of the face, and an SVM and sparse representation classifiers are used for FER. Shen et al. [ 32 ] used infrared thermal videos by extracting horizontal and vertical temperature differences from different facial sub-regions; for FER, the AdaBoost algorithm with k-Nearest Neighbor weak classifiers is used. Szwoch and Pieniążek [ 33 ] recognized facial expressions and emotions based only on the depth channel from the Microsoft Kinect sensor, without using a camera. This study uses local movements within the face area as features and recognizes facial expressions using relations between particular emotions. Sujono and Gunawan [ 34 ] used the Kinect motion sensor to detect the face region based on depth information and an active appearance model (AAM) to track the detected face. The role of the AAM is to adjust the shape and texture model to a new face when there is variation in shape and texture compared to the training result. To recognize facial emotion, the change of key features in the AAM and fuzzy logic based on prior knowledge derived from FACS are used. Wei et al. [ 35 ] proposed FER using color and depth information from a Kinect sensor together. This study extracts a facial feature point vector using a face tracking algorithm on the captured sensor data and recognizes six facial emotions with a random forest algorithm.

Commonly, in conventional approaches, the features and classifiers are determined by experts. For feature extraction, many well-known handcrafted features, such as HoG, LBP, and the distance and angle relations between landmarks, are used, and pre-trained classifiers, such as SVM, AdaBoost, and random forest, are used for FE recognition based on the extracted features. Conventional approaches require relatively less computing power and memory than deep-learning-based approaches. Therefore, these approaches are still being studied for use in real-time embedded systems because of their low computational complexity and high degree of accuracy [ 22 ]. However, the feature extraction and the classifiers must be designed by the programmer, and they cannot be jointly optimized to improve performance [ 36 , 37 ].

Table 2 summarizes the representative conventional FER approaches and their main advantages.

Table 2. A summary of the representative conventional FER approaches and their main advantages.

3. Deep-Learning Based FER Approaches

In recent decades, there has been a breakthrough in deep-learning algorithms applied to the field of computer vision, including a CNN and recurrent neural network (RNN). These deep-learning-based algorithms have been used for feature extraction, classification, and recognition tasks. The main advantage of a CNN is to completely remove or highly reduce the dependence on physics-based models and/or other pre-processing techniques by enabling “end-to-end” learning directly from input images [ 44 ]. For these reasons, CNN has achieved state-of-the-art results in various fields, including object recognition, face recognition, scene understanding, and FER.

A CNN contains three types of heterogeneous layers: convolution layers, max-pooling layers, and fully connected layers, as shown in Figure 2 . Convolutional layers take images or feature maps as the input and convolve these inputs with a set of filter banks in a sliding-window manner to output feature maps that represent a spatial arrangement of the facial image. The weights of the convolutional filters within a feature map are shared, and the inputs of the feature map layer are locally connected [ 45 ]. Subsampling layers lower the spatial resolution of the representation by averaging or max-pooling the given input feature maps to reduce their dimensions and thereby ignore variations from small shifts and geometric distortions [ 45 , 46 ]. The last fully connected layers of a CNN structure compute the class scores over the entire original image. Most deep-learning-based methods [ 46 , 47 , 48 , 49 ] have adopted a CNN directly for AU detection.
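The following sketch shows a minimal CNN with this convolution / max-pooling / fully connected structure for FER; the input size, channel counts, and number of classes are illustrative assumptions, not a network from the cited works.

```python
# Minimal sketch of a convolution / max-pooling / fully connected CNN
# for 48 x 48 gray-scale face crops (sizes are illustrative).
import torch
import torch.nn as nn

class SimpleFERNet(nn.Module):
    def __init__(self, num_classes: int = 7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 48 -> 24
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 24 -> 12
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 12 * 12, 128), nn.ReLU(),
            nn.Linear(128, num_classes),          # class scores; softmax applied at inference
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SimpleFERNet()
scores = model(torch.randn(1, 1, 48, 48))   # one placeholder face crop
print(torch.softmax(scores, dim=1))         # per-class probabilities
```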

Breuer and Kimmel [ 47 ] employed CNN visualization techniques to understand a model learned using various FER datasets, and demonstrated the capability of networks trained on emotion detection, across both datasets and various FER-related tasks. Jung et al. [ 48 ] used two different types of CNN: the first extracts temporal appearance features from the image sequences, whereas the second extracts temporal geometry features from temporal facial landmark points. These two models are combined using a new integration method to boost the performance of facial expression recognition.

Zhao et al. [ 49 ] proposed deep region and multi-label learning (DRML), which is a unified deep network. DRML is a region layer that uses feed-forward functions to induce important facial regions, and forces the learned weights to capture structural information of the face. The complete network is end-to-end trainable, and automatically learns representations robust to variations inherent within a local region.

As we determined in our review, many approaches have adopted a CNN directly for FER use. However, because CNN-based methods cannot reflect temporal variations in the facial components, a recent hybrid approach combining a CNN for the spatial features of individual frames and long short-term memory (LSTM) for the temporal features of consecutive frames was developed. LSTM is a special type of RNN capable of learning long-term dependencies, and it is explicitly designed to avoid the long-term dependency problem. Like all recurrent neural networks, an LSTM has a chain-like structure, although the repeating module has a different structure consisting of four interacting layers, as shown in Figure 4 [ 50 ]:

  • The cell state is a horizontal line running through the top of the diagram, as shown in Figure 4 . An LSTM has the ability to remove or add information to the cell state.
  • A forget gate layer is used to decide what information to discard from the cell state.
  • An input gate layer is used to decide which values will be updated in the cell.
  • An output gate layer provides outputs based on the cell state.

Figure 4. The basic structure of an LSTM, adapted from [ 50 ]. (a) One LSTM cell contains four interacting layers: the cell state, an input gate layer, a forget gate layer, and an output gate layer. (b) The repeating module of cells in an LSTM.

The LSTM or RNN model for modeling sequential images has two advantages compared to standalone approaches. First, LSTM models are straightforward in terms of fine-tuning end-to-end when integrated with other models such as a CNN. Second, an LSTM supports both fixed-length and variable-length inputs or outputs [ 51 ].

The representative studies using a combination of a CNN and an LSTM (RNN) include the following:

Kahou et al. [ 11 ] proposed a hybrid RNN-CNN framework for propagating information over a sequence using a continuously valued hidden-layer representation. In this work, the authors presented a complete system for the 2015 Emotion Recognition in the Wild (EmotiW) Challenge [ 52 ], and proved that a hybrid CNN-RNN architecture for a facial expression analysis can outperform a previously applied CNN approach using temporal averaging for aggregation.

Kim et al. [ 13 ] utilized representative expression-states (e.g., the onset, apex, and offset of expressions), which can be specified in facial sequences regardless of the expression intensity. The spatial image characteristics of the representative expression-state frames are learned using a CNN. In the second part, temporal characteristics of the spatial feature representation in the first part are learned using an LSTM of the facial expression.

Chu et al. [ 53 ] proposed a multi-level facial AU detection algorithm combining spatial and temporal features. First, the spatial representations are extracted using a CNN, which is able to reduce person-specific biases caused by handcrafted descriptors (e.g., HoG and Gabor). To model the temporal dependencies, LSTMs are stacked on top of these representations, regardless of the lengths of the input video sequences. The outputs of CNNs and LSTMs are further aggregated into a fusion network to produce a per-frame prediction of 12 AUs.

Hasani and Mahoor [ 54 ] proposed the 3D Inception-ResNet architecture followed by an LSTM unit that together extracts the spatial relations and temporal relations within the facial images between different frames in a video sequence. Facial landmark points are also used as inputs of this network, emphasizing the importance of facial components rather than facial regions, which may not contribute significantly to generating facial expressions.

Graves et al. [ 55 ] used a recurrent network to consider the temporal dependencies present in the image sequences during classification. In experimental results using two types of LSTM (bidirectional LSTM and unidirectional LSTM), this study proved that a bidirectional network provides a significantly better performance than a unidirectional LSTM.

Jain et al. [ 56 ] proposed a multi-angle optimal pattern-based deep learning (MAOP-DL) method to rectify the problem of sudden changes in illumination, and find the proper alignment of the feature set by using multi-angle-based optimal configurations. Initially, this approach subtracts the background and isolates the foreground from the images, and then extracts the texture patterns and the relevant key features of the facial points. The relevant features are then selectively extracted, and an LSTM-CNN is employed to predict the required label for the facial expressions.

Unlike conventional approaches, deep-learning-based approaches determine the features and classifiers through deep neural networks rather than by human experts. Deep-learning-based approaches extract optimal features with the desired characteristics directly from data using deep convolutional neural networks. However, it is not easy to collect an amount of facial emotion training data, covering sufficiently different conditions, that is large enough to train deep neural networks. Moreover, deep-learning-based approaches require more powerful and massive computing devices than conventional approaches for training and testing [ 35 ]. Therefore, it is necessary to reduce the computational burden of deep-learning algorithms at inference time.

Among the many approaches based on a standalone CNN or combination of LSTM and CNN, some representative works are shown in Table 3 .

Table 3. Summary of FER systems based on deep learning.

As determined through our review conducted thus far, the general frameworks of the hybrid CNN-LSTM and CNN-RNN-based FER approaches have similar structures, as shown in Figure 5 . In summary, the basic framework of CNN-LSTM (RNN) is to combine an LSTM with a deep hierarchical visual feature extractor such as a CNN model. Therefore, this hybrid model can learn to recognize and synthesize temporal dynamics for tasks involving sequential images. As shown in Figure 5 , each visual feature determined through a CNN is passed to the corresponding LSTM, and produces a fixed or variable-length vector representation. The outputs are then passed into a recurrent sequence-learning module. Finally, the predicted distribution is computed by applying softmax [ 51 , 53 ].
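A minimal version of this generic CNN-LSTM pipeline might look like the sketch below, where a shared CNN encodes each frame and an LSTM aggregates the per-frame features; the layer sizes and clip length are assumptions, and the fusion network of [ 53 ] is omitted.

```python
# Sketch of the generic CNN-LSTM pipeline: a shared CNN encodes each frame,
# an LSTM models the frame sequence, and softmax yields the class distribution.
import torch
import torch.nn as nn

class CNNLSTMFER(nn.Module):
    def __init__(self, num_classes: int = 7, feat_dim: int = 128, hidden: int = 64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(32 * 12 * 12, feat_dim), nn.ReLU(),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, clips):                   # clips: (batch, time, 1, 48, 48)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1))   # per-frame spatial features
        feats = feats.view(b, t, -1)
        out, _ = self.lstm(feats)               # temporal modeling over the frame sequence
        return self.fc(out[:, -1])              # classify from the last time step

model = CNNLSTMFER()
logits = model(torch.randn(2, 10, 1, 48, 48))   # 2 placeholder clips of 10 frames each
print(torch.softmax(logits, dim=1).shape)
```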

Figure 5. Overview of the general hybrid deep-learning framework for FER. The outputs of the CNNs and LSTMs are further aggregated into a fusion network to produce a per-frame prediction, adapted from [ 53 ].

4. Brief Introduction to FER Database

In the field of FER, numerous databases have been used for comparative and extensive experiments. Traditionally, human facial emotions have been studied using either 2D static images or 2D video sequences. A 2D-based analysis has difficulty handling large pose variations and subtle facial behaviors. The analysis of 3D facial emotions will facilitate an examination of the fine structural changes inherent in spontaneous expressions [ 40 ]. Therefore, this sub-section briefly introduces some popular databases related to FER consisting of 2D and 3D video sequences and still images:

  • The Extended Cohn-Kanade Dataset (CK+) [ 10 ]: CK+ contains 593 video sequences of both posed and non-posed (spontaneous) emotions, along with additional types of metadata. The age range of its 123 subjects is from 18 to 30 years, most of whom are female. Image sequences may be analyzed for both action units and prototypic emotions. It provides protocols and baseline results for facial feature tracking, AUs, and emotion recognition. The images have pixel resolutions of 640 × 480 and 640 × 490 with 8-bit precision for gray-scale values.
  • Compound Emotion (CE) [ 17 ]: CE contains 5060 images corresponding to 22 categories of basic and compound emotions for its 230 human subjects (130 females and 100 males, mean age of 23). Most ethnicities and races are included, including Caucasian, Asian, African, and Hispanic. Facial occlusions are minimized, with no glasses or facial hair. Male subjects were asked to shave their faces as cleanly as possible, and all participants were also asked to uncover their forehead to fully show their eyebrows. The photographs are color images taken using a Canon IXUS with a pixel resolution of 3000 × 4000.
  • Denver Intensity of Spontaneous Facial Action Database (DISFA) [ 38 ]: DISFA consists of 130,000 stereo video frames at high resolution (1024 × 768) of 27 adult subjects (12 females and 15 males) with different ethnicities. The intensities of the AUs (0–5 scale) for all video frames were manually scored using two human experts in FACS. The database also includes 66 facial landmark points for each image in the database. The original size of each facial image is 1024 pixels × 768 pixels.
  • Binghamton University 3D Facial Expression (BU-3DFE) [ 40 ]: Because 2D still images of faces are commonly used in FER, Yin et al. [ 40 ] at Binghamton University proposed a database of annotated 3D facial expressions, namely BU-3DFE. It was designed for research on 3D human faces and facial expressions, and for the development of a general understanding of human behavior. It contains a total of 100 subjects, 56 females and 44 males, displaying six emotions. There are 25 3D facial emotion models per subject in the database, and a set of 83 manually annotated facial landmarks associated with each model. The original size of each facial image is 1040 pixels × 1329 pixels.
  • Japanese Female Facial Expressions (JAFFE) [ 41 ]: The JAFFE database contains 213 images of seven facial emotions (six basic facial emotions and one neutral) posed by ten different female Japanese models. Each image was rated based on six emotional adjectives using 60 Japanese subjects. The original size of each facial image is 256 pixels × 256 pixels.
  • Extended Yale B face (B+) [ 42 ]: This database consists of a set of 16,128 facial images taken under a single light source, and contains 28 distinct subjects for 576 viewing conditions, including nine poses for each of 64 illumination conditions. The original size of each facial image is 320 pixels × 243 pixels.
  • MMI [ 43 ]: MMI consists of over 2900 video sequences and high-resolution still images of 75 subjects. It is fully annotated for the presence of AUs in the video sequences (event coding), and partially coded at the frame-level, indicating for each frame whether an AU is in a neutral, onset, apex, or offset phase. It contains a total of 238 video sequences on 28 subjects, both males and females. The original size of each facial image is 720 pixels × 576 pixels.
  • Binghamton-Pittsburgh 3D Dynamic Spontaneous (BP4D-Spontanous) [ 58 ]: BP4D-spontanous is a 3D video database that includes a diverse group of 41 young adults (23 women, 18 men) with spontaneous facial expressions. The subjects were 18–29 years in age. Eleven are Asian, six are African-American, four are Hispanic, and 20 are Euro-Americans. The facial features were tracked in the 2D and 3D domains using both person-specific and generic approaches. The database promotes the exploration of 3D spatiotemporal features during subtle facial expressions for a better understanding of the relation between pose and motion dynamics in facial AUs, as well as a deeper understanding of naturally occurring facial actions. The original size of each facial image is 1040 pixels × 1329 pixels.
  • The Karolinska Directed Emotional Face (KDEF) [ 59 ]: This database contains 4900 images of human emotional facial expressions. The database consists of 70 individuals, each displaying seven different emotional expressions photographed from five different angles. The original size of each facial image is 562 pixels × 762 pixels.

Table 4 shows a summary of these publicly available databases.

Table 4. A summary of publicly available databases related to FER.

Figure 6 shows examples of the nine databases for FER with 2D and 3D images and video sequences.

Figure 6. Examples of nine representative databases related to FER. Databases (a) through (g) support 2D still images and 2D video sequences, and databases (h) through (i) support 3D video sequences.

Unlike the databases described above, the MPI facial expression database [ 60 ] collects a large variety of natural emotional and conversational expressions under the assumption that people understand emotions by analyzing both conversational and emotional expressions. This database consists of more than 18,800 video sequence samples from 10 female and nine male models displaying various facial expressions recorded from one frontal and two lateral views.

Recently, other sensors, such as NIR cameras, thermal cameras, and Kinect sensors, have attracted interest in FER research because visible light images change easily with the environmental illumination conditions. As a database captured with an NIR camera, the Oulu-CASIA NIR&VIS facial expression database [ 31 ] consists of six expressions from 80 people between 23 and 58 years old; 73.8% of the subjects are males. The natural visible and infrared facial expression (USTC-NVIE) database [ 32 ] collected both spontaneous and posed expressions of more than 100 subjects simultaneously using a visible and an infrared thermal camera. The facial expressions and emotions database (FEEDB) is a multimodal database of facial expressions and emotions recorded using the Microsoft Kinect sensor; it contains 1650 recordings of 50 persons posing 33 different facial expressions and emotions [ 33 ].

As described here, various sensors other than the camera are used for FER, but a single sensor alone limits further improvement of the recognition performance. Therefore, attempts to improve FER through combinations of various sensors are expected to continue in the future.

5. Performance Evaluation of FER

Given the FER approaches, evaluation metrics of the FER approaches are crucial because they provide a standard for a quantitative comparison. In this section, a brief review of publicly available evaluation metrics and a comparison with the benchmark results are provided.

5.1. Subject-Independent and Cross-Database Tasks

Many approaches evaluate accuracy using two different experimental protocols: subject-independent and cross-database tasks [ 55 ]. First, a subject-independent task splits each database into training and validation sets in a strict subject-independent manner. This task is also called K-fold cross-validation. The purpose of K-fold cross-validation is to limit problems such as overfitting and provide insight into how the model will generalize to an independent unknown dataset [ 61 ]. With the K-fold cross-validation technique, each dataset is evenly partitioned into K folds with exclusive subjects. A model is then iteratively trained using K-1 folds and evaluated on the remaining fold, until all subjects have been tested. Validation is typically conducted using less than 20% of the training subjects. The accuracy is estimated by averaging the recognition rate over the K folds. For example, in ten-fold cross-validation, nine folds are used for training and one fold is used for testing. After this process is performed ten different times, the accuracies of the ten results are averaged and defined as the classifier performance.
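A subject-independent split of this kind can be implemented with a group-aware K-fold splitter, as in the hedged sketch below; the features, labels, subject IDs, and SVM classifier are placeholders rather than a specific published setup.

```python
# Sketch: subject-independent K-fold evaluation. Folds are split by subject ID
# so that no subject appears in both the training and test sets (scikit-learn).
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.svm import SVC

X = np.random.rand(200, 64)        # placeholder feature vectors
y = np.random.randint(0, 7, 200)   # placeholder emotion labels
subjects = np.arange(200) % 20     # placeholder subject ID of each sample (20 subjects)

scores = []
for train_idx, test_idx in GroupKFold(n_splits=10).split(X, y, groups=subjects):
    clf = SVC(kernel="rbf").fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))

print("mean accuracy over folds:", np.mean(scores))
```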

The second protocol is a cross-database task. In this task, one dataset is used entirely for testing the model, and the remaining datasets listed in Table 4 are used to train the model. The model is iteratively trained using K-1 datasets and evaluated on the remaining dataset repeatedly until all datasets have been tested. The accuracy is estimated by averaging the recognition rate over K datasets in a manner similar to K-fold cross-validation.

5.2. Evaluation Metrics

The evaluation metrics of FER are classified into four methods using different attributes: precision, recall, accuracy, and F1-score.

The precision (P) is defined as TP/(TP + FP), and the recall (R) is defined as TP/(TP + FN), where TP is the number of true positives in the dataset, FN is the number of false negatives, and FP is the number of false positives. The precision is the fraction of automatic annotations of emotion i that are correctly recognized. The recall is the number of correct recognitions of emotion i over the actual number of images with emotion i [ 18 ]. The accuracy is the ratio of true outcomes (both true positive to true negative) to the total number of cases examined.

Another metric, the F1-score, is divided into two metrics depending on whether they use spatial or temporal data: the frame-based F1-score (F1-frame) and the event-based F1-score (F1-event). Each metric captures different properties of the results. This means that the frame-based F-score has predictive power in terms of spatial consistency, whereas the event-based F-score has predictive power in terms of temporal consistency [ 62 ]. The frame-based F1-score is defined as F1-frame = 2PR/(P + R), where P and R are the frame-based precision and recall defined above.

An event-based F1-score is used to measure the emotion recognition performance at the segment level because emotions occur as a temporal signal.

where ER and EP are event-based recall and precision. ER is the ratio of correctly detected events over the true events, while the EP is the ratio of correctly detected events over the detected events. F1-event considers that there is an event agreement if the overlap is above a certain threshold [ 63 ].
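For concreteness, the frame-based metrics can be computed directly from predicted and true labels, while the event-based F1-score requires matching predicted expression segments to ground-truth segments. The sketch below uses scikit-learn for the frame-level scores; the event-matching rule (an event counts as detected when its temporal overlap exceeds a threshold) is a simplifying assumption standing in for the exact criterion of [ 63 ].

from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

def frame_metrics(y_true, y_pred):
    # Frame-level precision, recall, accuracy, and F1 averaged over emotion classes.
    return {
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall": recall_score(y_true, y_pred, average="macro"),
        "accuracy": accuracy_score(y_true, y_pred),
        "f1_frame": f1_score(y_true, y_pred, average="macro"),
    }

def _overlap(true_ev, pred_ev):
    # Fraction of the true event (start_frame, end_frame) covered by the detected event.
    inter = max(0, min(true_ev[1], pred_ev[1]) - max(true_ev[0], pred_ev[0]))
    return inter / max(1, true_ev[1] - true_ev[0])

def event_f1(true_events, pred_events, overlap_thr=0.5):
    # true_events / pred_events: lists of (start_frame, end_frame) expression segments.
    tp = sum(any(_overlap(t, p) >= overlap_thr for p in pred_events) for t in true_events)
    er = tp / max(1, len(true_events))   # event-based recall (ER)
    ep = tp / max(1, len(pred_events))   # event-based precision (EP)
    return 0.0 if er + ep == 0 else 2 * er * ep / (er + ep)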

5.3. Evaluation Results

To show a direct comparison between conventional handcrafted-feature-based approaches and deep-learning-based approaches, this review lists public results on the MMI dataset. Table 5 shows the comparative recognition rate of six conventional approaches and six deep-learning-based approaches.

Recognition performance with MMI dataset, adapted from [ 11 ].

As shown in Table 5 , deep-learning-based approaches outperform conventional approaches, with an average accuracy of 72.65% versus 63.2%. Among the conventional FER approaches, reference [ 68 ] achieves the highest performance. That study computed the difference information between the peak expression face and its intra-class variation in order to reduce the effect of facial identity during feature extraction. Because this feature extraction is robust to face rotation and misalignment, the study achieves more accurate FER than the other conventional methods. Among the deep-learning-based approaches, two achieve relatively high performance compared with several state-of-the-art methods. The complex CNN proposed in [ 72 ] consists of two convolutional layers, each followed by max pooling, and four Inception layers; this single-component architecture takes registered facial images as input and classifies them into one of the six basic expressions or the neutral expression. The highest-performing approach [ 13 ] also consists of two parts: in the first, the spatial characteristics of representative expression-state frames are learned using a CNN, and in the second, the temporal characteristics of the spatial feature representation are learned using an LSTM. The accuracy of this hybrid spatio-temporal representation learning indicates that FER performance is largely affected not only by spatial changes but also by temporal changes.
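The hybrid idea behind the strongest entries in Table 5 can be illustrated with a compact model in which a small CNN encodes the spatial features of each frame and an LSTM models their temporal evolution. The layer sizes, the 64 × 64 input resolution, and the seven output classes below are illustrative assumptions and do not reproduce the exact architectures of [ 72 ] or [ 13 ].

import torch
import torch.nn as nn

class CnnLstmFER(nn.Module):
    def __init__(self, num_classes=7, feat_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(                                   # per-frame spatial encoder
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(64 * 4 * 4, feat_dim), nn.ReLU(),
        )
        self.lstm = nn.LSTM(feat_dim, 64, batch_first=True)         # temporal model over frames
        self.head = nn.Linear(64, num_classes)                      # 6 basic + neutral expressions

    def forward(self, clips):                                       # clips: (batch, frames, 3, 64, 64)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.reshape(b * t, *clips.shape[2:]))    # spatial features per frame
        _, (h_n, _) = self.lstm(feats.reshape(b, t, -1))            # temporal evolution
        return self.head(h_n[-1])                                   # class logits

logits = CnnLstmFER()(torch.randn(2, 16, 3, 64, 64))                # -> shape (2, 7)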

Although deep-learning-based FER approaches have achieved great success in experimental evaluations, a number of issues remain that deserve further investigation:

  • A large-scale dataset and massive computing power are required for training as the structure becomes increasingly deep.
  • Large numbers of manually collected and labeled datasets are needed.
  • Large memory is demanded, and the training and testing are both time consuming. These memory demands and computational complexities make deep learning ill-suited for deployment on mobile platforms with limited resources [ 73 ].
  • Considerable skill and experience are required to select suitable hyperparameters, such as the learning rate, the kernel sizes of the convolutional filters, and the number of layers. These hyperparameters have internal dependencies that make tuning particularly expensive.
  • Although they work quite well for various applications, a solid theory of CNNs is still lacking, and thus users essentially do not know why or how they work.

6. Conclusions

This paper presented a brief review of FER approaches. As described, such approaches can be divided into two main streams. Conventional FER approaches consist of three steps, namely face and facial component detection, feature extraction, and expression classification, with classifiers such as SVM, Adaboost, and random forest. By contrast, deep-learning-based FER approaches greatly reduce the dependence on face-physics-based models and other pre-processing techniques by enabling "end-to-end" learning directly from the input images. As a particular type of deep learning, a CNN visualizes the input images to help understand the model learned through various FER datasets, and demonstrates the capability of networks trained on emotion detection across both datasets and various FER-related tasks. However, because CNN-based FER methods cannot reflect the temporal variations in the facial components, hybrid approaches have been proposed that combine a CNN, for the spatial features of individual frames, with an LSTM, for the temporal features of consecutive frames. A few recent studies have analyzed such hybrid CNN-LSTM (RNN) architectures for facial expressions and shown that they can outperform previously applied CNN approaches that use temporal averaging for aggregation. Nevertheless, deep-learning-based FER approaches still have a number of limitations, including the need for large-scale datasets, massive computing power, and large amounts of memory, as well as time-consuming training and testing phases. Moreover, although hybrid architectures have shown superior performance, micro-expressions remain a challenging problem because they are more spontaneous and subtle facial movements that occur involuntarily.

This paper also briefly introduced some popular databases related to FER consisting of both video sequences and still images. In a traditional dataset, human facial expressions have been studied using either static 2D images or 2D video sequences. However, because a 2D-based analysis has difficulty handling large variations in pose and subtle facial behaviors, recent datasets have considered 3D facial expressions to better facilitate an examination of the fine structural changes inherent to spontaneous expressions.

Furthermore, evaluation metrics for FER approaches were introduced to provide standard measures for comparison. Such metrics are widely used in the recognition field, with precision and recall being the most common. However, new evaluation methods should be proposed for recognizing consecutive facial expressions and for micro-expression recognition in moving images.

Although studies on FER have been conducted over the past decade, in recent years FER performance has been significantly improved through a combination of deep-learning algorithms. Because FER is an important way to infuse emotion into machines, it is encouraging that various studies on its future applications are being conducted. If emotion-oriented deep-learning algorithms can be developed and combined with additional Internet-of-Things sensors in the future, FER is expected to improve its current recognition rate, even for spontaneous micro-expressions, to the level of human beings.

Acknowledgments

This research was supported by the Scholar Research Grant of Keimyung University in 2017.

Author Contributions

Byoung Chul Ko conceived the idea, designed the architecture and finalized the paper.

Conflicts of Interest

The author declares no conflict of interest.
