\documentclass[a4paper, 11pt, onecolumn]{article}
%% Language and font encodings. This says how to do hyphenation on end of lines.
\usepackage[english]{babel}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{aas_macros}
\usepackage{multicol}
%% Sets page size and margins. You can edit this to your liking
\usepackage[top=0cm, bottom=2.0cm, outer=2.5cm, inner=2.5cm, heightrounded,
marginparwidth=1cm, marginparsep=1cm, margin=1.5cm]{geometry}
%% Useful packages
\usepackage{caption}
\usepackage{graphicx} %allows you to use jpg or png images. PDF is still recommended
\graphicspath{ {./Images/} }
\usepackage[colorlinks=false]{hyperref} % add links inside PDF files
\usepackage{amsmath} % Math fonts
\usepackage{amsfonts} %
\usepackage{amssymb} %
\usepackage{multirow}
%% Citation package
\usepackage[square,numbers]{natbib}
\bibliographystyle{abbrvnat}
\title{CONVOLUTIONAL NEURAL NETWORKS FOR\\ FACIAL EMOTION RECOGNITION}
\author{\small{\rm Hoang-Anh Le}\\VNU-HCM University of Science\\1612013@student.hcmus.edu.vn
\and {\rm Anh-Quoc Pham}\\VNU-HCM University of Science\\1612543@student.hcmus.edu.vn
\and {\rm Thien-Nu Hoang}\\VNU-HCM University of Science\\1612880@student.hcmus.edu.vn}
\date{}
\begin{document}
\maketitle
\begin{multicols}{2}
\begin{abstract}
Facial emotion recognition plays an important role in making human--machine interaction more intelligent and natural, and in automating surveys and research in human behavior, health care, and robotics. Given its significance, we implement convolutional neural networks (CNNs) for this problem. We build two CNN models, a shallow CNN and a deep CNN, which differ in their number of layers and are trained on the Kaggle dataset. The highest accuracy, 65.55\%, belongs to the deep CNN. Our motivation is to understand deep learning, particularly CNNs, clearly and to apply it to a real-life problem. Therefore, we also tuned the hyperparameters of each model, such as learning rate, batch size, and number of epochs. In addition, we used optimization techniques such as activation functions, dropout, and max pooling. Finally, we analyzed the results of the two models to observe the relationship between the number of layers and accuracy. We believe our results will be valuable for deciding on a network's structure before constructing it.
\end{abstract}
\section{Introduction}
\par With the growing need for human--computer interaction, emotion recognition plays an important role in computer science. There are many ways to recognize emotion, including voice, body gesture, and especially facial expression, which is the most important way for humans to display emotions. In fact, facial emotion recognition has been applied in a variety of fields, such as customer-attentive marketing, health monitoring, and emotionally intelligent robotic interfaces. Therefore, research on facial emotion recognition has increasingly attracted scientists in computer vision.
\par In a 1971 paper titled ``Constants Across Cultures in the Face and Emotion'', Ekman et al. identified six facial expressions that are universal across all cultures: anger, disgust, fear, happiness, sadness, and surprise. In recent years, research challenges such as Emotion Recognition in the Wild (EmotiW) and Kaggle's Facial Expression Recognition Challenge have added a seventh emotion, neutral, to this list for classification.
\par The first successful applications of CNNs were developed by Yann LeCun in the 1990s. Of these, the best known is the LeNet architecture, which was used to read zip codes and digits \cite{LeCun}. Since then, CNNs have been continually developed by the scientific community. In computer vision especially, there are many momentous works that use the CNN approach, such as AlexNet \cite{AlexNet}, VGG-Net \cite{VGG-Net}, GoogLeNet \cite{GoogleNet}, and ResNet \cite{ResNet}. Besides, there are contributions from public challenges, typically the Facial Emotion Recognition challenge on Kaggle (2013) and the Emotion Recognition in the Wild challenge (2015).
\par In this paper, we apply the CNN approach to facial emotion recognition. The input to our system is an image from the Kaggle dataset; we then use CNNs to train and predict the facial expression label, one of angry, disgust, fear, happy, sad, surprise, and neutral. We built distinct CNN systems with various numbers of layers to find the best performance, reaching a highest accuracy of 65.55\%. This is acceptable, as the winner of the FER2013 challenge achieved 71.162\% accuracy. Beyond accuracy, we also found some interesting properties of CNNs. Although our results and methods are not the best, they gave us a clear understanding of deep learning, made it easier to implement, and provide basic knowledge for our future work.
\par We mention some related work in Section \ref{SecWork}. The dataset is described in Section \ref{SecDataset}. Section \ref{SecMethod} shows exactly what we did and describes our CNN models in detail. Section \ref{SecResult} presents our results. Finally, we present conclusions and future work in Section \ref{SecConclusion}.
\section{Related Work} \label{SecWork}
From 2000 to 2018, many researchers developed facial emotion recognition systems. There have been many approaches to this problem, from traditional approaches using handcrafted features to deep-learning-based approaches \cite{reason}. However, the CNN method has been used in most public challenges and has yielded high accuracy. In fact, in the FER2013 challenge, the winner, Yichuan Tang, used an ensemble of CNNs trained to minimize the squared hinge loss and achieved 71.162\% accuracy on the test set \cite{kaggle}. \par
In more recent work, Bo-Kyeong Kim et al. won the third Emotion Recognition in the Wild challenge (EmotiW2015) with a test accuracy of 61.6\%. They used a large committee of CNNs with two strategies: varying the network architecture (e.g., input preprocessing and receptive field size) to obtain more diverse models, and constructing a hierarchical architecture of the committee with exponentially-weighted decision fusion to form a better committee \cite{emoti}.
\includegraphics[width=.4\textwidth]{Data}
\captionof{figure}{Example of seven emotions in FER2013 dataset: (0) angry, (1) disgusted, (2) fearful, (3) happy, (4) sad, (5) surprised, (6) neutral}
\label{dataImage}
\includegraphics[width = .5 \textwidth]{Training}
\captionof{figure}{Overview FER2013 data}
\label{dataPlot}
\section{Dataset} \label{SecDataset}
We trained and tested our model on the Kaggle dataset from the Facial Expression Recognition Challenge, which consists of 48x48-pixel grayscale images of faces. The faces have been automatically registered so that each face is more or less centered and occupies about the same amount of space in each image. We use a training set of 28,709 examples, a validation set of 3,589 examples, and a test set of another 3,589 examples. \cite{data}
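The Kaggle release distributes each image as a space-separated string of $48 \times 48 = 2304$ pixel values rather than as image files. A minimal sketch of how one such string can be decoded into an array suitable for training (the helper name and the scaling to $[0,1]$ are our own choices, not part of the dataset):

```python
import numpy as np

# Hypothetical helper: decode one row's space-separated pixel string
# (the format used in the Kaggle FER2013 release) into a 48x48 image,
# scaled to [0, 1] for training.
def parse_pixels(pixel_string, size=48):
    values = np.array(pixel_string.split(), dtype=np.float32)
    assert values.size == size * size, "expected a full 48x48 image"
    return values.reshape(size, size) / 255.0
```

Before being fed to the network, each decoded image gets an extra channel axis to match the (48, 48, 1) input shape used below.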
\par
In Figure \ref{dataImage}, we show images of the seven emotions: angry, disgust, fear, happy, sad, surprised, and neutral. Although all images are preprocessed, we can see that they cover individuals across the entire spectrum of ethnicity, race, and gender, with images taken at various angles.
\par
The plot in Figure \ref{dataPlot} shows the number of images for each emotion. Disgust clearly has the least data (436 images); we expect this to affect the experimental results.
\section{Method} \label{SecMethod}
\subsection{Overview}
\par We use convolutional neural networks (CNNs) to recognize the seven facial emotion expressions. In the CNN approach, the input image is convolved with a collection of filters in the convolution layers to produce feature maps. The feature maps are then combined in a fully connected network, and the facial expression is recognized as belonging to a particular class based on the output of the softmax function. There are two main reasons why we chose the CNN approach:
\begin{itemize}
\item CNNs are the most popular of the several deep-learning models available. Understanding CNNs will help us develop our deep-learning research in the future. \cite{reason}
\item Using deep learning for facial emotion recognition greatly reduces the dependence on face-physics-based models and other pre-processing techniques by enabling ``end-to-end'' learning directly from the input images. \cite{reason}
\end{itemize}
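The softmax step mentioned above turns the network's seven final-layer scores into class probabilities, and the predicted label is the class with the highest probability. A minimal numpy sketch (the label ordering follows the dataset's labels; the function names are illustrative):

```python
import numpy as np

EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

def softmax(scores):
    # Subtract the max score for numerical stability before exponentiating.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

def predict_emotion(scores):
    # The predicted label is the class with the highest probability.
    return EMOTIONS[int(np.argmax(softmax(scores)))]
```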
\par To reach the goal of this paper, we implement two classifiers from scratch: (1) a shallow CNN with 2 convolutional layers and (2) a deep CNN with 4 convolutional layers. For each of these models, we tuned parameters including learning rate, regularization, and dropout. We also tried batch normalization and fractional pooling to reduce training time.
\par Based on the results of the two models, we compare their loss and accuracy to understand exactly how CNNs work.
\subsection{Shallow CNN}
\par This network has two convolutional layers and one fully connected (FC) layer. The first convolutional layer has 64 filters with a 3x3 kernel, `same' border mode, and an input shape of (48, 48, 1), since the input image is a 48x48-pixel grayscale image. The second convolutional layer differs slightly from the first: it has 128 5x5 filters. Each convolutional layer is also followed by batch normalization, a max-pooling layer, and dropout. Pooling is 2x2 with a stride of 2 to reduce the size of the receptive field and avoid overfitting. In the dropout layer, a fraction of 0.25 is used.
\par
After the two convolutional layers, the output is flattened and fed to an FC layer with 256 hidden neurons; the loss function is categorical cross-entropy over the softmax output.
\par
In all layers, the Rectified Linear Unit (ReLU) is used as the activation function to model non-linearity. ReLU is simple and gives high performance.
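A quick sanity check of this architecture is to track the feature-map shape layer by layer: `same' convolutions keep the 48x48 spatial size, while each 2x2 max-pool with stride 2 halves it. A small pure-Python sketch of this bookkeeping (an illustration we added, not part of the training code):

```python
def shallow_cnn_shapes(input_size=48):
    # conv1: 64 3x3 filters; 'same' padding keeps the 48x48 spatial size.
    size, channels = input_size, 64
    size //= 2                # 2x2 max-pool, stride 2: 48 -> 24
    # conv2: 128 5x5 filters, again with 'same' padding.
    channels = 128
    size //= 2                # pool: 24 -> 12
    flattened = size * size * channels  # input length of the 256-neuron FC layer
    return size, channels, flattened
```

So the FC layer receives a flattened vector of 12 x 12 x 128 = 18,432 values.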
\par
\includegraphics[width = .45 \textwidth]{Architecture_of_Shallow_CNN}
\captionof{figure}{Architecture of Shallow CNN}
\label{Shallow_CNN}
\subsection{Deep CNN}
To observe the effect of adding convolutional and FC layers to the network, we built a deep CNN with 4 convolutional layers and 2 FC layers. The first and second convolutional layers and the first FC layer are the same as in the shallow CNN. The third and fourth convolutional layers are identical: each has 512 3x3 filters, along with batch normalization, a max-pooling layer, a dropout layer, and ReLU as the activation function. The hidden layer of the second FC layer has more neurons than the first FC layer: 512 neurons.
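The deep network can be sketched in Keras; the layer counts and filter sizes follow the description above, while the dropout rate on the later blocks and other training settings (optimizer, learning rate) are assumptions, since they are not fixed here:

```python
from tensorflow.keras import layers, models

def build_deep_cnn(num_classes=7):
    # Four convolutional blocks: conv -> batch norm -> ReLU -> max-pool -> dropout.
    model = models.Sequential()
    model.add(layers.Input(shape=(48, 48, 1)))
    for filters, kernel in [(64, 3), (128, 5), (512, 3), (512, 3)]:
        model.add(layers.Conv2D(filters, kernel, padding="same"))
        model.add(layers.BatchNormalization())
        model.add(layers.Activation("relu"))
        model.add(layers.MaxPooling2D(pool_size=2, strides=2))
        model.add(layers.Dropout(0.25))  # 0.25 assumed, as in the shallow CNN
    model.add(layers.Flatten())
    # Two fully connected layers: 256 then 512 hidden units.
    model.add(layers.Dense(256, activation="relu"))
    model.add(layers.Dense(512, activation="relu"))
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model
```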
\par
\includegraphics[width = .5 \textwidth]{Architecture_of_Deep_CNN}
\captionof{figure}{Architecture of Deep CNN}
\label{Deep_CNN}
\section{Result} \label{SecResult}
\par In order to compare the results of the shallow and deep networks, we computed the confusion matrices for the two models, shown in Figure \ref{cmshallow} and Figure \ref{cmdeep}.
\includegraphics[width = .45 \textwidth]{cnfMatrix_CNNShllow}
\captionof{figure}{Confusion Matrix of Shallow CNN}
\label{cmshallow}
\includegraphics[width = .45 \textwidth]{cnfMatrix_DeepCNN}
\captionof{figure}{Confusion Matrix of Deep CNN}
\label{cmdeep}
\par We can see from the figures that the deep network gives more accurate results than the shallow network: most cells on the primary diagonal (correct recognitions) have higher values, and most cells off the primary diagonal (incorrect recognitions) have lower values. Moreover, we can also see which labels are easily misrecognized and confused with other labels. For example, in the shallow network, Anger, Fear, and Neutral are often recognized as Sad. The Disgust label has a small amount of data, so its accuracy is quite low and unstable.
In addition, we computed the table of accuracies (the percentage of correctly recognized cases for each label).
\begin{center}
\captionof{table}{Recognition Accuracy of each Label}
\begin{tabular}{|c c c|}
\hline
Label & Shallow Network & Deep Network \\ [0.5ex]
\hline\hline
Anger & 40.70\% & 51.24\% \\
\hline
Disgust & 37.5\% & 57.14\% \\
\hline
Fear & 28.49\% & 47.41\% \\
\hline
Happy & 77.71\% & 86.63\% \\
\hline
Sad & 63.61\% & 54.92\% \\
\hline
Surprise & 67.87\% & 76.7\% \\
\hline
Neutral & 48.46\% & 62.45\% \\
\hline
\textbf{Overall} & 56.31\% & 65.55\% \\[1ex]
\hline
\end{tabular}
\end{center}
The table shows that the Happy label has the highest correct recognition rate in both networks, and the deep network's percentage is higher than the shallow network's for most labels.
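Each per-label accuracy in the table is the corresponding diagonal entry of the confusion matrix divided by its row total (the number of test images with that true label). A minimal numpy sketch of this computation:

```python
import numpy as np

def per_label_accuracy(confusion):
    # confusion[i, j] = number of images with true label i predicted as label j.
    confusion = np.asarray(confusion, dtype=float)
    # Diagonal counts (correct recognitions) over row totals (images per true label).
    return np.diag(confusion) / confusion.sum(axis=1)
```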
\section{Conclusion} \label{SecConclusion}
In this paper, we have explored CNNs for facial expression recognition. First, we implemented a shallow CNN with two convolutional layers and obtained a low accuracy (56.31\%). In order to improve this network and obtain higher accuracy, we built a deep CNN by adding two convolutional layers to the shallow CNN. The highest accuracy we achieved is 65.55\%. We also found that some emotions are not recognized well because of a shortage of data.
\par Through this experiment, we learned how to implement a CNN model to solve a real-life problem. In the future, we would like to implement a deeper CNN with a parameterizable number of convolutional layers, and check whether more convolutional layers yield higher accuracy. Moreover, we would like to extend our model to color images by investigating pre-trained models such as VGG-Net \cite{VGG-Net} and AlexNet \cite{AlexNet}.
\bibliography{refs}
\end{multicols}
\end{document}