Chapter 1: Introduction

1.1. Summary
Human emotions facilitate everyday interactions between people and can be expressed through verbal and non-verbal cues, such as facial expressions and body language. Facial expressions are important cues for non-verbal communication because humans are able to recognize emotions from them accurately.
Emotion recognition is the process of identifying human emotion, which can be done using computational technologies. To improve emotion recognition technologies, socially intelligent software tools have been created, allowing a robot or machine to predict human emotions which in turn enhances its effectiveness in performing various tasks.
An automatic facial emotion recognition system is a vital component in human-machine interaction. Most of the facial expression recognition methods reported to date focus on recognizing the six primary expression categories: happiness, anger, sadness, fear, disgust and surprise.
Accurately understanding the emotional intention of others is very important in interpersonal communication. Emotional recognition dysfunction impairs interpersonal relationships and lowers quality of life.
1.2. Objectives
The main objective of this Emotion Recognition project is to develop an automated emotion recognition program that accurately predicts “pain” in hospitals, to assist nurses in monitoring patients such as the elderly, patients with Parkinson’s disease, and patients with speech difficulties or psychological disorders. To date, hospitals and doctors still use a pain scale (Figure 1) to assess a patient’s pain level, which is not very accurate and can easily be misused. As a first step towards this objective, different methodologies have been used for feature extraction and classification of the six basic human emotions.

Figure 1. Example of Pain Scale used in Hospitals

1.3. Emotions
Emotion is a complex, subjective experience accompanied by biological and behavioral changes. Emotion involves feeling, thinking, activation of the nervous system, physiological changes (e.g. pressure, heart pulse, etc.), and behavioral changes such as facial expressions.

When dealing with emotions-related subjects, researchers often take reference from Ekman’s work, reducing someone’s entire emotional experience to just six basic universal emotions: happiness, sadness, anger, disgust, surprise, and fear (Ekman P., 1970).
Figure 2. Six Universal Basic Emotions

1.4. Emotion Recognition
Facial expressions are one of the many ways in which we pick up on the emotions of other people, as they provide the building blocks for understanding emotions. Therefore, to help recognize emotions, images of facial expressions will be collected for training and prediction.

To address the problem of facial emotion recognition, several template matching methods have been proposed in the last decades. In most of the cases, the process of emotion recognition from human face images is divided into two main stages: feature extraction and classification. The main aim of feature extraction methods is to minimize intra-class variations and maximize inter-class variations.
The most important facial elements for human emotion recognition are eyes, eyebrows, nose, mouth and skin texture. Therefore, a vast majority of feature extraction methods focus on these features. The main purpose of the classification part is to differentiate the elements of different emotion classes to enhance emotion recognition accuracy.

1.4.1. Pre-existing Techniques of Emotion Recognition
1) Facial Action Coding System (FACS)
The Facial Action Coding System (FACS) is a system for describing human facial expressions. It was originally developed by Paul Ekman and Wallace V. Friesen, published in 1978, and has since undergone several revisions.
It is an anatomical system for describing all observable facial movement. It breaks down facial expressions into individual components of muscle movement and it is a common standard to systematically categorize the physical expression of emotions.
The FACS manual is a comprehensive description of facial behavior and it is self-instructional. It labels each observable component of facial movement as an Action Unit (AU). The FACS manual describes the criteria for observing and coding each AU. It also describes how AUs appear in combinations.

Figure 3. FACS Manual: a comprehensive description of facial behaviour

Although FACS only focuses on the positioning of the facial muscles, the relative positioning of the muscles is what researchers later used to help detect emotions. FACS codes are not usually used directly in present-day computer algorithms for detecting or recognizing emotions, but they have helped to form the psychological basis for the feature extraction stage, where the positions of the parts of the face are taken together to create a feature vector. Recently, FACS has also been implemented as an automated computer system that detects faces in videos, extracts the geometrical features of the faces, and then produces temporal profiles of each facial movement.

For Human-Computer Interaction (HCI) to reach greater heights, it is very important that the machine be able to understand our expressions. Examples of current technologies that have implemented FACS frameworks in their emotion recognition systems are Affectiva and nViso.
Figure 4. nViso using FACS to capture and measure facial muscles involved in expression of emotions in real-time, by tracking 170 facial points

2) Statistical Methods
Statistical methods are used in this project.
Statistical methods usually involve supervised machine learning algorithms, in which a large set of annotated data is fed into the algorithm so that the system learns to predict the appropriate emotion type. Machine learning uses algorithms to parse data, learn from that data, and make informed decisions based on what it has learned.

Two sets of data are required for this approach:
Training dataset – used to learn the attributes of the data
Testing dataset – used to validate the performance of the machine learning algorithm
Machine learning algorithms are generally able to provide better classification accuracy than other approaches. However, achieving superior classification results requires a sufficiently large training set.
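The split into training and testing data described above can be sketched as follows. This is only a minimal illustration using the Python standard library: the function name `split_dataset`, the file names, the labels and the 80/20 ratio are all assumptions made for the example, not values taken from this project.

```python
import random

def split_dataset(samples, test_fraction=0.2, seed=0):
    """Shuffle the samples and split them into (training, testing) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split repeatable
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical annotated data: (image file name, emotion label) pairs.
labelled = [("img_%02d.png" % i, "happy" if i % 2 else "sad") for i in range(10)]

train_set, test_set = split_dataset(labelled, test_fraction=0.2)
print(len(train_set), "training samples,", len(test_set), "testing samples")
```

Keeping the two sets disjoint matters: the testing images must not have been seen during training, otherwise the measured accuracy is too optimistic.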

In conventional facial emotion recognition approaches, recognition consists of four major stages, as shown in Figure 5: (1) image input, (2) face and facial component detection, (3) feature extraction, and (4) expression classification.
Given an input image, a face is first detected, and facial components (e.g., eyes and nose) or landmarks are detected within the face region. Second, various spatial and temporal features are extracted from the facial components. Third, pre-trained facial expression classifiers, such as a support vector machine (SVM) or a random forest, produce the recognition results using the extracted features.
Figure 5. Procedure used in conventional facial emotion recognition approaches: from input images (a), the face region and facial landmarks are detected (b), features are extracted from the face components and landmarks (c), and the facial expression is assigned to one of the facial categories using pre-trained classifiers (face images are taken from the CK+ dataset and Google Images)
Some of the most commonly used machine learning algorithms for classification include:
Support Vector Machines (SVM)
Principal Components Analysis (PCA)
Linear Discriminant Analysis (LDA)
Deep learning
For more information, refer to Chapter 3. Below is a brief explanation of each of these four.

Support Vector Machines (SVM)
SVMs are linear classifiers that maximize the margin between the decision hyperplane and the examples in the training set. So, an optimal hyperplane should minimize the classification error of the unseen test patterns.
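To make the notion of "margin" concrete, the following sketch computes the geometric margin of two candidate hyperplanes on a tiny two-class dataset. It does not train an SVM; it only evaluates the quantity that an SVM maximizes. The data points, the two hyperplanes and the function name `geometric_margin` are hypothetical and chosen purely for illustration.

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """Smallest signed distance of any sample to the hyperplane w.x + b = 0.

    A positive result means every sample lies on its correct side.
    """
    w = np.asarray(w, dtype=float)
    distances = y * (X @ w + b) / np.linalg.norm(w)
    return float(distances.min())

# Toy two-class data (hypothetical); labels are +1 and -1.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])

# Two candidate hyperplanes that both separate the two classes.
margin_a = geometric_margin([1.0, 1.0], -2.0, X, y)   # the line x1 + x2 = 2
margin_b = geometric_margin([1.0, 0.0], -1.5, X, y)   # the line x1 = 1.5

print("margin of hyperplane A:", margin_a)
print("margin of hyperplane B:", margin_b)
```

Both hyperplanes classify the toy data correctly, but hyperplane A leaves more room between the boundary and the nearest samples, which is why a maximum-margin classifier would prefer it.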
Figure 6. Maximum-margin hyperplane and margins for an SVM trained with samples from two classes. Samples on the margin are called the support vectors.

Principal Component Analysis (PCA)
As a self-organized learning method, principal component analysis (PCA) is widely used in data compression and feature extraction. There are two basic approaches to computing principal components: batch and adaptive methods. The batch methods include eigendecomposition and singular value decomposition (SVD), while the adaptive methods are mainly implemented with neural networks. In my project, the batch method is used. The main aim of PCA is to explain the variance-covariance structure of the data through a few linear combinations of the original variables. The main concern with PCA is that it uses only the global information of face images, so it is not very effective at distinguishing different facial expressions.
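The two batch methods mentioned above can be compared in a short NumPy sketch. The data here is random and purely hypothetical (50 samples of 4 variables); the point is only that eigendecomposition of the covariance matrix and SVD of the centered data yield the same principal components and variances.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 50 samples of 4 correlated variables.
data = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))
centered = data - data.mean(axis=0)

# Batch method 1: eigendecomposition of the covariance matrix.
cov = centered.T @ centered / (len(data) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)            # eigh returns ascending order
order = np.argsort(eigvals)[::-1]                 # re-sort, largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Batch method 2: singular value decomposition of the centered data.
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
svd_variances = singular_values ** 2 / (len(data) - 1)

# Keep the two leading components and project the data onto them.
projected = centered @ eigvecs[:, :2]
explained = eigvals[:2].sum() / eigvals.sum()
print("fraction of variance kept by 2 components:", explained)
```

The rows of `vt` and the columns of `eigvecs` describe the same directions (possibly with opposite sign), so either route can be used; SVD is often preferred numerically because it avoids forming the covariance matrix explicitly.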

Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) method is used in statistics and pattern recognition to find a linear combination of features. The resulting combination may be used as a linear classifier or, more commonly, for dimensionality reduction before later classification. LDA explicitly attempts to model the difference between the classes of data.
PCA on the other hand does not take into account any difference in class, and factor analysis builds the feature combinations based on differences rather than similarities.

Deep Learning (State-of-the-art results)
Deep learning is a branch of machine learning that can be applied in supervised, semi-supervised or unsupervised settings. It structures algorithms in layers to create an “artificial neural network” that can learn and make intelligent decisions on its own. In recent years, there has been a breakthrough in deep-learning algorithms applied to the field of computer vision. The main advantage of deep learning is that it removes, or greatly reduces, the dependence on physics-based models and other pre-processing techniques by enabling “end-to-end” learning directly from input images. As such, deep learning techniques, e.g. Convolutional Neural Networks (CNN), have achieved state-of-the-art results in various fields, including object recognition, face recognition, scene understanding, and facial emotion recognition.

Well-known deep learning algorithms include different architectures of Artificial Neural Network (ANN) such as Convolutional Neural Network (CNN), Long Short-term Memory (LSTM), and Extreme Learning Machine (ELM).

The popularity of deep learning approaches in the domain of emotion recognition may be mainly attributed to their success in related applications such as computer vision.

Chapter 2: Requirement Analysis
The software requirements, an overview of the program, background research on the techniques used, and step-by-step instructions for the program are included in this chapter.

2.1. Hardware Requirements
To implement the emotion recognition program, we use Windows 10, as it serves as a modular platform for executing Python-based code. The basic hardware needs are a machine running Windows 10 (or a Raspberry Pi) and a camera; for a physical implementation, a webcam with appropriate placement can be used.

2.2. Software Requirements
Anaconda Version 5.2, Jupyter Notebook
Python Version 3.6 (Programming Language)
OpenCV Version 3.4.1 (Library)
NumPy Version 1.15 (Library)
Scikit Learn Version 0.19.1 (Library)
Scikit Image Version 0.14.0 (Library)
2.2.1. Python
Python is a powerful, high-level, general-purpose, interpreted, object-oriented programming language created by Guido van Rossum during 1985-1990. It has a simple, easy-to-use syntax, making it a good language for someone learning to program for the first time or on their own.
Python has recently emerged as a first-class citizen in modern software development, infrastructure management, and data analysis, and is now a major force in web application creation and systems management, and a key driver of the explosion in big data analytics and machine intelligence. Perfect for IT, Python simplifies many kinds of work, from system automation to working in cutting-edge fields like machine learning.

Python was chosen for this project for the following reasons:
It is easy to learn and easy to use.
The number of features in the language itself is modest, requiring relatively little investment of time or effort to produce your first programs. The Python syntax is designed to be readable and straightforward. This simplicity makes Python an ideal teaching language, and it lets newcomers pick it up quickly.
It is broadly adopted, versatile, and supported
Python is both popular and widely used, as its high rankings in surveys like the TIOBE Index (https://www.tiobe.com/tiobe-index/) and the large number of GitHub projects using Python attest. Python runs on every major operating system and platform, and most minor ones too. Many major libraries and API-powered services have Python bindings or wrappers, letting Python interface freely with those services or directly use those libraries.
Rich Ecosystem, Strong Standard Library
The success of Python rests on a rich ecosystem of first- and third-party software. Python benefits from both a strong standard library and a generous assortment of easily obtained and readily used libraries from third-party developers. Python has been enriched by decades of expansion and contribution.

The default Python distribution also provides a rudimentary, but useful, cross-platform GUI library via Tkinter, and an embedded copy of the SQLite 3 database. The thousands of third-party libraries, available through the Python Package Index (PyPI), constitute the strongest showcase for Python’s popularity and versatility.

Python programming can be used for the applications below:
General application programming
Data Science and Machine Learning
Web Services and RESTful APIs
Metaprogramming and code generation

Great Community and Support
Last but not least, Python has a large supporting community. There are numerous active forums online which can be handy if you are stuck. Some of them are:
Learn Python subreddit
Google Forum for Python
Python Questions (Stack Overflow)
2.2.2. Libraries used
2.2.2.1. OpenCV
OpenCV (Open Source Computer Vision Library) is an open source computer vision and machine learning library. It is used by companies, research groups and governmental bodies. It has more than 2500 optimized algorithms, comprising a comprehensive set of classic and state-of-the-art computer vision and machine learning algorithms.

One real-world application of OpenCV is the detection of swimming pool drowning accidents in Europe.

2.2.2.2. NumPy
NumPy is the fundamental package for scientific computing with Python. It contains, among other things, a powerful N-dimensional array object, sophisticated (broadcasting) functions, tools for integrating C/C++ and FORTRAN code, and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.

2.2.2.3. Scikit
Scikit-learn (formerly scikits.learn) is an open source, commercially usable (BSD license) machine learning library for the Python programming language. It provides simple and efficient tools for data mining and data analysis, is accessible to everybody, and is reusable in various contexts. It is built on NumPy, SciPy, and matplotlib.
It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

Chapter 3: Design and Approach of Project
3.1. Overview
The overview of the program can be seen in Figure 1 below.
The images are first loaded from a folder called “Emotions”. These are the training images, stored in sub-folders named after their respective emotions. The images are taken from Google Images and the Cohn-Kanade (CK+) Dataset. The images then go through preprocessing, followed by feature extraction, and finally classification, before an output is produced.
The preprocessing steps include detecting the face using the Viola-Jones method and cropping and resizing the detected face to 350 by 350 pixels. For feature extraction and classification, the FisherFaces algorithm uses Linear Discriminant Analysis (LDA), the EigenFaces algorithm uses Principal Components Analysis (PCA), Local Binary Patterns Histogram (LBPH) uses texture descriptors, and lastly, the Histogram of Oriented Gradients (HOG) is computed on the cropped image to create a feature vector, which is then passed to a Support Vector Machine (SVM).

After the above steps, the test images are then loaded and features are compared to the trained model and classified accordingly.

3.1. Choosing Images
Extraneous information in the image, such as hair, spectacles and facial hair (e.g. a beard), can negatively affect emotion recognition accuracy. Choosing images with the features that the method requires, including the relative muscle positions, is therefore crucial.
To maximize accuracy, images with specific characteristics were manually selected from Google Images and the Cohn-Kanade (CK+) Dataset and saved in the Training Dataset. Figure 2 below shows the images from the folder “surprise”, which contains both Google Images and CK+ Dataset (grayscale) photos. Figure 3 shows the characteristics that were focused on when selecting the photos for the Training Dataset.

3.2. Pre-processing using Viola-Jones Algorithm
3.2.1. Introduction to Viola-Jones (V-J) algorithm
The V-J algorithm, also known as the Haar Cascade Classifier, uses a cascade function that is trained on ‘positive’ and ‘negative’ images. Positive images are images that contain the target object (a face), while negative images are irrelevant images (e.g. trees, cars) that are used to distinguish between what is a face and what is not.

The V-J method is performed using open source code, created by various authors to detect the face region and released under the Intel License Agreement For Open Source Computer Vision Library, which is also used by OpenCV (OpenCV Team, 2018). Pretrained V-J cascades for frontal faces are also available in OpenCV. The algorithm is implemented in OpenCV as cvHaarDetectObjects().
Figure 11. Haar feature types and how they are applied in images

The image is cropped to the face area once the face is detected.

In my project, 4 different types of Haar Cascade Classifiers have been used:
haarcascade_frontalface_default
haarcascade_frontalface_alt
haarcascade_frontalface_alt2
haarcascade_frontalface_alt_tree
The cascading classifier selects the face region regardless of the size of the area or image. However, this also means that each image is effectively cropped to its own size. The sizes are all within a small margin of each other, but any difference results in feature vectors of different lengths, which the chosen learning techniques do not allow. Therefore, each image must be resized to an identical size: 350×350 pixels. This was the maximum size obtained after cropping across all images, which minimizes the data lost by resizing and avoids adding padded data.
The cropping can be seen in Figure 12 below. Images of equal size are essential for proper classification.

Figure 12. Original Image (Left) to Cropped Image (Right), 350×350 pixels.
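The reason for resizing every crop to the same size can be shown with a small NumPy sketch. The two crops below are blank, hypothetical arrays, and `resize_nearest` is a simple nearest-neighbour stand-in written for this illustration; in practice a library routine such as OpenCV's `cv2.resize` would normally be used instead.

```python
import numpy as np

TARGET = 350  # side length in pixels, as used in this project

def resize_nearest(image, size=TARGET):
    """Nearest-neighbour resize of a 2-D grayscale array to size x size."""
    rows = np.arange(size) * image.shape[0] // size
    cols = np.arange(size) * image.shape[1] // size
    return image[rows[:, None], cols]

# Two hypothetical face crops whose sizes differ slightly.
crop_a = np.zeros((342, 342), dtype=np.uint8)
crop_b = np.zeros((350, 350), dtype=np.uint8)

# Flattened as they are, the crops give feature vectors of different lengths.
raw_lengths = (crop_a.size, crop_b.size)

# After resizing, every crop flattens to a vector of the same length.
vectors = [resize_nearest(crop).ravel() for crop in (crop_a, crop_b)]
print(raw_lengths, [v.shape for v in vectors])
```

Only once all vectors have the same length can they be stacked into a single training matrix for the feature extraction and classification steps.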

3.2.1.1. Viola-Jones Algorithm Flowchart
The algorithm has four stages: Haar Feature Selection, Creating an Integral Image, AdaBoost Training, and Cascading Classifiers. Figure 13 shows the flowchart of the Viola-Jones algorithm.

Figure 13. Flowchart of the Viola-Jones Algorithm

Haar Features Selection
Human faces share regularities that can be matched using Haar Features (refer to Figure 11). Some of the properties that are common in human faces are:
The eye region is darker than the upper cheeks
Figure 14. Haar Features Selection: Eye region
The nose bridge is brighter than the eyes
Figure 15. Haar Features Selection: Nose region
The properties that together form matchable facial features include the location and size (eyes, mouth, bridge of nose) and the value (oriented gradients of pixel intensities).

The rectangle features show the difference in brightness between the white and black rectangles over a specific area, and each feature is related to a special location in the sub-window.

Creating an integral image
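Rectangle features can be computed very rapidly using an intermediate representation of the image called the integral image. The integral image at location (x, y) contains the sum of the pixels above and to the left of (x, y), inclusive:

\[ ii(x, y) = \sum_{x' \le x,\; y' \le y} i(x', y') \]

where ii(x, y) is the integral image and i(x, y) is the original image.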

The integral image allows rectangular features to be evaluated in constant time, which gives them a considerable speed advantage over more sophisticated alternative features. It is what makes the rapid computation of Haar-like features possible.
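A minimal NumPy sketch of the integral image and of a two-rectangle Haar-like feature is given here. The 4×4 test image and the function names are made up for illustration; the feature shown (lower half minus upper half of a window) is only one of the possible rectangle arrangements.

```python
import numpy as np

def integral_image(image):
    """ii[y, x] = sum of image[0..y, 0..x], both bounds inclusive."""
    return image.astype(np.int64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle using at most four table lookups."""
    bottom, right = top + height - 1, left + width - 1
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return int(total)

def two_rect_feature(ii, top, left, height, width):
    """Sum of the lower half of a window minus the sum of its upper half."""
    half = height // 2
    upper = rect_sum(ii, top, left, half, width)
    lower = rect_sum(ii, top + half, left, half, width)
    return lower - upper

# Hypothetical 4x4 image with pixel values 0..15.
image = np.arange(16, dtype=np.uint8).reshape(4, 4)
ii = integral_image(image)

print(rect_sum(ii, 1, 1, 2, 2))          # sum of the central 2x2 block
print(two_rect_feature(ii, 0, 0, 4, 4))  # brighter lower half gives a positive value
```

Because `rect_sum` needs the same number of lookups whatever the rectangle size, the cost of a feature does not grow with the window, which is the constant-time property described above.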

AdaBoost Training
The AdaBoost algorithm is then used to select the best features and to train the classifiers that use them. It chooses a small number of important visual features from a much larger set and yields very efficient classifiers.

The AdaBoost algorithm constructs a “strong classifier” as a linear combination of weighted simple “weak classifiers”.
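The idea of building a strong classifier from weighted weak classifiers can be sketched with a small AdaBoost loop. This is an illustrative toy, not the training procedure used inside OpenCV: the weak classifiers are simple thresholds ("decision stumps") on a single number rather than Haar features, and the data, the number of rounds and the function names are all assumptions.

```python
import numpy as np

def train_adaboost(X, y, rounds=3):
    """Train threshold stumps on 1-D data X with labels y in {-1, +1}."""
    n = len(X)
    weights = np.full(n, 1.0 / n)                 # start with uniform sample weights
    xs = np.sort(X)
    # Candidate thresholds: one below all samples (constant stump) plus midpoints.
    thresholds = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0))
    ensemble = []                                 # entries: (alpha, threshold, polarity)
    for _ in range(rounds):
        best = None
        for thr in thresholds:
            for polarity in (1, -1):
                pred = np.where(X > thr, polarity, -polarity)
                err = weights[pred != y].sum()    # weighted error of this stump
                if best is None or err < best[0]:
                    best = (err, thr, polarity, pred)
        err, thr, polarity, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)     # guard against log(0)
        alpha = 0.5 * np.log((1 - err) / err)     # weight of this weak classifier
        ensemble.append((alpha, thr, polarity))
        weights *= np.exp(-alpha * y * pred)      # raise weights of mistakes
        weights /= weights.sum()
    return ensemble

def predict(ensemble, X):
    """Strong classifier: sign of the weighted sum of the weak classifiers."""
    score = sum(a * np.where(X > t, p, -p) for a, t, p in ensemble)
    return np.sign(score)

# Hypothetical data that no single threshold can separate.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1, 1, -1, -1, 1, 1])

ensemble = train_adaboost(X, y, rounds=3)
print(predict(ensemble, X))
```

On this toy data the first stump alone misclassifies two samples, while the weighted vote of three stumps classifies all six correctly, which illustrates how weak classifiers combine into a stronger one.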
Cascading Classifiers
On average, only about 0.01% of the sub-windows in an image are positive (contain a face). Evaluating negative sub-windows is time consuming, so computation should be focused on the sub-windows that may be positive. To achieve this, a simple two-feature classifier can be used as a first line of defense to reject most negative sub-windows, a second classifier can then remove the negatives that were harder to detect in the first layer, and so on. A cascade of gradually more complex classifiers achieves better detection rates.
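The early-rejection behaviour of a cascade can be sketched in a few lines of plain Python. The two stage functions and the three example windows below are invented for illustration and are far simpler than real Haar-feature stages; only the control flow (stop at the first stage that rejects) reflects the technique.

```python
def run_cascade(window, stages):
    """Return (accepted, stages_evaluated); stop at the first rejecting stage."""
    for index, stage in enumerate(stages, start=1):
        if not stage(window):
            return False, index
    return True, len(stages)

# Hypothetical stages, ordered from cheapest to most selective. Each takes a
# window (a flat list of pixel intensities) and returns True to let it pass.
def stage_contrast(window):
    return max(window) - min(window) >= 50

def stage_dark_upper_half(window):
    half = len(window) // 2
    return sum(window[:half]) < sum(window[half:])

stages = [stage_contrast, stage_dark_upper_half]

flat_wall = [120, 121, 119, 120]   # almost no contrast: rejected by stage 1
face_like = [40, 60, 180, 200]     # dark upper half, bright lower half
inverted = [200, 180, 60, 40]      # enough contrast, but wrong arrangement

for name, window in [("flat_wall", flat_wall), ("face_like", face_like), ("inverted", inverted)]:
    print(name, run_cascade(window, stages))
```

Most windows resemble `flat_wall` and are discarded after a single cheap test, so the more expensive later stages only run on the few windows that remain plausible.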
The advantages and disadvantages of Viola-Jones algorithm:
Advantages:
Robust. Extremely fast feature computation, very high detection rate (true-positive rate) and very low false-positive rate.

Efficient feature selection
Real time. For practical applications, at least 2 frames per second must be processed.

Face Detection. Able to distinguish faces from non-faces.

Disadvantages:
The detector is most effective only on frontal images of faces. It performs poorly on tilted or turned faces.

Influenced by the clarity of image
It is sensitive to lighting conditions
Training takes a lot of time to separate a negative face from a positive face
3.3. Feature Extraction
Many different methods can be used for feature extraction. In my project, the Eigenfaces, Fisherfaces, Local Binary Patterns Histogram (LBPH), Histogram of Oriented Gradients (HOG) and Support Vector Machine (SVM) techniques are used and covered in this section.

3.3.1. EigenFaces Algorithm
The task of facial recognition is to discriminate input signals (image data) into several classes (emotions). The input signals are highly noisy (e.g. the noise can be caused by differing lighting conditions, pose, etc.), yet the input images are not completely random, and despite their differences there are patterns which occur in any input signal. In the domain of facial recognition, such patterns could be the presence of certain objects (eyes, nose, mouth) in any face, as well as the relative distances between these objects.
Eigenfaces are the set of eigenvectors used in the computer vision problem of human facial recognition. They can be simply defined as the eigenvectors that each represent one dimension of the face image space. Every eigenvector has an eigenvalue associated with it, and the eigenvectors with the largest eigenvalues provide more information on the face variation than the ones with smaller eigenvalues. These characteristic features are called eigenfaces in the facial recognition system, and they can be extracted from the original image data by means of a mathematical tool called Principal Component Analysis (PCA).
It should be noted that the Eigenfaces algorithm is highly affected by illumination.

3.3.1.1. Principal Component Analysis (PCA) Technique
Principal Component Analysis (PCA) is one of the most commonly used unsupervised learning algorithms for simplifying a dataset. It is used for data compression, feature extraction and dimensionality reduction. PCA reduces dimensionality by extracting the smallest number of components (i.e. major features/directions) that account for most of the variation in the original data. It is able to best explain the variation in the set of images, helping to remove redundancy and preserve the variance in a smaller number of coefficients. Geometrically, PCA finds the line, plane or higher-dimensional subspace that best approximates the data in the least-squares sense.
Using PCA, we can transform each original image of the training dataset into a corresponding eigenface. PCA reconstructs any original image from the training set by combining the eigenfaces (characteristic features of the faces). This means that the original face image can be reconstructed from the eigenfaces when we add up all the eigenfaces (features) in the right proportion, as seen in the figure below.

Each eigenface only represents certain characteristic features of the face, which may or may not be present in the original image. If the feature is present in the original image to a higher degree, the share of the corresponding eigenface in the sum of the eigenfaces should be greater. On the other hand, if the feature is not present in the original image, the corresponding eigenface would contribute a smaller part to the sum of eigenfaces.
Hence, the original image can be reconstructed as a weighted sum of all eigenfaces, where the weight of each eigenface specifies to what degree that characteristic feature is present in the original image. The eigenfaces themselves should be built from a large set of digitized images of human faces, taken under the same lighting conditions and normalized to line up the eyes and mouths. If we use all the eigenfaces extracted from the original images, we can reconstruct the original images exactly. We can also use only a subset of the eigenfaces, in which case the reconstructed image is only an approximation of the original. Losses due to omitting some of the eigenfaces can be minimized by keeping only the most important features (eigenfaces). Omitting eigenfaces may sometimes be necessary when computational resources are scarce.
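The reconstruction of a face as a weighted sum of eigenfaces can be sketched with NumPy. The "images" here are random 5×5 arrays standing in for real face images, and the sketch assumes the common convention of subtracting the mean face before computing the eigenfaces; neither detail is taken from this project's code.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical training set: 6 images of 5x5 pixels, flattened to 25 values each.
faces = rng.uniform(0, 255, size=(6, 25))
mean_face = faces.mean(axis=0)
centered = faces - mean_face

# The rows of vt are the eigenfaces (principal directions of the training set).
_, _, vt = np.linalg.svd(centered, full_matrices=False)

def reconstruct(face, eigenfaces):
    """Project a face onto the eigenfaces and rebuild it from the weights."""
    weights = eigenfaces @ (face - mean_face)       # one weight per eigenface
    return mean_face + weights @ eigenfaces, weights

full, w_full = reconstruct(faces[0], vt)            # using all eigenfaces
approx, w_part = reconstruct(faces[0], vt[:2])      # using the two leading ones

error_full = np.linalg.norm(faces[0] - full)
error_part = np.linalg.norm(faces[0] - approx)
print("error with all eigenfaces:", error_full)
print("error with two eigenfaces:", error_part)
```

With all eigenfaces the reconstruction error is zero up to rounding, while keeping only two of them leaves a visible residual; this is the trade-off between accuracy and computational cost described above.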
It is possible to reconstruct a face from the eigenfaces given a set of weights, and also to go the opposite way and extract the weights from the eigenfaces and the face to be recognized. The weights describe how much the face differs from the typical face represented by the eigenfaces. Therefore, using these weights we can determine two important things:
Determine if a face is present in an image.
If the weights of the image differ too much from the weights of face images, the image is most likely not a face.

Similar faces.
Typically, similar faces would possess similar features (eigenfaces) to similar degrees (weights). By extracting weights from all the images available, the images could be grouped into clusters.

In summary, the purpose of PCA is to reduce the large dimensionality of the face space (observed variables) to the smaller intrinsic dimensionality of feature space (independent variables), which are needed to describe the data economically. This is the case when there is a strong correlation between the observed variables.

3.3.1.2. Eigenfaces Algorithm
To extract the characteristic features of a face image, we can capture a large set of face images and use the features extracted from them to encode and compare individual face images, which ultimately allows us to classify an image based on the trained dataset. In mathematical terms, we seek the principal components of the distribution of faces, or the eigenvectors of the covariance matrix of the set of face images, treating each image as a point (or vector) in a very high dimensional space. Each image location contributes to each eigenvector, so that it is possible to display these eigenvectors as a sort of ghostly face image, which is called an eigenface, as shown in figure 10 above.
The Eigenfaces Algorithm flowchart in Figure 11 illustrates the stages in an Eigenface-based recognition system.
(1) At the start, a set of images is loaded into a database. These images are called the training set, as they are used to train a model which is then saved in XML format. In my project, the models are saved as “eface-dataset83”, “eface-dataset83-reduced” and “eigenface-emotionsdataset” in XML format under the “trained-models” folder.

(2) The second step is to create the eigenfaces. Eigenfaces are created by extracting characteristic features from the faces. The input images are normalized to line up the eyes and mouths then resized to the same size (350 by 350 pixels). Eigenfaces can now be extracted from the image data by using a mathematical tool called Principal Component Analysis (PCA).
(3) When the eigenfaces have been created, each image will be represented as a vector of weights.
(4) Test images are then loaded as input images.

(5) The weights of the incoming input image are computed and compared to the weights of the images already in the system. If the distance between the input image’s weights and the closest stored weights exceeds a given threshold, the image is considered unidentified. Otherwise, the input image is identified by finding the image in the database whose weights are closest to those of the input image, and that image is returned as a hit to the user of the system.
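Step (5) can be sketched as a nearest-neighbour search in weight space with a rejection threshold. The weight vectors, labels, threshold value and the function name `identify` below are hypothetical, and Euclidean distance is assumed as the measure of closeness.

```python
import numpy as np

def identify(test_weights, train_weights, train_labels, threshold):
    """Return (label, distance) of the closest training image.

    The label is None ("unidentified") when even the closest stored
    weight vector is farther away than the threshold.
    """
    distances = np.linalg.norm(train_weights - test_weights, axis=1)
    nearest = int(np.argmin(distances))
    if distances[nearest] > threshold:
        return None, float(distances[nearest])
    return train_labels[nearest], float(distances[nearest])

# Hypothetical weight vectors (one row per training image) and their labels.
train_weights = np.array([[1.0, 0.0], [0.9, 0.2], [-1.0, 0.5], [-0.8, 0.4]])
train_labels = ["happy", "happy", "sad", "sad"]

match = identify(np.array([0.98, 0.05]), train_weights, train_labels, threshold=0.5)
reject = identify(np.array([5.0, 5.0]), train_weights, train_labels, threshold=0.5)
print(match)
print(reject)
```

The threshold controls how strict the system is: a small value rejects more inputs as unidentified, while a large value always returns the closest training image even when it is a poor match.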

For more knowledge regarding the equations used and the step-by-step technique used to eventually produce the eigenfaces, you can refer to the links here and here.

3.3.2. Fisherfaces Recognition Introduction
Fisherface Recognition uses a feature extraction method that looks for class-specific linear projections: Fisher’s Linear Discriminant Analysis (FLDA, or simply LDA).

The Fisherface method in the OpenCV library requires the training and testing data to be grayscale images, and assumes that the training images have the same size as the testing images. Since the training dataset heavily affects the performance of the Fisherfaces algorithm, it is important to use training and testing datasets that are similar in terms of characteristics and data size.
3.3.2.1. Fisher’s Linear Discriminant Analysis (FLDA/LDA)
Fisher’s Linear Discriminant (FLDA) is a widely used method for feature extraction and dimensionality reduction in pattern recognition. It is also a method used in statistics and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier or, more commonly, for dimensionality reduction before later classification. In short, FLDA tries to find the projection direction in which training samples belonging to different classes are best separated.
The Fisherface method is an enhancement of the Eigenface method: Fisherface uses Fisher’s Linear Discriminant Analysis (FLDA/LDA) for dimensionality reduction, while the Eigenface method uses Principal Component Analysis (PCA) to linearly project the image to a low dimensional feature space.

LDA minimizes the variance within a class, while maximizing the variance between classes at the same time, as seen in Figure 9.

When LDA is used to find the subspace representation of a set of face images, the resulting basis vectors defining that space are known as Fisherfaces.

LDA is also closely related to principal component analysis (PCA) and factor analysis in that they both look for linear combinations of variables which best explain the data. LDA explicitly attempts to model the difference between the classes of data. PCA on the other hand does not take into account any difference in class, and factor analysis builds the feature combinations based on differences rather than similarities. Discriminant analysis is also different from factor analysis in that it is not an interdependence technique: a distinction between independent variables and dependent variables (also called criterion variables) must be made.

LDA works when the measurements made on independent variables for each observation are continuous quantities. When dealing with categorical independent variables, the equivalent technique is discriminant correspondence analysis.
Discriminant analysis is used when groups are known a priori (unlike in cluster analysis). Each case must have a score on one or more quantitative predictor measures, and a score on a group measure. In simple terms, discriminant function analysis is classification – the act of distributing things into groups, classes or categories of the same type.

Assumptions for Fisherfaces: (1) N is the number of images in the database, (2) c is the number of persons in the database.

3.3.2.2. FisherFaces Algorithm
First, the images are loaded; these images are known as the training dataset. The average of all faces is computed (1), and then the average face of each person is computed (2). The scatter matrices are built from these averages: the within-class scatter matrix from the difference between each face and the average face of its person (2), and the between-class scatter matrix from the difference between each person’s average face (2) and the overall average (1). From the within-class and between-class scatter matrices, a projection matrix W is then generated that maximizes the between-class scatter relative to the within-class scatter. If the within-class scatter matrix is nonsingular, its inverse is computed and multiplied by the between-class scatter matrix, and the eigenvectors of the product are then computed (the Fisherfaces algorithm is an advancement of Eigenfaces). However, if the within-class scatter matrix is singular, PCA is applied first, and then LDA is applied.
For more in-depth details regarding the equations used for FLDA, refer to this link here.
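The scatter-matrix computation described above can be illustrated with a small NumPy sketch. It is a simplified example under assumptions of our own: `fisher_directions` is a hypothetical helper name, the 2-D toy data stands in for face images, and a pseudo-inverse is used as a simple fallback for a singular within-class scatter matrix (Fisherfaces itself applies PCA first in that case).

```python
import numpy as np

def fisher_directions(X, y, num_components):
    """X: (n_samples, n_features); y: class labels. Returns the projection matrix W."""
    overall_mean = X.mean(axis=0)                 # average of all samples
    n_features = X.shape[1]
    s_w = np.zeros((n_features, n_features))      # within-class scatter
    s_b = np.zeros((n_features, n_features))      # between-class scatter
    for c in np.unique(y):
        X_c = X[y == c]
        mean_c = X_c.mean(axis=0)                 # average of this class
        centered = X_c - mean_c
        s_w += centered.T @ centered
        diff = (mean_c - overall_mean).reshape(-1, 1)
        s_b += len(X_c) * (diff @ diff.T)
    # Eigenvectors of inv(S_w) @ S_b; pinv is an assumption made for this sketch.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(s_w) @ s_b)
    order = np.argsort(-eigvals.real)[:num_components]
    return eigvecs[:, order].real

# Toy data (hypothetical): both classes vary a lot along x, but differ only along y.
X = np.array([[0, 0.0], [2, 0.1], [4, -0.1], [6, 0.0],
              [0, 1.0], [2, 1.1], [4, 0.9], [6, 1.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

W = fisher_directions(X, y, num_components=1)
projected = X @ W[:, 0]
```

In this toy example the direction found points almost entirely along the y axis, the axis that separates the classes, whereas PCA would favour the x axis because it carries the most variance.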

3.3.3. Local Binary Patterns Histogram Introduction
Local Binary Patterns Histogram (LBPH) is provided by the OpenCV library (Open Source Computer Vision Library). It is a simple and very efficient texture operator. It labels the pixels of an image by thresholding the neighborhood of each pixel and treating the result as a binary number, as seen in Figure 14 below.

The starting point for LBPH was the idea that two-dimensional textures could be described by two complementary local measures:
1) Spatial structure (pattern): in terms of gray-scale and rotation invariance, affected by rotation of the texture only
2) Contrast (the amount of local image texture): in terms of gray-scale and rotation invariance, affected by the gray-scale only
3.3.3.1. LBPH as a texture descriptor
Therefore, if a texture descriptor is capable of separating the texture’s pattern information from its contrast, then invariance to monotonic gray-scale changes can be obtained. Using distributions of LBP codes gave excellent texture classification results. Local binary texture patterns were found to be uniform, corresponding to primitive micro-features such as edges, corners and spots; hence they can be regarded as feature detectors that are triggered by the best matching pattern. This made LBP fundamental to the development of a generalized gray-scale and rotation invariant operator. LBP was also combined with other descriptors such as SIFT to consolidate their strengths, which proved to be effective in interest region detection.
Due to its computational simplicity, LBP was used early on in applications such as visual inspection. The high potential of LBP is clearly demonstrated by all these outcomes, which motivated further research. LBP also happens to be one of the most used texture descriptors in medical image analysis, where it has been shown to describe medical images accurately and in high detail. Much research led to the development of variants of LBP, which are widely considered the state of the art among known texture descriptors. The histogram is one of the most commonly used texture parameters for the analysis of medical images.
Coming from the statistical class of texture techniques, an image histogram displays the count of pixels in an image possessing a given gray level value. Many other parameters can also be derived from the histogram, such as its mean, variance and percentiles. The discrete occurrence histogram of the local binary texture patterns computed over an image or a region of an image was, and still is, a very powerful texture feature. By computing the occurrence histogram, one can effectively combine the structural and statistical approaches. This was demonstrated through LBP’s detection of micro-structures such as edges, lines and spots, whose underlying distribution was estimated by an occurrence histogram.
A later investigation was also carried out to find the relationship of LBP to a method based on multidimensional gray-scale difference histograms. The results of this research led to a simplification of the LBP operator based on signed gray level differences. Vector quantization was used to reduce the dimensionality of the feature space of multidimensional histograms, forming a one-dimensional texton histogram. Other researchers further found that, instead of using only the LBP histogram derived from standard position textures, they could improve the discriminative power of the classification by extracting a certain small subset of LBPs from the histogram. This improved the classification of tilted 3-D textures even without pre-processing.
LBP histograms are also used in fields other than image analysis. The method was shown to be tolerant to illumination variations, the multi-modality of the background, and the introduction or removal of background objects. In 3D textured surfaces, multiple LBP histograms were used as object models, and this produced excellent results. LBP histograms were also used in face analysis research, where they enabled methods to be robust to face misalignments and pose variations.

3.3.3.2. LBPH Algorithm
Parameters:
Radius: the radius is used to build the circular local binary pattern and represents the radius around the central pixel. It is usually set to 1.

Neighbours: the number of sample points to build the circular local binary pattern. Keep in mind: the more sample points you include, the higher the computational cost. It is usually set to 8.

Grid X: the number of cells in the horizontal direction. The more cells, the finer the grid, the higher the dimensionality of the resulting feature vector. It is usually set to 8.

Grid Y: the number of cells in the vertical direction. The more cells, the finer the grid, the higher the dimensionality of the resulting feature vector. It is usually set to 8.
Training the Algorithm:
First, we need to train the algorithm. To do so, we need to use a dataset with the facial images of the people we want to recognize. We need to also set an ID (it may be a number or the name of the person) for each image, so the algorithm will use this information to recognize an input image and give you an output. Images of the same person must have the same ID. With the training set already constructed, let’s see the LBPH computational steps.

Applying the LBP operation: The first computational step of the LBPH is to create an intermediate image that describes the original image in a better way, by highlighting the facial characteristics. To do so, the algorithm uses a concept of a sliding window, based on the parameters radius and neighbours.

Figure 22. Computation of LBPH from pixels to decimal.
With reference to Figure 15 above, suppose we have a facial image in grayscale. We can take part of this image as a window of 3×3 pixels, which can also be represented as a 3×3 matrix containing the intensity of each pixel (0~255). Then, we take the central value of the matrix to be used as the threshold. This value is used to define the new values of the 8 neighbors. For each neighbor of the central value (threshold), we set a new binary value: 1 for values equal to or higher than the threshold, and 0 for values lower than the threshold.

Now, the matrix will contain only binary values (ignoring the central value). We need to concatenate the binary values from each position of the matrix, line by line, into a new binary value (e.g. 10001101). Note: some authors use other approaches to concatenate the binary values (e.g. clockwise direction). The individual codes then differ, but the final recognition result will be the same as long as one ordering is used consistently.

Then, we convert this binary value to a decimal value and set it to the central value of the matrix, which is actually a pixel from the original image. At the end of this procedure (LBP procedure), we have a new image which represents better the characteristics of the original image.
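The basic 3×3 LBP operation described above can be written as a short Python sketch. The function names `lbp_code` and `lbp_image` and the pixel values are hypothetical, chosen so that the window reproduces the example code 10001101; border pixels are simply skipped here, which is an assumption of this sketch.

```python
def lbp_code(window):
    """window: 3x3 list of pixel intensities. Returns the LBP code as a decimal value."""
    center = window[1][1]  # the central value is the threshold
    bits = ""
    for r in range(3):
        for c in range(3):
            if (r, c) == (1, 1):
                continue  # ignore the central value itself
            # 1 for values equal to or higher than the threshold, otherwise 0
            bits += "1" if window[r][c] >= center else "0"
    return int(bits, 2)  # binary to decimal

def lbp_image(image):
    """Apply the 3x3 LBP operator to every interior pixel of a 2-D list."""
    height, width = len(image), len(image[0])
    result = []
    for r in range(1, height - 1):
        row = []
        for c in range(1, width - 1):
            window = [image[r + dr][c - 1:c + 2] for dr in (-1, 0, 1)]
            row.append(lbp_code(window))
        result.append(row)
    return result

# Hypothetical 3x3 window with threshold 90.
window = [
    [200, 50, 60],
    [80, 90, 120],
    [100, 30, 95],
]
code = lbp_code(window)  # bits read line by line: 10001101
```

Here `code` is 141, the decimal value of 10001101, which replaces the central pixel in the new image.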

Note: The LBP procedure was later extended to use different radius and neighbor values; this extension is called Circular LBP.

Sample points that fall between pixels are handled by bilinear interpolation: if a data point lies between the pixels, the values of the 4 nearest pixels (2×2) are used to estimate the value of the new data point.
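As an illustration of Circular LBP sampling, the sketch below places the sample points on a circle around the central pixel and estimates each one by bilinear interpolation from its 4 nearest pixels. The function names and the starting angle of the circle are assumptions made for this example, not OpenCV’s exact implementation.

```python
import math

def bilinear(image, y, x):
    """Estimate the value at a non-integer position (y, x) from the 4 nearest pixels."""
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1 = min(y0 + 1, len(image) - 1)
    x1 = min(x0 + 1, len(image[0]) - 1)
    dy, dx = y - y0, x - x0
    top = image[y0][x0] * (1 - dx) + image[y0][x1] * dx
    bottom = image[y1][x0] * (1 - dx) + image[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def circular_neighbors(image, r, c, radius=1, neighbors=8):
    """Sample `neighbors` points on a circle of `radius` around pixel (r, c)."""
    values = []
    for k in range(neighbors):
        angle = 2 * math.pi * k / neighbors
        y = r - radius * math.sin(angle)
        x = c + radius * math.cos(angle)
        values.append(bilinear(image, y, x))
    return values

small = [[0, 10], [20, 30]]
midpoint = bilinear(small, 0.5, 0.5)  # average of the four pixels

flat = [[7, 7, 7], [7, 7, 7], [7, 7, 7]]
samples = circular_neighbors(flat, 1, 1)
```

The sampled values are then thresholded against the central pixel in the same way as in the basic 3×3 operator.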

Extracting the Histograms: Now, using the image generated in the last step, we can use the Grid X and Grid Y parameters to divide the image into multiple grids, as can be seen in the following image:
Based on the image above, we can extract the histogram of each region as follows:
As we have an image in grayscale, each histogram (from each grid) will contain only 256 positions (0~255) representing the occurrences of each pixel intensity.

Then, we need to concatenate each histogram to create a new and bigger histogram. Supposing we have 8×8 grids, we will have 8×8×256 = 16,384 positions in the final histogram. The final histogram represents the characteristics of the original image.
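A minimal sketch of the histogram extraction step is given below. The helper name `lbph_descriptor` and the 16×16 toy LBP image are hypothetical; the image is split into Grid X by Grid Y cells, a 256-bin histogram is counted for each cell, and the histograms are concatenated.

```python
def lbph_descriptor(lbp_img, grid_x=8, grid_y=8, bins=256):
    """Concatenate one `bins`-position histogram per grid cell of an LBP image."""
    height, width = len(lbp_img), len(lbp_img[0])
    descriptor = []
    for gy in range(grid_y):
        for gx in range(grid_x):
            hist = [0] * bins
            # Pixel range covered by this cell.
            for r in range(gy * height // grid_y, (gy + 1) * height // grid_y):
                for c in range(gx * width // grid_x, (gx + 1) * width // grid_x):
                    hist[lbp_img[r][c]] += 1
            descriptor.extend(hist)
    return descriptor

# Hypothetical 16x16 LBP image with codes 0..255.
lbp_img = [[(r * 16 + c) % 256 for c in range(16)] for r in range(16)]
descriptor = lbph_descriptor(lbp_img, grid_x=8, grid_y=8)
```

For an 8×8 grid, `len(descriptor)` is 8×8×256 = 16,384, and the counts add up to the number of pixels in the image.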

Performing the emotion recognition: In this step, the algorithm is already trained. Each histogram created is used to represent one image from the training dataset. So, given an input image, we perform the steps again for this new image and create a histogram which represents the image.

So to find the image that matches the input image we just need to compare two histograms and return the image with the closest histogram.

We can use various approaches to compare the histograms (calculate the distance between two histograms), for example: Euclidean distance, chi-square, absolute value, etc. We can then use a threshold and the ‘confidence’ to automatically estimate whether the algorithm has correctly recognized the image. Note that the ‘confidence’ here is the distance between the histograms, so lower values mean a closer match: we can assume that the algorithm has successfully recognized the image if the confidence is lower than the defined threshold.
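The comparison step can be sketched as follows, using the chi-square distance as one of the possible measures. The function names, the tiny 4-bin histograms, the emotion labels and the threshold values are all made up for illustration.

```python
def chi_square(h1, h2):
    """Chi-square distance between two histograms (0 means identical)."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)

def predict(query, train_hists, labels, threshold):
    """Return (label, distance) of the closest histogram, or (None, distance) if too far."""
    distances = [chi_square(query, h) for h in train_hists]
    best = min(range(len(distances)), key=distances.__getitem__)
    if distances[best] > threshold:
        return None, distances[best]
    return labels[best], distances[best]

# Hypothetical training histograms, one per labelled image.
train_hists = [[4, 0, 2, 0], [0, 3, 0, 3]]
labels = ["happy", "sad"]

query = [3, 1, 2, 0]
label, distance = predict(query, train_hists, labels, threshold=2.0)
rejected, _ = predict(query, train_hists, labels, threshold=1.0)
```

The query is closest to the first histogram, so it is accepted as "happy" with the looser threshold and rejected with the stricter one.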

3.3.4. Histogram of Oriented Gradients (HOG) Introduction
HOG captures features by counting the occurrences of gradient orientations. Traditional HOG divides the image into different cells and computes a histogram of gradient orientations over them. HOG is applied extensively in object recognition areas such as facial recognition. Local shape information is often well described by the distribution of intensity gradients or edge directions, even without precise information about the location of the edges.

The HOG descriptor can cope with the problem of varying illumination, as it is largely invariant to lighting conditions. It is used to extract the magnitude of edge information and works well even under variations in pose and illumination. HOG works well in such challenging situations because it represents the directionality of edge information, which makes it significant for the study of the pattern and structure of the object of interest.

3.3.4.1. Histogram of Oriented Gradients Feature Extractor
The Algorithm Flowchart Steps:
Global Image Normalization to ensure that the mouth and eyes are at the same level across the images.

Divide the image into small sub-images: “cells”. Cells can be rectangular (R-HOG) or circular (C-HOG). Each cell accumulates a weighted local 1-D histogram of gradient directions over the pixels of the cell.
Computing HOG: Accumulate a histogram of edge orientations within that cell. The combined histogram entries are used as the feature vector describing the object.
Normalization: To provide better illumination invariance (lighting, shadows, etc.) normalize the cells across larger regions incorporating multiple cells: “blocks”. Normalize gamma and color by RGB and LAB to normalize the energy of the cells.
Collect HOGs over detection window.

Linear SVM for object/non-object classifications.

In detail, the first stage applies an optional global image normalization equalization that is designed to reduce the influence of illumination effects. In practice we use gamma (power law) compression, either computing the square root or the log of each color channel. Image texture strength is typically proportional to the local surface illumination so this compression helps to reduce the effects of local shadowing and illumination variations.

The second stage computes first order image gradients. These capture contour, silhouette and some texture information, while providing further resistance to illumination variations. The locally dominant color channel is used, which provides color invariance to a large extent. Variant methods may also include second order image derivatives, which act as primitive bar detectors – a useful feature for capturing, e.g. bar like structures in bicycles and limbs in humans.

The third stage aims to produce an encoding that is sensitive to local image content while remaining resistant to small changes in pose or appearance. The adopted method pools gradient orientation information locally in the same way as the SIFT feature. The image window is divided into small spatial regions, called “cells”. For each cell we accumulate a local 1-D histogram of gradient or edge orientations over all the pixels in the cell. This combined cell-level 1-D histogram forms the basic “orientation histogram” representation. Each orientation histogram divides the gradient angle range into a fixed number of predetermined bins. The gradient magnitudes of the pixels in the cell are used to vote into the orientation histogram.

The fourth stage computes normalization, which takes local groups of cells and contrast normalizes their overall responses before passing to next stage. Normalization introduces better invariance to illumination, shadowing, and edge contrast. It is performed by accumulating a measure of local histogram “energy” over local groups of cells that we call “blocks”. The result is used to normalize each cell in the block. Typically each individual cell is shared between several blocks, but its normalizations are block dependent and thus different. The cell thus appears several times in the final output vector with different normalizations. It improves the performance. We refer to the normalized block descriptors as Histogram of Oriented Gradient (HOG) descriptors.

The last step collects the HOG descriptors from all blocks of a dense overlapping grid of blocks covering the detection window into a combined feature vector for use in the window classifier.
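The core of the third and fourth stages can be illustrated with the simplified sketch below. It is not a full HOG implementation: gradients are computed only for the interior pixels of each cell, votes go to a single bin without interpolation, and the names `cell_histogram` and `normalize_block`, the 9 unsigned orientation bins and the 4×4 toy cells are assumptions made for this example.

```python
import math

def cell_histogram(cell, num_bins=9):
    """Orientation histogram of one cell (2-D list), weighted by gradient magnitude."""
    height, width = len(cell), len(cell[0])
    hist = [0.0] * num_bins
    bin_width = 180.0 / num_bins  # unsigned gradients: 0 to 180 degrees
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            gx = cell[r][c + 1] - cell[r][c - 1]  # first order gradients
            gy = cell[r + 1][c] - cell[r - 1][c]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # The gradient magnitude is the vote into the orientation bin.
            hist[int(angle // bin_width) % num_bins] += magnitude
    return hist

def normalize_block(histograms, eps=1e-6):
    """L2-normalize the concatenated cell histograms of one block."""
    flat = [v for hist in histograms for v in hist]
    norm = math.sqrt(sum(v * v for v in flat) + eps * eps)
    return [v / norm for v in flat]

# Hypothetical cells: one vertical edge and one horizontal edge.
vertical_edge = [[0, 0, 10, 10] for _ in range(4)]
horizontal_edge = [[0] * 4, [0] * 4, [10] * 4, [10] * 4]

h_v = cell_histogram(vertical_edge)
h_h = cell_histogram(horizontal_edge)
block = normalize_block([h_v, h_h])

# Doubling the contrast leaves the normalized block descriptor unchanged.
brighter = normalize_block([
    cell_histogram([[2 * p for p in row] for row in vertical_edge]),
    cell_histogram([[2 * p for p in row] for row in horizontal_edge]),
])
```

The two edge directions fall into different orientation bins, and the block normalization removes the effect of the contrast change, which is the illumination invariance described in the fourth stage.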

3.4 Classification
3.4.1. Support Vector Machine (SVM)
A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which separates the different classes and is used to categorize new examples, as shown in Figure 25 below.
SVM can be used for classification, regression and outlier detection. It is, however, used mainly for classification purposes. In this algorithm, each data item is plotted as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the hyperplane that differentiates the classes well.

Support vectors are the coordinates of the individual observations that lie closest to the separating hyperplane. The Support Vector Machine is the frontier (hyperplane/line) which best segregates the two classes.

The advantages of support vector machines are:
Effective in high dimensional spaces and in cases where number of dimensions is greater than the number of samples.

Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.

Generalization of binary and regression forms, and notation simplification.

Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.

The disadvantages of support vector machines include:
If the number of features is much greater than the number of samples, avoiding over-fitting when choosing the kernel function and the regularization term is crucial.

SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.

For Classification, SVM uses several kernels such as the polynomial kernel, linear kernel, and the gaussian radial basis function (RBF) kernel.
The parameters that are crucial in SVM are gamma, the regularization (C) parameter and the kernel. By varying these, we can achieve non-linear classification with better accuracy in a reasonable amount of time.
Kernel
The kernel defines whether we want a linear or a non-linear separation. The learning of the hyperplane in linear SVM is done by transforming the problem using some linear algebra. This is where the kernel plays a role.
For the linear kernel, the prediction for a new input is calculated using the dot product between the input $x$ and each support vector $x_i$ as follows: $f(x) = B_0 + \sum_i a_i \langle x, x_i \rangle$. This is an equation that involves calculating the inner products of a new input vector $x$ with all support vectors in the training data. The coefficients $B_0$ and $a_i$ (one for each support vector) must be estimated from the training data by the learning algorithm.

The polynomial kernel can be written as $K(x, x_i) = (1 + \sum_j x_j x_{ij})^d$ and the exponential kernel as $K(x, x_i) = \exp(-\gamma \sum_j (x_j - x_{ij})^2)$, where the sums run over the features $j$. Polynomial and exponential kernels calculate the separation line in a higher dimension, which is what I have used in my project, and it has given me better accuracy. This is called the kernel trick.
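The three kernels and the prediction equation can be written out as a small Python sketch. The support vectors, coefficients and B0 used here are made-up values for illustration; in practice they are estimated from the training data by the learning algorithm.

```python
import math

def linear_kernel(x, xi):
    """Inner product of two feature vectors."""
    return sum(a * b for a, b in zip(x, xi))

def polynomial_kernel(x, xi, d=2):
    """(1 + <x, xi>)^d"""
    return (1 + linear_kernel(x, xi)) ** d

def rbf_kernel(x, xi, gamma=0.5):
    """exp(-gamma * squared distance between x and xi)"""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, xi)))

def decision_function(x, support_vectors, coefficients, b0, kernel):
    """f(x) = B0 + sum_i a_i * K(x, x_i); the sign of f(x) gives the predicted class."""
    return b0 + sum(a * kernel(x, sv) for a, sv in zip(coefficients, support_vectors))

# Hypothetical model: two support vectors with opposite-signed coefficients.
support_vectors = [[1, 1], [-1, -1]]
coefficients = [0.5, -0.5]
b0 = 0.0

score_linear = decision_function([2, 0], support_vectors, coefficients, b0, linear_kernel)
score_rbf = decision_function([2, 0], support_vectors, coefficients, b0, rbf_kernel)
```

Swapping `linear_kernel` for `polynomial_kernel` or `rbf_kernel` changes only how similarity to each support vector is measured; the prediction equation itself stays the same.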

Regularization, C parameter
The regularization parameter (often termed the C parameter in Python’s sklearn library) tells the SVM optimization how much you want to avoid misclassifying each training example.

For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points.

The images below are examples of two different regularization parameter values.
C=1: The blue dot is misclassified, but the separation between the two classes is larger, so new points are less easily misclassified.

C=100: The blue dot is classified correctly, but the separation between the two classes is smaller, so new points are more easily misclassified.
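The trade-off controlled by C can be illustrated numerically with the standard soft-margin objective, 0.5·w² + C·(sum of hinge losses). The sketch below does not train an SVM: it only compares two hand-picked candidate hyperplanes on made-up 1-D data, one with a wide margin that misclassifies an outlier and one with a narrow margin that classifies every point.

```python
def svm_objective(w, b, points, labels, c):
    """Soft-margin objective for a 1-D linear SVM: 0.5*w^2 + C * sum of hinge losses."""
    hinge = sum(max(0.0, 1.0 - y * (w * x + b)) for x, y in zip(points, labels))
    return 0.5 * w * w + c * hinge

# Hypothetical data: the positive point at -1.5 sits next to the negative class.
points = [-4, -3, -2, 2, 3, 4, -1.5]
labels = [-1, -1, -1, 1, 1, 1, 1]

wide = (0.5, 0.0)    # margin 1/|w| = 2, misclassifies the outlier
narrow = (4.0, 7.0)  # margin 1/|w| = 0.25, classifies every point

for c in (1, 100):
    cost_wide = svm_objective(*wide, points, labels, c)
    cost_narrow = svm_objective(*narrow, points, labels, c)
    chosen = "wide" if cost_wide < cost_narrow else "narrow"
    print(f"C={c}: wide={cost_wide}, narrow={cost_narrow}, preferred={chosen}")
```

With C=1 the wide-margin hyperplane has the lower cost even though it misclassifies one point, while with C=100 the narrow-margin hyperplane is preferred.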

Gamma
Figure 31. Gamma Value High (Left) vs Gamma Value Low (Right)
The gamma parameter defines how far the influence of a single training example reaches, with low values meaning ‘far’ and high values meaning ‘close’. In other words, with low gamma, points far away from the plausible separation line are considered in the calculation of the separation line, whereas with high gamma only the points close to the plausible line are considered.
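The reach of a training example can be made concrete by evaluating the exponential kernel at a fixed distance for a low and a high gamma. The helper name `influence` and the gamma values 0.1 and 10 are arbitrary choices for this illustration.

```python
import math

def influence(distance, gamma):
    """Exponential (RBF) kernel value for a training example at the given distance."""
    return math.exp(-gamma * distance ** 2)

low_gamma_far = influence(3.0, gamma=0.1)    # still a noticeable contribution
high_gamma_far = influence(3.0, gamma=10.0)  # practically zero
high_gamma_near = influence(0.1, gamma=10.0) # only close points matter
```

With gamma = 0.1 a point 3 units away still contributes about 0.41, whereas with gamma = 10 its contribution is effectively zero and only very close points influence the result.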