Viola Jones Algorithm (Face and smile detection 2022)


Have you ever wondered how those rectangles that pop up in handycams and cameras manage to detect your face, with some even offering smile detection? How do the manufacturers of these image-capturing devices implement such features, even in the cheapest cameras, without resorting to heavy computational algorithms that use Artificial Intelligence? Enter the Viola-Jones algorithm, which addresses precisely this.

The hardware I am using is a system with an i3 processor and 4 GB of RAM. For a face detection system to be a viable real-world option, it should be optimised to run on the most commonly available hardware.
The Viola-Jones algorithm for frontal face detection takes care of that. This face detection system is most clearly distinguished from previous approaches by how rapidly it detects faces. Operating on 384 by 288 pixel images, it detects faces at 15 frames per second on a conventional 700 MHz Intel Pentium III. We will be using the Python OpenCV library to work with the images.

Working of Viola-Jones Algorithm

The Viola-Jones algorithm consists of two stages: the training stage, followed by the detection stage.

We start with the detection stage for ease of explanation.

The first step is to convert an RGB image to greyscale.

As shown below; for explanation purposes, I am using my own picture.

OpenCV greyscaled image

Why greyscale?

Greyscale has only one channel, whereas RGB has three, so we do not have to deal with three separate channels (Red, Green, Blue). The Viola-Jones algorithm does not require these separate channels to perform its task. We work on a single channel whose values range from 0 to 255, from black (0) to white (255); the numbers in between are different shades of grey.
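
For reference, here is a minimal sketch of that conversion in OpenCV; the filename face.jpg is just a placeholder for your own image (note that OpenCV loads images in BGR order rather than RGB):

import cv2

img = cv2.imread('face.jpg')                  # placeholder filename; OpenCV loads as BGR
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # collapse the three channels into one
print(img.shape, gray.shape)                  # e.g. (288, 384, 3) vs (288, 384)
cv2.imwrite('face_gray.jpg', gray)            # save the single-channel result
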
The algorithm starts looking for the face from the corner of the image. The image below shows a crude visualisation of how this traversal proceeds across the image.

Viola-Jones Algorithm traversal

Step-1

During the traversal, the algorithm looks for features like the eyes, nose, eyebrows, etc., in the red rectangle region. In the second row, second column of the above image, we can see that the algorithm encounters eyes and eyebrows, but it also checks whether the region contains a nose, and whether it shows two eyes and two eyebrows. If not, the traversal continues.

Step-2

The algorithm traverses until it finds both eyebrows, both eyes, the nose and the lips. This region is highlighted in yellow. I have drawn big rectangles panning across ample space for the traversal, but in the algorithm the traversal step may be small, so when it detects a face with all the features, the next traversal step will find the same features again.

That means many rectangles will overlap each other, and that overlap gives the region a high likelihood of containing a face. When many boxes overlap the same region, it is highly probable that a face is present there.
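
As a rough sketch of this traversal, assuming a fixed 24 by 24 window, a step of 4 pixels, and a hypothetical looks_like_face() test standing in for the real feature checks:

def slide_windows(gray, win=24, step=4):
    # Yield the top-left corner and contents of every win-by-win sub-window,
    # moving `step` pixels at a time across the greyscale image.
    h, w = gray.shape
    for y in range(0, h - win + 1, step):
        for x in range(0, w - win + 1, step):
            yield x, y, gray[y:y + win, x:x + win]

# looks_like_face() is hypothetical; clusters of overlapping hits suggest a face:
# hits = [(x, y) for x, y, patch in slide_windows(gray) if looks_like_face(patch)]
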

I would recommend reading the original paper on the Viola-Jones algorithm, "Rapid Object Detection using a Boosted Cascade of Simple Features" by Paul Viola and Michael Jones.

How does the algorithm know what an eye, a nose, an eyebrow, etc. is?

Viola and Jones used Haar-like features, which are digital image features used in object detection. Earlier face detection was based directly on image pixel intensities, which was more computationally expensive since it involved all the channels.
A Haar-like feature considers adjacent rectangular regions at a specific location in a detection window, sums up the pixel intensities in each region, and calculates the difference between these sums. This difference is then used to categorise subsections of an image.
The bottom figure shows some of the Haar-like features.

Haar Features

Haar-like features include edge features, line features and four-rectangle features.

The region of lighter pixels is represented by the white space, and the region of darker pixels by the black space. These Haar regions are scalable. When we look at a typical face, the eyebrow is darker than the forehead (an edge feature), and the middle region of the mouth is darker than the lips (a line feature).

These Haar features are identified by training on many images to determine the standard Haar features that make up a face. The algorithm then identifies these features in the picture, as shown below.

Haar Feature on Face

We will consider an edge feature for the eyebrows from the above image. We get the Haar feature value below from the pixel intensity values. Note that the values are taken on a crude scale of 0 for pure white to 1 for pure black.

We calculate the average of the white pixels, and likewise for the dark pixels. Then we subtract the white pixel average from the dark pixel average to get a value. By training on lots of pictures, we obtain a threshold; if the value is above this threshold, we consider it an edge feature. Other features are tested in the same way.

For a pure edge feature, the black side would be 1, the white side 0, and the difference 1. So the difference in our picture should be greater than the threshold we get from training.

In our example it would be:

Feature value = dark pixel average − white pixel average = 0.66 − 0.26 = 0.4

If this value meets the threshold, then yes, this represents an edge feature. This helps in deciding whether the feature is there or whether to keep looking.
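
Here is a small sketch of that calculation on a made-up 4 by 4 patch, using the same 0 (pure white) to 1 (pure black) scale as above; the patch values and the 0.3 threshold are illustrative, not trained ones:

import numpy as np

def edge_feature_value(patch):
    # Split the patch into a top (lighter) half and a bottom (darker) half
    # and return mean(dark) - mean(white), as for an eyebrow/forehead edge feature.
    half = patch.shape[0] // 2
    white = patch[:half].mean()
    dark = patch[half:].mean()
    return dark - white

# made-up patch: lighter forehead pixels on top, darker eyebrow pixels below
patch = np.array([[0.2, 0.3, 0.3, 0.2],
                  [0.2, 0.3, 0.3, 0.2],
                  [0.7, 0.6, 0.7, 0.6],
                  [0.7, 0.6, 0.7, 0.6]])

THRESHOLD = 0.3  # a hypothetical trained threshold
value = edge_feature_value(patch)
print(round(value, 2), value > THRESHOLD)  # 0.4 True: treat it as an edge feature
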

How does the Viola-Jones Algorithm Achieve this Level of Performance?

As mentioned earlier, for this algorithm to be practical in real time, the processing of the image must also be fast.

It uses some computational hacks to achieve this. We will discuss these components now.

Integral Image:

As discussed, the above method is computationally costly. The larger the feature, the more expensive the calculation, and there is never just a single feature but lots of features in a single image. So we construct the integral image.

An integral image is the same size as the original image; the difference lies in how its values are calculated.

The above table represents the image. Suppose our trained feature is the shaded region; to calculate the feature value for this small image, with a feature of dimension 4 rows by 2 columns, we sum the pixels, and the total comes to 409.

But what about a large image or feature, say in the range of 10,000 by 10,000 pixels, in a live stream from a camera? The computer would have to add all the pixel values together, and the computation becomes hard in such situations. Here is the hack: the integral image.

In an integral image of the same size, the value in each square is the sum of all the pixels above it and to its left.

Table 3: the integral image

The operation is quite simple:

First, we take the value at the rectangle's bottom-right corner and subtract the value at the top-right corner. Then we add the value at the top-left corner, and finally subtract the value at the bottom-left corner, giving the total of the feature.

We need to perform only four operations regardless of the size of the feature. Even for a 10,000 by 10,000 image, there are still only four lookups in the integral image. Hence we save computation time, and this takes care of large features.
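
Here is a sketch of both approaches, using NumPy's cumulative sums to build the integral image (OpenCV's cv2.integral does the same, with an extra leading row and column of zeros):

import numpy as np

img = np.random.randint(0, 256, size=(288, 384), dtype=np.int64)

# Brute force: sum every pixel of a 4-row by 2-column feature region directly.
top, left = 100, 200
brute = img[top:top + 4, left:left + 2].sum()

# Integral image: each cell is the sum of all pixels above and to the left, inclusive.
ii = img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    # Four lookups regardless of feature size: bottom-right, minus top-right,
    # plus top-left, minus bottom-left (the last three taken just outside the rectangle).
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]        # remove everything above
    if left > 0:
        total -= ii[bottom, left - 1]      # remove everything to the left
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]     # add back the doubly-removed corner
    return total

assert rect_sum(ii, top, left, top + 3, left + 1) == brute
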

Training Classifiers:

This is the next component.

Training helps identify the features and learn their threshold values (how to know a feature exists).

The algorithm shrinks the training images to 24 by 24 pixels. Then it looks for the different Haar features and identifies which ones are common to faces.

The image is shrunk because a large image admits many more variations of each Haar feature, from 1 by 1 pixel to 2 by 2 pixels and beyond, so there can be an enormous number of features, as the sketch below suggests.

But when we try to fit a new image after training, we do not shrink it to 24 by 24. Instead, we scale the features up.

Once many frontal faces are given for training, the algorithm finds that particular features repeat across many faces, and it learns those features.
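
To get a feel for how quickly the features multiply even in a 24 by 24 window, this sketch counts every position and size of just one feature type, a two-rectangle vertical edge; counting all the feature types this way is what leads to the very large feature pools mentioned above:

def count_two_rect_features(win=24):
    # Count all placements of a two-rectangle Haar feature (two equal rectangles
    # stacked vertically) inside a win-by-win window, over every size and position.
    count = 0
    for w in range(1, win + 1):          # feature width in pixels
        for h in range(2, win + 1, 2):   # total height; must split into two equal halves
            count += (win - w + 1) * (win - h + 1)   # number of (x, y) placements
    return count

print(count_two_rect_features())  # 43200 placements for this single feature type
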

AdaBoost

The boosting technique provides a solution by combining many weak learners into a strong learner. Boosting's uniqueness is that it concentrates on the training records that are hard to classify and over-represents them in the next iteration's training set.

AdaBoost is one of the most popular implementations of the boosting ensemble approach. It is adaptive because it assigns a weight (α) to each base model based on that model's accuracy, and changes the weights of the training records (w) based on the accuracy of the prediction.

F(x) = α₁f₁(x) + α₂f₂(x) + α₃f₃(x) + … + αₙfₙ(x)

We consider features that classify correctly more than 50% of the time as weak learners. F(x) is then a strong classifier built from these weak classifiers.

So while an image can contain a huge number of features, in the tens of thousands, with this algorithm we may need only around 1,000 features (weak learners) that together classify accurately.

How we find the most important features:

Let's say we have 10 pictures: 5 faces and 5 non-faces. We identify an important feature and apply it to all the images. It classifies 3 as true positives, 3 as true negatives, 2 as false negatives and 2 as false positives.

Our second feature can then complement what the first one could not identify, and likewise the third feature complements the first two. The strengths of the individual features are combined.

AdaBoost gives more weight to the records where errors were made. It then identifies the feature that handles these errors (complementing the earlier features), i.e. it finds features that correctly classify these error images.
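
Here is a compressed sketch of one AdaBoost round on those 10 pictures, with made-up labels and predictions that reproduce the 3/3/2/2 split above; the α and weight-update formulas are the standard AdaBoost ones:

import numpy as np

y    = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1, -1])  # 5 faces, 5 non-faces
pred = np.array([1, 1, 1, -1, -1, -1, -1, -1, 1, 1])  # 3 TP, 2 FN, 3 TN, 2 FP
w = np.full(10, 0.1)                                  # start with uniform weights

err = w[pred != y].sum()               # weighted error rate = 0.4 (better than 50%)
alpha = 0.5 * np.log((1 - err) / err)  # this weak learner's vote, about 0.2

w *= np.exp(-alpha * y * pred)         # shrink correct records, grow wrong ones
w /= w.sum()                           # renormalise to sum to 1
print(w)  # the 4 misclassified pictures now weigh more when picking the next feature
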

The Viola-Jones algorithm uses one more hack: the concept of cascading.

It takes a sub-window and looks for the first feature. If it is present, we go further, looking for the other features. If not, we reject the sub-window, i.e. we do not test the remaining features.

If the first feature is present, we look for the second feature in the sub-window. If the second feature is not there, we reject the sub-window; if it is present, we go on to the third feature, and so on. The advantage is that we save a lot of computation time, and logically, only if all the features that characterise a face are present can it be a face. Otherwise, the detector would have to traverse the entire image repeatedly for each feature.
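
A minimal sketch of that cascade logic, with hypothetical stage tests standing in for the real boosted classifiers:

def cascade_says_face(window, stages):
    # Run the sub-window through each stage in order; the first failed stage
    # rejects it immediately, so most non-face windows exit after a stage or two.
    for stage in stages:
        if not stage(window):
            return False   # early rejection: later stages are never evaluated
    return True            # survived every stage: very likely a face

# `stages` would be the trained boosted classifiers, cheapest first, e.g. (hypothetical):
# stages = [has_first_feature, has_second_feature, has_third_feature, ...]
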

Code Implementation:

				
# Face Recognition
import cv2

# We load the cascades for the face, eyes and smile using the cv2 method CascadeClassifier.
face_cascade = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')
eye_cascade = cv2.CascadeClassifier('haarcascade_eye.xml')
smile_cascade = cv2.CascadeClassifier('haarcascade_smile.xml')

def detect(gray, frame):
    # We create a function that takes as input the greyscale image (gray) and the
    # original image (frame), and returns the original image with detector rectangles.
    faces = face_cascade.detectMultiScale(gray, 1.3, 5)
    # We apply the detectMultiScale method from the face cascade to locate
    # one or several faces in the image.
    for (x, y, w, h) in faces:
        # For each detected face: x, y are the coordinates of the upper-left corner
        # of the rectangle; w, h are the width and height of the rectangle.
        cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 0, 0), 2)
        # We paint a rectangle around the face; (x + w, y + h) is the lower-right corner.
        roi_gray = gray[y:y + h, x:x + w]
        # We get the region of interest in the greyscale image.
        roi_color = frame[y:y + h, x:x + w]
        # We get the region of interest in the coloured image.
        eyes = eye_cascade.detectMultiScale(roi_gray, 1.1, 3)
        # We apply the detectMultiScale method to locate one or several eyes
        # inside the face region.
        for (ex, ey, ew, eh) in eyes:  # For each detected eye:
            cv2.rectangle(roi_color, (ex, ey), (ex + ew, ey + eh), (0, 255, 0), 2)
            # We paint a rectangle around the eyes, inside the referential of the face.
        smiles = smile_cascade.detectMultiScale(roi_gray, 1.7, 22)
        # Similarly, we apply the smile cascade inside the face region.
        for (sx, sy, sw, sh) in smiles:  # For each detected smile:
            cv2.rectangle(roi_color, (sx, sy), (sx + sw, sy + sh), (0, 0, 255), 2)
    # We return the image with the detector rectangles.
    return frame

video_capture = cv2.VideoCapture(0)
# We turn the webcam on; if you are using an external webcam, try the value 1.
while True:  # We repeat until 'q' is pressed:
    _, frame = video_capture.read()
    # We read the latest frame from the webcam.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # We do the colour transformation to greyscale.
    canvas = detect(gray, frame)  # We get the output of our detect function.
    cv2.imshow('Video', canvas)  # We display the output.
    if cv2.waitKey(1) & 0xFF == ord('q'):  # If we press 'q' on the keyboard:
        break  # We stop the loop.
video_capture.release()  # We turn the webcam off.
cv2.destroyAllWindows()
# We destroy all the windows inside which the images were displayed.

The Output

face detected

Conclusion:

We have built a face detection system using the Viola-Jones algorithm. Note that we are not using any deep learning library here; we are using the OpenCV library to implement frontal face detection, and we also added cascades for detecting eyes and smiles. Several deep learning algorithms perform better at detecting frontal faces and smiles, but the idea was to share a technique that is still widely used for detecting faces and smiles in cameras.
