The Viola-Jones algorithm was the first object detection framework to provide competitive detection rates in real time. It was proposed by Paul Viola and Michael Jones in 2001 and is named after them. Although it can be trained to detect a variety of object classes by picking out features significant for each class, it was primarily motivated by the problem of face detection.
The first step towards solving a large problem
Humans are extremely efficient at recognizing faces, even within cluttered images and scenes, and we largely take this ability for granted. The difficulty becomes apparent when we try to do the same on a computer, which sees an image only as a matrix of values, either binary (0/1) or in the 0-255 range. The Viola-Jones algorithm tackled the problem at its roots with a very basic yet effective approach.
The Viola-Jones algorithm is robust: it has a very high detection rate and a very low false-positive rate (of the order of 1 in 10^6), and it is fast enough to run in real time for practical applications that process at least 2 frames per second. Its main limitation is that it performs face detection only, not recognition. The algorithm has 4 stages of processing.
1. Haar Feature Selection: Haar features, in simple words, are features computed from the contrast between adjacent regions of the face, reduced to a hard binary decision to keep the calculations simple. Most faces share some similar properties: the eyes are darker than the upper cheeks, the nose is brighter than the eyes, the forehead is brighter than the eyes, and so on.
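To make the contrast idea concrete, here is a minimal sketch of a two-rectangle Haar-like feature computed directly from pixel sums. The function name `two_rect_feature`, the toy patch, the sign convention (lower half minus upper half), and the threshold of 500 are all assumptions chosen for illustration, not values from the original algorithm.

```python
def two_rect_feature(image, top, left, height, width):
    """Value of a vertical two-rectangle Haar-like feature.

    The window is split into an upper and a lower half of equal size.
    The value is sum(lower half) - sum(upper half), so a dark band above
    a bright band (for example eyes above cheeks) gives a large positive
    value. The sign convention is a choice made for this sketch.
    """
    half = height // 2
    upper = sum(image[r][c]
                for r in range(top, top + half)
                for c in range(left, left + width))
    lower = sum(image[r][c]
                for r in range(top + half, top + 2 * half)
                for c in range(left, left + width))
    return lower - upper


# Toy 4x4 grayscale patch: a dark band (eyes) above a bright band (cheeks).
patch = [
    [10, 10, 10, 10],
    [10, 10, 10, 10],
    [200, 200, 200, 200],
    [200, 200, 200, 200],
]

eye_cheek_response = two_rect_feature(patch, 0, 0, 4, 4)

# "Hard binary" decision: compare the response with a threshold.
# The threshold here is hypothetical; in practice it is learned.
is_candidate = eye_cheek_response > 500
```

A real detector evaluates a very large number of such features at many positions and scales, which is why summing pixels directly like this quickly becomes too slow.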
2. Creating the Integral Image: Four properties are used for the initial matching process; together they describe how a given face compares with a standard one. These are:
The eye region is darker than the cheeks and forehead.
The nose appears brighter than the eyes.
The location and size of the features: eyes, mouth, and nose.
Value of the gradient of pixel intensities.
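Viola-Jones typically uses two-rectangle features, whose value is defined as (sum of the pixels under the black rectangle) - (sum of the pixels under the white rectangle). To evaluate these sums quickly, the face image is first transformed into an integral image, in which the entry at any pixel (x, y) is the sum of all pixels above and to the left of (x, y). The integral image can be computed in one pass through the image, after which the sum over any rectangle needs only four lookups.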
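The integral image can be sketched in a few lines. This is an illustrative version under stated assumptions: the helper names `integral_image` and `rect_sum` are made up for this example, and the table is padded with an extra row and column of zeros so that rectangles touching the image border need no special case.

```python
def integral_image(image):
    """Return ii where ii[r][c] is the sum of image[0..r-1][0..c-1].

    The extra zero row and column are a convenience of this sketch.
    """
    rows, cols = len(image), len(image[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0  # running sum of the current row
        for c in range(cols):
            row_sum += image[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle, using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
ii = integral_image(image)

total = rect_sum(ii, 0, 0, 3, 3)  # whole image: 45
inner = rect_sum(ii, 1, 1, 2, 2)  # 5 + 6 + 8 + 9 = 28

# A two-rectangle feature is then just the difference of two rect_sum calls.
top_row_vs_bottom_row = rect_sum(ii, 2, 0, 1, 3) - rect_sum(ii, 0, 0, 1, 3)
```

The table is built in a single pass over the pixels, and after that the cost of a rectangle sum does not depend on the rectangle's size.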
3. AdaBoost: AdaBoost is the learning algorithm at the crux of Viola-Jones face detection. It builds a number of weak classifiers from the features, each of which has a significant error on its own, and then forms a linear combination of these weak classifiers to create one robust strong classifier.
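The hypothesis is defined as: $C(x) = \theta\left(\sum_{t} \alpha_t h_t(x) + b\right)$
where $h_t(x)$ is the t-th weak classifier, which takes the value 1 or -1 depending on whether it classifies x as a face or a non-face. Each weak classifier is multiplied by a weight $\alpha_t$, $\theta$ is a threshold (step) function that turns the weighted sum into the final face/non-face decision, and $b$ is a constant term.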
The advantage of this approach is that the training error converges to 0 quickly, provided each weak classifier does better than chance. It is also simple to implement and remains accurate when used in real time.
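The boosting idea can be illustrated with a toy version of discrete AdaBoost. This is a sketch under assumptions, not the training procedure of the original detector: the weak classifiers are threshold "stumps" on a single scalar feature value, the per-classifier weights are called `alpha`, the constant term b defaults to 0, and all function names are hypothetical.

```python
import math


def stump(x, threshold, polarity):
    """Weak classifier: returns polarity if x > threshold, else -polarity."""
    return polarity if x > threshold else -polarity


def train_adaboost(xs, ys, rounds):
    """Toy discrete AdaBoost. xs: feature values, ys: labels in {+1, -1}.

    Returns a list of (alpha, threshold, polarity) weak classifiers.
    """
    n = len(xs)
    weights = [1.0 / n] * n  # start with uniform sample weights
    values = sorted(set(xs))
    thresholds = [values[0] - 1.0] + [(a + b) / 2.0
                                      for a, b in zip(values, values[1:])]
    model = []
    for _ in range(rounds):
        # Pick the stump with the lowest weighted error.
        best = None
        for threshold in thresholds:
            for polarity in (1, -1):
                error = sum(w for x, y, w in zip(xs, ys, weights)
                            if stump(x, threshold, polarity) != y)
                if best is None or error < best[0]:
                    best = (error, threshold, polarity)
        error, threshold, polarity = best
        error = min(max(error, 1e-10), 1.0 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1.0 - error) / error)
        model.append((alpha, threshold, polarity))
        # Increase the weight of misclassified samples, then renormalize.
        weights = [w * math.exp(-alpha * y * stump(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def strong_classify(model, x, b=0.0):
    """Sign of the weighted vote of all weak classifiers, plus constant b."""
    score = sum(alpha * stump(x, threshold, polarity)
                for alpha, threshold, polarity in model) + b
    return 1 if score >= 0 else -1


# Toy data: no single threshold separates the two classes.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]

model = train_adaboost(xs, ys, rounds=3)
predictions = [strong_classify(model, x) for x in xs]
```

On this toy data a single stump always misclassifies some samples, while the weighted combination of three stumps classifies every training sample correctly, which is the effect the linear combination is meant to achieve.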
4. Cascading Classifiers: Once the classifiers are ready, a number of them are stacked one after the other to reduce the effective number of false positives and improve the accuracy of the model. The classifiers differ in the number of features they use to tell samples apart, and are placed one after another. For example, if we choose 3 classifiers that filter the results in sequence with false positive rates of 50%, 40%, and 20%, then the net false positive rate of the whole model is only 0.5 × 0.4 × 0.2 = 4%. Thus, cascading helps to reduce the number of false results from the model.
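The cascade arithmetic and the early-rejection behaviour can be sketched as follows. The function names, the dictionary-based "window", and the two toy stages are hypothetical, and the rate calculation assumes that the stages make their false positive errors independently of each other.

```python
def cascade_false_positive_rate(stage_rates):
    """Net false positive rate of stages applied in sequence.

    A non-face survives the cascade only if every stage wrongly accepts
    it, so (assuming independent stages) the rates multiply.
    """
    rate = 1.0
    for stage_rate in stage_rates:
        rate *= stage_rate
    return rate


def run_cascade(stages, window):
    """Accept a window only if every stage accepts it."""
    for stage in stages:
        if not stage(window):
            return False  # rejected early, later stages are never run
    return True


# The three-stage example: 0.5 * 0.4 * 0.2 = 0.04
net_rate = cascade_false_positive_rate([0.5, 0.4, 0.2])

# Hypothetical stages: a cheap test first, a stricter one afterwards.
stages = [
    lambda w: w["contrast"] > 10,
    lambda w: w["symmetry"] > 0.5,
]
face_like = run_cascade(stages, {"contrast": 50, "symmetry": 0.9})
background = run_cascade(stages, {"contrast": 5, "symmetry": 0.9})
```

Because most windows in an image are background, rejecting them in the first cheap stages is also what keeps the detector fast.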
This is the gist of the Viola-Jones face detection algorithm.