VGG 16: An Advanced Approach Towards Accurate Large-Scale Image Recognition

Sept. 6, 2017, 11:14 a.m. By: Prakarsh Saxena

VGG 16

There are numerous image recognition algorithms in computer science, and regular efforts are made to build machines that can identify objects with an accuracy close to, or even greater than, that of humans (so-called superhuman accuracy). One such approach is the VGG network, based on convolutional neural networks (CNNs), which was a significant milestone towards that objective.

VGG 16: A Highly Accurate Image Recognition Algorithm

The VGG network architecture was introduced by Simonyan and Zisserman in their 2014 paper, Very Deep Convolutional Networks for Large-Scale Image Recognition. The network is characterized by its simplicity: it uses stacks of 3×3 convolutional layers on top of each other in increasing depth. The number '16' refers to the number of weight layers in the neural network. The VGG team used the network in the ILSVRC 2014 competition, and at that time a 16-layer neural network was considered very deep; a Keras model of the network is available today.

During training, the input to the CNN was a fixed-size 224 × 224 RGB image. The mean RGB value, computed on the training set, was subtracted from each pixel. The image was then passed through a stack of convolutional layers, which use filters with a very small receptive field of 3 × 3. In the VGG 16 model, the convolution stride was fixed to 1 pixel, and the spatial padding of each convolutional layer's input was chosen so that the spatial resolution was preserved after convolution, i.e. the padding was 1 pixel for the 3 × 3 layers. Spatial pooling was carried out by five max-pooling layers, which follow some (but not all) of the convolutional layers. Max-pooling was performed over a 2 × 2 pixel window with a stride of 2. The stack of convolutional layers was followed by three fully-connected (FC) layers: the first two had 4096 channels each, and the third performed the 1000-way ILSVRC classification and thus contained 1000 channels (one per class). The final layer was the soft-max layer. The configuration of the fully-connected layers was the same in all networks. All hidden layers were equipped with the rectification (ReLU) non-linearity.
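Counting the layers above gives 13 convolutional plus 3 fully-connected weight layers, hence "16". As a sanity check, the parameter count implied by this configuration can be computed in plain Python (the block layout below follows the 16-layer configuration from the paper; the helper name is illustrative):

```python
# VGG 16: five conv blocks of 3x3 filters, listed as
# (number of filters, number of conv layers per block).
CONV_BLOCKS = [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]

def vgg16_param_count(num_classes=1000):
    """Count trainable parameters from the architecture described above."""
    params, in_ch = 0, 3  # the input image has 3 RGB channels
    for filters, n_layers in CONV_BLOCKS:
        for _ in range(n_layers):
            params += (3 * 3 * in_ch + 1) * filters  # 3x3 kernels + biases
            in_ch = filters
    # Five 2x2/stride-2 poolings shrink 224 -> 7, so the flattened
    # feature map entering the FC layers has 7 * 7 * 512 = 25088 values.
    fc_sizes = [7 * 7 * 512, 4096, 4096, num_classes]
    for n_in, n_out in zip(fc_sizes, fc_sizes[1:]):
        params += (n_in + 1) * n_out  # weights + biases
    return params

print(vgg16_param_count())  # -> 138357544
```

Note that the roughly 138 million parameters are dominated by the first fully-connected layer, which alone accounts for over 100 million of them.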

The training was carried out by optimising the multinomial logistic regression objective using mini-batch gradient descent with momentum. The batch size was set to 256 and the momentum to 0.9. The training was regularised by weight decay and by dropout regularisation for the first two fully-connected layers (dropout ratio set to 0.5). The learning rate was initially set to 10^-2 and then decreased by a factor of 10 whenever the validation-set accuracy stopped improving. In total, the learning rate was decreased 3 times, and learning was stopped after 370K iterations (74 epochs). To obtain the fixed-size 224 × 224 ConvNet inputs, images were randomly cropped from rescaled training images (one crop per image per SGD iteration).
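The "divide by 10 when validation accuracy stops improving" rule can be sketched in plain Python (the function name and the exact plateau test are illustrative assumptions; the paper does not spell out its patience criterion):

```python
def schedule_lr(val_accs, lr0=1e-2, factor=0.1, max_drops=3):
    """Return the learning rate used after each epoch, dropping by
    `factor` whenever validation accuracy fails to improve, at most
    `max_drops` times (the paper's schedule: 3 drops from 1e-2)."""
    lr, best, drops = lr0, float('-inf'), 0
    rates = []
    for acc in val_accs:
        if acc <= best and drops < max_drops:
            lr *= factor  # plateau detected: divide the rate by 10
            drops += 1
        best = max(best, acc)
        rates.append(lr)
    return rates
```

With the initial rate of 10^-2 and three drops, the final learning rate works out to 10^-5.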

This model is available for both the Theano and TensorFlow backends, and can be built with either the "channels_first" data format (channels, height, width) or the "channels_last" data format (height, width, channels).
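The two data formats differ only in the position of the channel axis; converting a single image between them is a transpose (NumPy is used here purely for illustration):

```python
import numpy as np

# A 224 x 224 RGB image in "channels_last" layout: (height, width, channels)
img_last = np.zeros((224, 224, 3), dtype=np.float32)

# The same image in "channels_first" layout: (channels, height, width)
img_first = np.transpose(img_last, (2, 0, 1))

print(img_last.shape, img_first.shape)  # (224, 224, 3) (3, 224, 224)
```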

It was observed that VGG 16 was extremely challenging to train, especially on large datasets where the convergence of such deep networks comes into question. The authors therefore first trained smaller versions of VGG with fewer weight layers, which were easier to train. Once converged, the smaller networks were used as initializations for the larger, deeper networks, a process termed pre-training.
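The idea of pre-training as initialization can be sketched abstractly: layers that exist in the converged shallow network keep their trained weights, while the new layers of the deeper network start from random values. (The layer names and the random-init scheme below are illustrative, not the paper's exact procedure.)

```python
import random

def init_from_shallow(deep_layer_names, shallow_weights):
    """Initialize a deeper net: copy weights for layers that also exist
    in the converged shallow net, randomly initialize the rest."""
    weights = {}
    for name in deep_layer_names:
        if name in shallow_weights:
            weights[name] = shallow_weights[name]          # transferred
        else:
            weights[name] = [random.gauss(0.0, 0.01)]      # fresh random init
    return weights

# Hypothetical example: 'conv2' is new in the deeper network.
shallow = {"conv1": [0.5], "fc": [1.5]}
deep = init_from_shallow(["conv1", "conv2", "fc"], shallow)
```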

The VGG 16 model works extremely well in terms of accuracy: the network achieves an astounding 92.7% top-5 test accuracy on ImageNet, a huge dataset of over 14 million images classified into 1000 categories.

Pre-training, however, is quite obviously a time-consuming process: pieces of the network must be trained separately before they can serve as the initialization for the full, deeper network. This leads to the major drawbacks of the VGG16 network:

  • It is extremely slow to train and consumes a lot of time and machine resources.

  • The network's weights themselves are quite large (in terms of disk space and bandwidth).

The VGG16 weight file is around 533 MB in size, so it's quite clear that deploying VGG is a tedious task. Yet VGG is still used in many deep-learning image-classification problems because of its high accuracy.

The VGG 16 model paved the way for more advanced neural networks such as the ResNet architecture, which has been successfully trained at depths of around 50-200 layers on ImageNet, and over 1000 layers on the CIFAR-10 dataset. To learn more about the network and the research behind it, you can read the paper here.