YOLOv3: An Incremental Improvement

March 31, 2018, 2:47 p.m. By: Kirti Bakshi


YOLO, short for You Only Look Once, was introduced in 2016 as a new approach to the object detection problem. Before YOLO, object detection models typically performed a detection step first and then ran classification on top of the detected ROIs (Regions of Interest).

YOLO instead framed detection as a regression problem, using a single neural network to perform both detection and classification. Because the network is end to end, it can be optimized directly for detection performance. Since then many new networks have tried to solve the object detection problem, but none was faster than YOLO. It did have some drawbacks, however, which were addressed in the later versions YOLOv2 and YOLOv3:

  • lower MAP (Mean Average Precision)

  • localization errors.

Coming to YOLOv3:

YOLOv3 is both extremely fast and accurate. In MAP measured at .5 IOU, YOLOv3 is on par with Focal Loss (RetinaNet) but about 4x faster, which brings the already fast YOLO family on par with the best accuracies. Moreover, you can easily trade off between speed and accuracy simply by changing the size of the model, with no retraining required!
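Since the comparison above is stated at .5 IOU, here is a minimal sketch of how IOU (intersection over union) is usually computed for two axis-aligned boxes. The function names and the `(x1, y1, x2, y2)` corner format are assumptions made for illustration; they do not come from the YOLO code base.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matches_at_threshold(pred_box, gt_box, threshold=0.5):
    """A prediction counts as a match when its IOU reaches the threshold."""
    return iou(pred_box, gt_box) >= threshold
```

At a threshold of 0.5, a predicted box only has to cover the ground truth loosely to count as a match, which is why results at .5 IOU say little about how tightly the boxes fit.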

YOLOv3 achieves a MAP of 57.9 on the COCO dataset at IOU 0.5; the table below shows the comparisons:


But What Exactly Changed And What Do We Mean By The So-Called Incremental Improvements?

Bounding Box Predictions:

Just like YOLOv2, YOLOv3 uses dimension clusters to generate anchor boxes. As YOLOv3 is a single network, the losses for classification and objectness need to be calculated separately but from the same network. YOLOv3 predicts the objectness score using logistic regression; the target is 1 for the bounding box prior that overlaps a ground truth object more than any other prior does.
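To make the role of the anchor boxes (bounding box priors) concrete, the sketch below decodes one raw prediction into a box. It follows the formulation given in the YOLOv2 and YOLOv3 papers: the center is a sigmoid offset inside a grid cell, and the width and height scale the prior exponentially. The function and argument names are hypothetical, and all values are in grid-cell units.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, prior):
    """Turn raw network outputs into a box (center x, center y, width, height).

    t     -- raw outputs (tx, ty, tw, th)
    cell  -- top-left corner (cx, cy) of the grid cell making the prediction
    prior -- anchor box size (pw, ph) obtained from dimension clusters
    """
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = prior

    bx = sigmoid(tx) + cx    # the center stays inside the predicting cell
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)   # the prior is stretched or shrunk
    bh = ph * math.exp(th)
    return bx, by, bw, bh
```

With all raw outputs at zero, the decoded box sits in the middle of its cell and has exactly the size of the prior, so the network only has to learn corrections to a reasonable starting shape.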

Unlike Faster RCNN, it assigns only one bounding box prior to each ground truth object, and any error in this prior incurs both classification and detection (objectness) loss. Other bounding box priors may overlap the ground truth by more than the threshold (.5 in the paper) without being the best one; the paper ignores these predictions. Priors that are not assigned to any ground truth object incur only the detection (objectness) loss, not the classification or coordinate loss.
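The assignment rule can be sketched as follows, using the rule as stated in the YOLOv3 paper: the best-overlapping prior is assigned to the object, other priors above the threshold are ignored, and the rest contribute to the objectness loss only. This is an illustrative simplification, not the Darknet implementation: it compares box shapes only (as if all boxes shared one center), and the function names and role labels are made up for this example.

```python
def iou_wh(wh_a, wh_b):
    """IOU of two boxes that share the same center, given as (width, height)."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def assign_priors(gt_wh, priors, ignore_threshold=0.5):
    """Label each prior with the role it plays in the loss for one ground truth box."""
    overlaps = [iou_wh(gt_wh, p) for p in priors]
    best = max(range(len(priors)), key=lambda i: overlaps[i])

    roles = []
    for i, overlap in enumerate(overlaps):
        if i == best:
            # Objectness target 1; also incurs coordinate and class loss.
            roles.append("assigned")
        elif overlap > ignore_threshold:
            # Good overlap but not the best: the prediction is ignored.
            roles.append("ignored")
        else:
            # Not assigned to the object: objectness loss only.
            roles.append("objectness_only")
    return roles
```

For a 10x10 ground truth box and priors of 10x10, 8x10 and 2x2, the three roles come out as assigned, ignored and objectness_only respectively.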

Class Predictions:

Instead of a regular softmax layer, YOLOv3 uses independent logistic classifiers for each class. This makes the classification multi-label, so a single box can carry several overlapping labels at once.
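The difference between the two choices is easy to see in a small sketch. Softmax makes the class scores compete, because they must sum to 1, while independent logistic (sigmoid) classifiers score each class on its own. The class names, logits and the 0.5 decision threshold below are hypothetical values chosen for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    """Mutually exclusive scores: they always sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def independent_scores(logits):
    """One logistic classifier per class; scores do not compete."""
    return [sigmoid(z) for z in logits]


def predicted_labels(logits, names, threshold=0.5):
    """Every class whose independent score clears the threshold is kept."""
    scores = independent_scores(logits)
    return [name for name, s in zip(names, scores) if s > threshold]


names = ["person", "woman", "dog"]   # hypothetical overlapping labels
logits = [3.0, 2.5, -4.0]            # hypothetical raw class outputs
labels = predicted_labels(logits, names)
```

Here `labels` contains both "person" and "woman", whereas softmax would have to split its probability mass between the two.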

Predictions across scales:

To support detection at varying scales, YOLOv3 predicts boxes at 3 different scales. Features are extracted from each scale using a method similar to that of feature pyramid networks.
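As a rough illustration of what predicting at 3 scales means for the output, the sketch below computes the shape of each prediction tensor. It assumes a 416x416 input and strides of 32, 16 and 8, which are common settings rather than something stated in this article. The 3 boxes per scale and the 4 + 1 + 80 values per box (box offsets, objectness, COCO classes) follow the YOLOv3 paper. The function name is hypothetical.

```python
def head_output_shapes(input_size=416, strides=(32, 16, 8),
                       boxes_per_scale=3, num_classes=80):
    """Return (grid, grid, channels) for each detection scale."""
    shapes = []
    for stride in strides:
        grid = input_size // stride
        # Per box: 4 box offsets + 1 objectness score + one score per class.
        channels = boxes_per_scale * (4 + 1 + num_classes)
        shapes.append((grid, grid, channels))
    return shapes


shapes = head_output_shapes()
total_boxes = sum(g * g * 3 for g, _, _ in shapes)
```

Under these assumptions the grids are 13x13, 26x26 and 52x52 with 255 channels each, for 10647 candidate boxes per image. The finest grid is the one that helps with small objects.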

Feature Extractor:

YOLOv2 used Darknet-19 as its backbone feature extractor; YOLOv3 uses a new network, Darknet-53! Darknet-53 has 53 convolutional layers, is deeper than the YOLOv2 backbone, and adds residual (shortcut) connections. It is more powerful than Darknet-19 and more efficient than both ResNet-101 and ResNet-152.
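The sketch below shows, in plain Python, what a residual (shortcut) connection does and how the layer count adds up. The number of residual blocks per stage (1, 2, 8, 8, 4) is taken from the architecture table in the YOLOv3 paper. Reading the total as 52 convolutions plus the final connected layer is one interpretation of that table, and the function names are hypothetical.

```python
RESIDUAL_BLOCKS_PER_STAGE = (1, 2, 8, 8, 4)  # from the paper's architecture table


def count_conv_layers(blocks_per_stage=RESIDUAL_BLOCKS_PER_STAGE):
    """Count the convolutional layers in a Darknet-53 style backbone."""
    conv_layers = 1               # initial 3x3 convolution
    for n_blocks in blocks_per_stage:
        conv_layers += 1          # stride-2 convolution that downsamples
        conv_layers += 2 * n_blocks  # each block: 1x1 conv, then 3x3 conv
    return conv_layers


def residual_block(x, f):
    """Shortcut connection: add the block's input to the block's output."""
    return [xi + fi for xi, fi in zip(x, f(x))]


n_conv = count_conv_layers()
```

This count gives 52 convolutions in the stages, and the final connected layer brings the total to 53. The shortcut in `residual_block` lets the input pass through unchanged alongside the learned transformation, which is generally what makes networks of this depth trainable.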

What Are The Improvements?

  • Average precision for small objects improved; it is now better than Faster RCNN, although RetinaNet is still better here.

  • With the increase in MAP, there was a decrease in the localization errors.

  • Predictions for the same object at different scales or aspect ratios improved because of the added feature-pyramid-like method, a method that arguably deserves a name of its own.

  • A significant increase in MAP overall.

So, How Do We Conclude?

YOLOv3 is fast and efficient, and its accuracy is on par with the best two-stage detectors (at 0.5 IOU), which makes it a very powerful object detection model. Applications of object detection in domains like robotics, retail, manufacturing and media need very fast models and usually accept a small compromise in accuracy, but YOLOv3 is also very accurate.

This makes it a strong choice for these kinds of applications, where speed is important either because:

  • The products need to be real-time or

  • The data is just too big.

For More Information: YOLO



Video Source: SuperDragon McFuzzypants