Mask R-CNN: Mask R-CNN For Object Detection And Instance Segmentation On Keras And TensorFlow.

July 1, 2018, 3:13 a.m. By: Kirti Bakshi

Mask R-CNN

The paper presents a conceptually simple, flexible, and general framework for object instance segmentation. The approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance.

This method, named Mask R-CNN, extends Faster R-CNN by adding a branch that predicts an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.

Moreover, Mask R-CNN is easy to generalize to other tasks. For example, it allows the estimation of human poses in the same framework.

The paper shows top results in all three tracks of the COCO suite of challenges, which include the following:

  • Instance Segmentation,

  • Bounding box Object Detection,

  • Person Keypoint Detection.

Mask R-CNN outperforms all existing single-model entries on every task, including the COCO 2016 challenge winners.


Over a short period of time, the vision community has rapidly improved object detection and semantic segmentation results. In large part, these advances have been driven by powerful baseline systems, such as the Fast/Faster R-CNN and Fully Convolutional Network (FCN) frameworks for object detection and semantic segmentation, respectively. These methods are conceptually intuitive and offer flexibility and robustness, together with fast training and inference time.

Their Main Goal:

The authors' main goal in this work is to develop a comparably enabling framework for instance segmentation. This is challenging because it requires the correct detection of all objects in an image while also precisely segmenting each instance.

However, the paper shows that a surprisingly simple, flexible, and fast system can surpass prior state-of-the-art instance segmentation results. The authors call their method Mask R-CNN.


Because Mask R-CNN is simple to implement and train given the Faster R-CNN framework, it facilitates a wide range of flexible architecture designs.

In principle, Mask R-CNN is an intuitive extension of Faster R-CNN, yet constructing the mask branch properly is critical for good results. Most importantly, Faster R-CNN was not designed for pixel-to-pixel alignment between network inputs and outputs. To fix the resulting misalignment, the paper proposes RoIAlign: a simple, quantization-free layer that faithfully preserves exact spatial locations.
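To make "quantization-free" concrete, here is a minimal pure-Python sketch of the idea behind RoIAlign. It is an illustration under stated assumptions, not the paper's or any library's implementation: the function names (bilinear_sample, roi_align), the 2 x 2 output size, the 2 x 2 sample points per bin, and the averaging step are all choices made for this example. The point is that box coordinates stay real-valued and feature values are read by bilinear interpolation, so nothing is rounded to the feature-map grid.

```python
import math


def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2-D feature map (list of rows) at real-valued (y, x)."""
    h, w = len(feature), len(feature[0])
    # Clamp so the four neighbouring cells stay inside the map.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = math.floor(y), math.floor(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feature[y0][x0] + dx * feature[y0][x1]
    bottom = (1 - dx) * feature[y1][x0] + dx * feature[y1][x1]
    return (1 - dy) * top + dy * bottom


def roi_align(feature, box, out_size=2, samples=2):
    """Pool a real-valued box (y1, x1, y2, x2) into an out_size x out_size grid.

    Each bin averages samples x samples interpolated points.
    No coordinate is rounded at any step (assumed settings, see lead-in).
    """
    y1, x1, y2, x2 = box
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            vals = []
            for sy in range(samples):
                for sx in range(samples):
                    # Regularly spaced sample points inside bin (i, j).
                    y = y1 + (i + (sy + 0.5) / samples) * bin_h
                    x = x1 + (j + (sx + 0.5) / samples) * bin_w
                    vals.append(bilinear_sample(feature, y, x))
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


# A horizontal ramp: the value of each cell equals its x coordinate.
ramp = [[float(x) for x in range(4)] for _ in range(4)]

aligned = roi_align(ramp, (0.5, 0.5, 2.5, 2.5))    # [[1.0, 2.0], [1.0, 2.0]]
shifted = roi_align(ramp, (0.5, 0.75, 2.5, 2.75))  # [[1.25, 2.25], [1.25, 2.25]]
```

Moving the box a quarter of a cell to the right moves every pooled value by exactly 0.25. A pooling step that first rounded the box to whole cells would lose that sub-cell shift, which is the misalignment the paragraph above describes.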

Related Work:

  • R-CNN

  • Instance Segmentation

Mask R-CNN:

Conceptually, Mask R-CNN is simple. For each candidate object, Faster R-CNN has two outputs:

  • A Class Label,

  • A Bounding-Box Offset.

To these, Mask R-CNN adds a third branch that outputs the object mask. Mask R-CNN is thus a natural and intuitive idea.
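The per-candidate outputs described above can be sketched as a small data structure plus a decoding step. This is a hedged illustration only: RoIPrediction and decode are hypothetical names, the (dx, dy, dw, dh) offset encoding follows the common R-CNN convention, and storing one small mask per class with a 0.5 binarization threshold is an assumption of this sketch. Resizing the mask to the final box is left out for brevity.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RoIPrediction:
    """Hypothetical container for the three outputs of one candidate object."""
    class_scores: List[float]                       # branch 1: one score per class
    box_offset: Tuple[float, float, float, float]   # branch 2: (dx, dy, dw, dh)
    masks: List[List[List[float]]]                  # branch 3: one m x m mask per class


def decode(candidate_box, pred, mask_threshold=0.5):
    """Combine a candidate box (x1, y1, x2, y2) with its three outputs."""
    x1, y1, x2, y2 = candidate_box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h

    # Class label: index of the highest-scoring class.
    label = max(range(len(pred.class_scores)), key=pred.class_scores.__getitem__)

    # Bounding-box offset: shift the centre, then rescale width and height.
    dx, dy, dw, dh = pred.box_offset
    cx, cy = cx + dx * w, cy + dy * h
    w, h = w * math.exp(dw), h * math.exp(dh)
    box = (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)

    # Object mask: take the mask of the chosen class and binarize it.
    mask = [[p >= mask_threshold for p in row] for row in pred.masks[label]]
    return label, box, mask


pred = RoIPrediction(
    class_scores=[0.1, 0.9],
    box_offset=(0.1, 0.0, 0.0, 0.0),
    masks=[
        [[0.9, 0.9], [0.9, 0.9]],   # mask for class 0 (ignored here)
        [[0.2, 0.8], [0.6, 0.4]],   # mask for class 1
    ],
)
label, box, mask = decode((0.0, 0.0, 10.0, 10.0), pred)
```

In this toy example the class branch picks class 1, the box branch moves the candidate one unit to the right, and the mask branch contributes the binarized mask of class 1. The class and box outputs are a handful of numbers per candidate, while the mask is a spatial grid.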

However, the additional mask output is distinct from the class and box outputs: it requires extraction of a much finer spatial layout of an object. The paper then introduces the key elements of Mask R-CNN, including pixel-to-pixel alignment, which is the main missing piece of Fast/Faster R-CNN.

The authors hope that this simple and effective approach will serve as a solid baseline for instance-level recognition and help ease future research in the area.

This was only a brief look at the paper. For more information, go through the links mentioned below.

Image Source And Information: GitHub

Link To The PDF: Click Here