DensePose: Dense Human Pose Estimation In The Wild

Feb. 17, 2018, 9:26 a.m. By: Kirti Bakshi


Dense pose estimation aims at mapping all human pixels of an RGB image to the 3D surface of the human body. Here, you are introduced to DensePose-COCO, a large-scale ground-truth dataset with image-to-surface correspondences manually annotated on 50K COCO images, and to DensePose-RCNN, a system trained to densely regress part-specific UV coordinates within every human region at multiple frames per second.

In this work:

This work introduces a task referred to as dense human pose estimation: establishing dense correspondences between an RGB image and a surface-based representation of the human body. The first step is an efficient annotation pipeline for gathering dense correspondences for 50K persons appearing in the COCO dataset.

The authors then use this dataset to train CNN-based systems that deliver dense correspondences ‘in the wild’, namely in the presence of background, occlusions and scale variations. The training set’s effectiveness is improved by training an ‘inpainting’ network that fills in missing ground-truth values, and clear improvements are reported over the best previously achievable results.
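The paper's inpainting network is a learned model; as a much simpler illustration of the idea of propagating sparse annotated values to every pixel of a body region, the sketch below fills in (U, V) values by nearest-neighbour lookup. All names and shapes here are illustrative, not the paper's actual implementation.

```python
import numpy as np

def inpaint_uv(mask, annot_yx, annot_uv):
    """Fill (U, V) values at every mask pixel from sparse annotations
    via nearest-neighbour lookup -- a crude stand-in for the learned
    'inpainting' network described in the paper."""
    ys, xs = np.nonzero(mask)
    pix = np.stack([ys, xs], axis=1).astype(float)        # (P, 2) mask pixels
    d = np.linalg.norm(pix[:, None] - annot_yx[None].astype(float), axis=2)
    nearest = d.argmin(axis=1)                            # closest annotation per pixel
    dense = np.zeros(mask.shape + (2,))
    dense[ys, xs] = annot_uv[nearest]
    return dense

mask = np.ones((4, 4), dtype=bool)                        # toy body-region mask
annot_yx = np.array([[0, 0], [3, 3]])                     # two annotated pixels
annot_uv = np.array([[0.1, 0.2], [0.9, 0.8]])             # their (U, V) targets
dense = inpaint_uv(mask, annot_yx, annot_uv)              # dense (4, 4, 2) field
```

A learned inpainter can of course produce much smoother and anatomically consistent fields than this nearest-neighbour fill, which only conveys the shape of the problem.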

Experiments are conducted with both fully convolutional networks and region-based models; on observing the superiority of the latter, accuracy is further improved through cascading, yielding a system that delivers highly accurate results in real time.
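Cascading means feeding each stage's prediction back in as input to the next stage so it can refine the estimate. The toy sketch below shows only this data flow with linear stages; the actual cascade in the paper uses convolutional heads, so every name and shape here is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def cascade_stage(features, prev_pred, w):
    """One refinement stage: the previous stage's prediction is
    concatenated with the features and mapped to a new prediction
    (toy linear stage standing in for a convolutional head)."""
    return np.concatenate([features, prev_pred], axis=-1) @ w

features = rng.normal(size=(5, 8))            # e.g. per-pixel CNN features
pred = np.zeros((5, 2))                       # initial (U, V) estimate
for _ in range(3):                            # three cascaded stages
    w = rng.normal(size=(10, 2)) * 0.1        # toy per-stage weights
    pred = cascade_stage(features, pred, w)   # each stage sees the last output
```

The key design point is that later stages condition on earlier outputs rather than starting from scratch, which is what lets cascading trade a small amount of extra computation for accuracy.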

Insight into the work and methodology:

As mentioned before, this work aims to push the envelope of human understanding in images by establishing dense correspondences from a 2D image to a 3D, surface-based representation of the human body. This task subsumes several other problems, such as pose estimation, object detection, and part and instance segmentation, either as special cases or prerequisites.

Addressing this task could also be a stepping stone towards general 3D-based object understanding and has applications in problems that require going beyond plain landmark localization, such as augmented reality, graphics, or human-computer interaction.

The task of establishing a dense correspondence from an image to a surface-based model has mostly been addressed in settings where a depth sensor is available. By contrast, here the input is a single RGB image, from which a correspondence between surface points and image pixels is established.

While many current works aim at general object categories, the work presented here focuses on arguably the most important visual category: humans. For humans specifically, the task can be simplified by exploiting parametric deformable surface models.

But how does this differ from others?

This methodology differs from prior work in that the authors take a full-blown supervised learning approach, gathering ground-truth correspondences between images and a detailed, accurate parametric surface model of the human body.

A summarization of their contributions:

  • Firstly, the introduction of the first manually-collected ground-truth dataset for the task, obtained by gathering dense correspondences between the SMPL model and persons appearing in the COCO dataset.

  • Secondly, the use of the resulting dataset to train CNN-based systems that deliver dense correspondences ‘in the wild’ by regressing body-surface coordinates at every image pixel. Experiments are carried out both with fully-convolutional architectures, relying on DeepLab, and with region-based systems, relying on Mask-RCNN, and the region-based models are observed to be superior.
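The per-pixel outputs described above can be summarized as: a body-part classification for each pixel, plus part-specific (U, V) surface coordinates. The sketch below shows only those output shapes with random toy weights; the head name, feature sizes, and linear mapping are assumptions, not the paper's architecture.

```python
import numpy as np

def densepose_head(roi_feat, w_cls, w_uv, n_parts):
    """Toy per-pixel head over an RoI feature map: classify each pixel
    into one of n_parts body parts (plus background) and regress
    part-specific (U, V) coordinates. Shapes only; weights are random."""
    H, W, C = roi_feat.shape
    logits = roi_feat @ w_cls                 # (H, W, n_parts + 1)
    uv = roi_feat @ w_uv                      # (H, W, 2 * n_parts)
    part = logits.argmax(axis=-1)             # predicted part index per pixel
    return part, uv.reshape(H, W, n_parts, 2)

rng = np.random.default_rng(1)
n_parts, C = 24, 16                           # 24 surface parts, toy feature size
feat = rng.normal(size=(14, 14, C))           # toy RoI-aligned feature map
part, uv = densepose_head(feat,
                          rng.normal(size=(C, n_parts + 1)),
                          rng.normal(size=(C, 2 * n_parts)),
                          n_parts)
```

At test time, each pixel's predicted part index selects which of the part-specific (U, V) channels to read out, giving a point on the body surface for every human pixel.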

  • Thirdly, different ways of exploiting the constructed ground-truth information are explored. The supervision signal is defined over a randomly chosen subset of image pixels per training sample.
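Supervising only a subset of pixels amounts to evaluating the regression loss exclusively at the annotated locations and letting the rest of the image contribute nothing. A minimal sketch of such a sparse-point loss, with an L1 penalty chosen purely for illustration:

```python
import numpy as np

def sparse_uv_loss(pred_uv, gt_yx, gt_uv):
    """L1 loss evaluated only at annotated pixel locations;
    unannotated pixels contribute nothing to the supervision signal."""
    picked = pred_uv[gt_yx[:, 0], gt_yx[:, 1]]   # gather predictions at GT points
    return np.abs(picked - gt_uv).mean()

pred = np.zeros((4, 4, 2))                       # dummy network output
gt_yx = np.array([[0, 0], [1, 1]])               # annotated pixel coordinates
gt_uv = np.array([[0.5, 0.5], [1.0, 0.0]])       # annotated (U, V) targets
loss = sparse_uv_loss(pred, gt_yx, gt_uv)        # mean absolute error = 0.5
```

This is what makes the sparse manual annotations usable directly: gradients flow only through the pixels a human actually labelled.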

The experiments conducted here, though leaving room for improvement, indicate that dense human pose estimation is to a large extent feasible. The paper concludes with qualitative results and future directions that show the potential of the method.


In this work, the task of dense human pose estimation has been tackled using discriminatively trained models. A large-scale dataset of ground-truth image-surface correspondences, DensePose-COCO, is introduced, and novel architectures are developed that recover highly accurate dense correspondences between images and the body surface at multiple frames per second.

It is therefore anticipated that this will not only pave the way for downstream tasks in augmented reality or graphics, but also help tackle the general problem of associating images with semantic 3D object representations.

For more insight into the paper, one can go through the links given below:

Link to the PDF: Click Here.


DensePose: Dense Human Pose Estimation In The Wild

Video Source: Alp Guler