Learning from Simulated and Unsupervised Images through Adversarial Training

Jan. 31, 2018, 9:47 a.m. By: Kirti Bakshi

Adversarial Training

With recent progress in graphics, it has become more tractable to train models on synthetic images, potentially avoiding the need for expensive annotations. However, learning from synthetic images may not achieve the desired performance because of a gap between synthetic and real image distributions. To reduce this gap, the paper proposes Simulated+Unsupervised (S+U) learning, where the task is to learn a model that improves the realism of a simulator’s output using unlabeled real data, while preserving the annotation information from the simulator. The paper develops a method for S+U learning that uses an adversarial network similar to Generative Adversarial Networks (GANs), but with synthetic images as inputs instead of random vectors.

The paper also makes several key modifications to the standard GAN algorithm to preserve annotations, avoid artifacts, and stabilize training:

  • A ‘self-regularization’ term,

  • A local adversarial loss,

  • Updating the discriminator using a history of refined images.
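The third modification, updating the discriminator with a history of refined images, can be illustrated with a small buffer. This is a minimal sketch, not the authors' implementation. It assumes that each discriminator batch takes half of its images from the current refiner output and half from a buffer of earlier refined images, and that new images then randomly overwrite old buffer entries. The class name `ImageHistoryBuffer`, its parameters, and the use of plain strings as stand-ins for images are hypothetical.

```python
import random


class ImageHistoryBuffer:
    """Keeps a bounded pool of previously refined images (hypothetical helper)."""

    def __init__(self, capacity, batch_size, rng=None):
        # Assumes batch_size // 2 <= capacity.
        self.capacity = capacity
        self.batch_size = batch_size
        self.rng = rng or random.Random()
        self.images = []

    def discriminator_batch(self, refined_batch):
        """Mix current refined images with older ones, then update the buffer."""
        half = self.batch_size // 2
        if len(self.images) >= half:
            # Half of the batch comes from the history, half from the current refiner.
            old = self.rng.sample(self.images, half)
            batch = list(refined_batch[: self.batch_size - half]) + old
        else:
            # Not enough history yet: use the current refined images only.
            batch = list(refined_batch)
        self._store(list(refined_batch[:half]))
        return batch

    def _store(self, new_images):
        # Fill free slots first, then overwrite randomly chosen distinct slots.
        free = self.capacity - len(self.images)
        self.images.extend(new_images[:free])
        overflow = new_images[free:]
        slots = self.rng.sample(range(self.capacity), len(overflow))
        for slot, image in zip(slots, overflow):
            self.images[slot] = image


buffer = ImageHistoryBuffer(capacity=4, batch_size=4, rng=random.Random(0))
first = buffer.discriminator_batch(["r0", "r1", "r2", "r3"])
second = buffer.discriminator_batch(["r4", "r5", "r6", "r7"])
```

Because the discriminator keeps seeing images from earlier refiner versions, it is less likely to forget artifacts the refiner produced in the past, which is the stabilizing effect the modification aims for.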

The paper shows that this enables the generation of highly realistic images, demonstrated both qualitatively and with a user study. The authors quantitatively evaluate the generated images by training models for gaze estimation and hand pose estimation. They show a significant improvement over using synthetic images and achieve state-of-the-art results on the MPIIGaze dataset without any labelled real data.


Using real-world images to make simulated training data more useful for real-world applications.

Insight into the paper:

With the recent rise of high-capacity deep neural networks, large labelled training datasets are becoming increasingly important. However, labelling such large datasets is expensive and time-consuming. Training on synthetic instead of real images has therefore become very appealing, because the annotations are automatically available. Human pose estimation with Kinect and, more recently, a plethora of other tasks have been tackled using synthetic data.

However, learning from synthetic images can be problematic because of the gap between synthetic and real image distributions. Synthetic data is often not realistic enough, which leads the network to learn details that are only present in synthetic images and to generalize poorly to real images. One way to close this gap is to improve the simulator.

However, increasing the realism is often computationally expensive, content modelling takes a lot of hard work, and even the best algorithms may still fail to model all the characteristics of real images. This lack of realism may cause models to overfit to ‘unrealistic’ details in the synthetic images. The paper therefore proposes Simulated+Unsupervised (S+U) learning, where the goal is to improve the realism of synthetic images from a simulator using unlabeled real data.


  • The paper proposes S+U learning, which uses unlabeled real data to refine the synthetic images.

  • They train a refiner network to add realism to synthetic images using a combination of an adversarial loss and a self-regularization loss.

  • They make several key modifications to the GAN training framework to stabilize training and prevent the refiner network from producing artifacts.

  • They present qualitative, quantitative, and user study experiments showing that the proposed framework significantly improves the realism of the simulator output. They also achieve state-of-the-art results, without any human annotation effort, by training deep neural networks on the refined output images.
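The refiner objective described in the list above, an adversarial loss combined with a self-regularization loss, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code. It assumes the discriminator returns one probability per local image patch that the patch comes from a refined image, so that the adversarial term is summed over patches. It also assumes the self-regularization term is an L1 distance between the refined and synthetic images in pixel space. The function names, the flat lists standing in for images, and the weight `lam` are hypothetical.

```python
import math


def local_adversarial_loss(patch_probs):
    """Sum over local patches of -log(1 - p).

    Each p is the (assumed) discriminator probability that the patch is refined
    rather than real; the refiner is rewarded when these probabilities are low.
    """
    return -sum(math.log(1.0 - p) for p in patch_probs)


def self_regularization_loss(refined, synthetic):
    """L1 distance between refined and synthetic pixels (assumed pixel-space form)."""
    return sum(abs(r - s) for r, s in zip(refined, synthetic))


def refiner_loss(patch_probs, refined, synthetic, lam=0.1):
    """Adversarial term plus lam times the self-regularization term."""
    adversarial = local_adversarial_loss(patch_probs)
    regularization = self_regularization_loss(refined, synthetic)
    return adversarial + lam * regularization


# Toy example: two patches, two pixels.
loss = refiner_loss([0.5, 0.5], refined=[1.0, 2.0], synthetic=[1.0, 1.5], lam=0.1)
```

The self-regularization term keeps the refined image close to its synthetic source, which is what allows the simulator's annotations to remain valid after refinement.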


They evaluate the method on the MPIIGaze dataset for appearance-based gaze estimation in the wild, and on the NYU hand pose dataset of depth images for hand pose estimation. A fully convolutional refiner network with ResNet blocks is used for all of the experiments.

  • Appearance-based Gaze Estimation.

  • Hand Pose Estimation from Depth Images.

  • Ablation Study.

Conclusions and Future Work:

This paper proposed Simulated+Unsupervised learning to add realism to the simulator while preserving the annotations of the synthetic images. It also described SimGAN, the authors’ method for S+U learning, which uses an adversarial network, and demonstrated state-of-the-art results without any labelled real data. In future, the authors intend to explore modelling the noise distribution in order to generate more than one refined image for each synthetic image, and to investigate refining videos rather than single images.

For More Information: GitHub

Link to the PDF: Click Here