Simple Tensorflow Implementation Of Multimodal Unsupervised Image-To-Image Translation (MUNIT)

April 26, 2018, 9:21 p.m. By: Kirti Bakshi


In the field of computer vision, Unsupervised image-to-image translation is an important and challenging problem. Given an image in the source domain, the goal is to, without seeing any examples of corresponding image pairs learn the conditional distribution of corresponding images in the target domain. While this conditional distribution is inherently multimodal, the existing approaches, by modelling it as a deterministic one-to-one mapping make an assumption that is overly simplified. As a result, they fail to, from a given source domain image generate diverse outputs. To address this limitation, this paper proposes a Multimodal Unsupervised Image-to-image Translation (MUNIT) framework.

It is assumed that the image representation can into a domain-invariant content code be decomposed, and a style code that captures properties that are domain-specific. To translate an image to another domain, its content code is recombined with a random style code sampled from the style space of the target domain. The framework that is proposed here is analyzed and establishes several theoretical results. Further, the advantage of the proposed framework is demonstrated through Extensive experiments with comparisons to state-of-the-art approaches. Moreover, this framework allows users to by providing an example style image control the style of translation outputs.

More Information:

In computer vision, many problems aim at translating images from one domain to another, that include:

  • Inpainting

  • Super-resolution

  • Attribute transfer

  • Atyle transfer

  • Colourization

This cross-domain image-to-image translation setting has therefore received significant attention. When the dataset contains paired examples, this problem can be approached by a conditional generative model or a simple regression model. In this work, when such supervision is unavailable the focus is on the much more challenging setting. The cross-domain mapping of interest is multimodal in many cases. Unfortunately, existing techniques usually assume a deterministic or unimodal mapping. As a result, they fail to capture the full distribution of possible outputs. , The network usually learns to ignore even if the model is made stochastic by injecting noise.

In this paper, a principled framework for the Multimodal Unsupervised Image-to-image Translation (MUNIT) problem is proposed. First, we assume that the latent space of images can be decomposed into a content space and a style space and then further assume that images in different domains share a common content space but not the style space. To translate an image to the target domain, later they then recombine its content code with a random style code in the target style space.

While the style code represents remaining variations that are not contained in the input image The content code encodes the information that should be preserved during translation. By sampling different style codes, our model is able to produce diverse and multimodal outputs. Extensive experiments demonstrate the effectiveness of our method in modelling multimodal output distributions and its superior image quality compared with state-of-the-art approaches. Above all, the decomposition of content and style spaces allow this framework to perform example-guided image translation, in which a user-provided example image in the target domain controls the style of the translation outputs.

Related Work:

  • Generative adversarial networks (GANs): This framework in image generation has achieved impressive results.

  • Image-to-image translation: Recent studies have also attempted to without supervision learn image translation. This problem is inherently ill-posed and requires additional constraints.

  • Style transfer: Style transfer while preserving its content, aims at modifying the style of an image which is closely related to image-to-image translation. Here, a distinction is made between example-guided style transfer, in which the target style comes from a single example, and collection style transfer, in which the target style is defined by a collection of images.

  • Learning disentangled representations: This work draws inspiration from recent works on disentangled representation learning.

Some other works focus on disentangling content from style. Although it is difficult to define content/style and different works use different definitions, “content” is referred to as the underlying spatial structure and “style” as the rendering of the structure. In this setting, there are two domains that share the same content distribution but have different style distributions.


This paper put forth a framework for multimodal unsupervised image-to-image translation. The model achieves quality and diversity superior to existing unsupervised methods and comparable to state-of-the-art supervised approach. Future work includes extending this framework to other domains, such as videos and text.

For more information and an in depth knowledge, go through the link below:

Code and pre-trained models: Click Here

For More Information: GitHub

PDF: Click Here