Video To Video Synthesis At 2K Resolution Using A Conditional GAN Framework.

Aug. 26, 2018, 12:52 p.m. By: Kirti Bakshi


This work presents a conditional Generative Adversarial Network (GAN) framework for video-to-video synthesis at 2K resolution. For example, given a video of semantic segmentation maps, the network transforms it into a photorealistic video, as shown in the video.

Video-to-Video Synthesis:

Video Source: [Ting-Chun Wang]

Note that the frames are temporally smooth. Starting from a video in some source domain, this technique synthesizes a new video in a target domain using a learned network. As a result, it provides new tools for higher-level video manipulation and synthesis.

In addition to synthesizing street scenes, the authors show their network synthesizing videos for other domains.

A few examples:

  • The network can transform edge-map videos into videos of human faces. The authors show some examples of synthesized people talking.

  • The network can also generate different people speaking, given the same input edge map. Note that the results are temporally consistent from frame to frame.

  • It can also synthesize videos of people moving, given pose information.

Semantic Manipulations:

The network can synthesize multiple different results from the same input, and the input can be manipulated to generate a desired output video. The video shows a side-by-side comparison of the input and the synthesized output to illustrate this.

Qualitative Comparisons:

Next, the method is compared with two state-of-the-art methods:

  • pix2pixHD: An image-to-image method applied directly frame by frame.

  • COVST (Coherent video style transfer): This adopts simple temporal consistency constraints and performs better than per-frame pix2pixHD, but flickering still persists.
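The flickering described in this comparison can be made concrete with a rough measure of frame-to-frame change. The sketch below is only an illustration and is not the evaluation used by the authors: the function names, the toy frames, and the exponential moving average used as a stand-in for a "simple temporal consistency constraint" are all assumptions.

```python
def mean_abs_frame_difference(frames):
    """Average absolute per-pixel change between consecutive frames.

    `frames` is a list of 2D lists of floats. This is an assumed,
    illustrative flicker measure, not a metric from the paper.
    """
    total, count = 0.0, 0
    for prev, curr in zip(frames, frames[1:]):
        for row_prev, row_curr in zip(prev, curr):
            for p, c in zip(row_prev, row_curr):
                total += abs(c - p)
                count += 1
    return total / count if count else 0.0


def smooth_over_time(frames, alpha=0.5):
    """Exponential moving average across frames.

    Hypothetical stand-in for a simple temporal consistency
    constraint; it is not the COVST algorithm.
    """
    smoothed = [frames[0]]
    for frame in frames[1:]:
        prev = smoothed[-1]
        smoothed.append([
            [alpha * c + (1 - alpha) * p for p, c in zip(row_prev, row_curr)]
            for row_prev, row_curr in zip(prev, frame)
        ])
    return smoothed


# Toy static scene: every pixel should stay at 0.5, but a per-frame
# method alternately overshoots and undershoots by 0.1.
per_frame = [
    [[0.5 + (0.1 if t % 2 == 0 else -0.1)] * 4 for _ in range(3)]
    for t in range(6)
]
smoothed = smooth_over_time(per_frame)

flicker_per_frame = mean_abs_frame_difference(per_frame)
flicker_smoothed = mean_abs_frame_difference(smoothed)
print(flicker_per_frame, flicker_smoothed)
```

In this toy case the smoothed sequence changes less between frames than the per-frame one, yet its score is still above zero, which mirrors the observation that simple constraints reduce flicker without removing it.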

Basic Technical Details:

The network is trained using a sequential generator and multi-scale discriminators.

Two types of discriminators are used:

  • One for images

  • One for video.
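As a rough illustration of how these pieces might fit together, the sketch below generates frames one at a time, conditioning each on a few previously generated frames, and then prepares what an image discriminator and a video discriminator would each look at. Everything here is an assumption made for illustration: frames are reduced to single numbers, `toy_generator` is a hypothetical stand-in for a trained network, and the window sizes `num_past` and `clip_len` are arbitrary choices, not values taken from the source.

```python
from collections import deque


def toy_generator(current_map, past_frames):
    """Hypothetical stand-in for a trained generator.

    Blends the current input with the mean of the previously
    generated frames; a real generator would be a neural network.
    """
    if not past_frames:
        return float(current_map)
    return 0.5 * current_map + 0.5 * sum(past_frames) / len(past_frames)


def synthesize_sequentially(input_maps, generator, num_past=2):
    """Generate frames in order, feeding back the last `num_past` outputs."""
    past = deque(maxlen=num_past)  # oldest frame is dropped automatically
    outputs = []
    for current_map in input_maps:
        frame = generator(current_map, list(past))
        outputs.append(frame)
        past.append(frame)
    return outputs


def image_discriminator_inputs(input_maps, frames):
    """One (input map, output frame) pair per time step."""
    return list(zip(input_maps, frames))


def video_discriminator_inputs(frames, clip_len=3):
    """Every run of `clip_len` consecutive output frames."""
    return [frames[i:i + clip_len] for i in range(len(frames) - clip_len + 1)]


input_maps = [0.0, 1.0, 1.0, 1.0]
frames = synthesize_sequentially(input_maps, toy_generator)
pairs = image_discriminator_inputs(input_maps, frames)
clips = video_discriminator_inputs(frames)
print(frames)
print(len(pairs), len(clips))
```

The split reflects the two roles named above: the image discriminator judges single frames, while the video discriminator sees short runs of consecutive frames and so can penalize changes over time that no single frame reveals.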

For more detail on how the method works, see the video and the links at the end.


A general video-to-video synthesis framework based on conditional Generative Adversarial Networks (GANs) is presented. Carefully designed generator and discriminator architectures, coupled with a spatiotemporal adversarial objective, yield photorealistic, high-resolution, temporally coherent video results on a diverse set of input formats. Extensive experiments further demonstrate that the results are significantly better than those of state-of-the-art methods.

Finally, an extension of the method to the future video prediction task outperforms several state-of-the-art competing systems.

For More Information: GitHub

Link To The PDF: Click Here