Everybody Dance Now - Motion Retargeting Video Subjects [UC Berkeley].

Aug. 29, 2018, 1:48 p.m. By: Kirti Bakshi


This paper presents a simple method for “do as I do” motion transfer between human subjects in different videos. Using pose detections as an intermediate representation between source and target, the authors learn a mapping from pose images to the target subject’s appearance.


The method takes two videos as input:

  • One of a target person whose appearance is to be synthesized,

  • The other of a source subject whose motion they wish to impose onto the target person.

Motion is transferred between these subjects via an end-to-end pixel-based pipeline. This contrasts with approaches from the last two decades, which employ nearest neighbour search or retarget motion in 3D.
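The per-frame flow of such a pipeline can be sketched roughly as below. This is only an illustrative outline, not the authors' code: the function `transfer_motion` and its four callables (`detect_pose`, `normalize_pose`, `render_pose`, `generator`) are hypothetical names standing in for a pose detector, a pose normalization step, a pose-image renderer, and the learned pose-to-appearance mapping.

```python
from typing import Callable, Iterable, List

import numpy as np

Keypoints = np.ndarray  # assumed shape (num_joints, 2), pixel (x, y) coordinates
Image = np.ndarray      # assumed shape (H, W, 3)


def transfer_motion(
    source_frames: Iterable[Image],
    detect_pose: Callable[[Image], Keypoints],
    normalize_pose: Callable[[Keypoints], Keypoints],
    render_pose: Callable[[Keypoints], Image],
    generator: Callable[[Image], Image],
) -> List[Image]:
    """Apply the pose -> appearance mapping to each source frame independently.

    All four callables are hypothetical placeholders supplied by the caller.
    """
    output = []
    for frame in source_frames:
        keypoints = detect_pose(frame)         # source frame -> 2D joints
        keypoints = normalize_pose(keypoints)  # match the target's scale and position
        pose_image = render_pose(keypoints)    # joints -> pose image
        output.append(generator(pose_image))   # pose image -> frame of the target
    return output
```

Because every step works on images or 2D coordinates, no 3D reconstruction is needed anywhere in the loop.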

With this framework, the authors create a variety of videos that enable untrained amateurs to spin like ballerinas, perform martial arts kicks, or dance like pop stars. To transfer motion between two video subjects in a frame-by-frame manner, a mapping between images of the two individuals must be learned.

The main goal is therefore to discover an image-to-image translation between the source and target sets. However, the authors do not have corresponding pairs of images of the two subjects performing the same motions with which to supervise this translation directly. Even if both subjects perform the same routine, an exact frame-to-frame body-pose correspondence is still very unlikely because of the body shape and stylistic differences unique to each subject.
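Pose detections, the intermediate representation mentioned earlier, sidestep the missing pairs: a pose image can be computed from any frame of either subject, so each target frame supplies its own (pose image, frame) training pair. As a minimal sketch of what "pose image" could mean, the snippet below rasterizes 2D keypoints into a stick figure. The five-joint skeleton in `LIMBS`, the function name `render_stick_figure`, the binary output image, and the use of NaN for missing detections are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

# Hypothetical 5-joint skeleton: head, neck, hip, left foot, right foot.
LIMBS = [(0, 1), (1, 2), (2, 3), (2, 4)]


def render_stick_figure(keypoints, height, width, limbs=LIMBS):
    """Rasterize 2D keypoints (x, y) into a binary (height, width) pose image.

    Joints containing NaN are treated as missing detections; limbs that
    touch a missing joint are skipped.
    """
    canvas = np.zeros((height, width), dtype=np.uint8)
    for a, b in limbs:
        p, q = keypoints[a], keypoints[b]
        if np.isnan(p).any() or np.isnan(q).any():
            continue
        # Sample enough points along the segment to leave no gaps.
        steps = int(np.ceil(np.abs(q - p).max())) + 1
        xs = np.rint(np.linspace(p[0], q[0], steps)).astype(int)
        ys = np.rint(np.linspace(p[1], q[1], steps)).astype(int)
        inside = (xs >= 0) & (xs < width) & (ys >= 0) & (ys < height)
        canvas[ys[inside], xs[inside]] = 255
    return canvas
```

The same rendering is applied to both subjects, which is what lets a model trained only on the target's footage accept poses extracted from the source.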

The proposed method produces videos in which motion is transferred between a variety of video subjects without the need for expensive 3D or motion capture data.

The main contributions are:

  • A learning-based pipeline for human motion transfer between videos,

  • The quality of the results, which demonstrate complex motion transfer in realistic and detailed videos.

The authors also conduct an ablation study on the components of the model and compare against a baseline framework.


  • Data Collection

  • Network Architecture

For more information on the experiments and implementation, refer to the link at the end.


Summing up, the model can create reasonable and arbitrarily long videos of a target person dancing, given body movements to follow from an input video of another subject dancing. Although the setup produces plausible results in many cases, the results occasionally suffer from several issues.

Errors occur particularly in transfer videos when the input motion or motion speed differs from the movements seen at training time. However, even when the target subject attempts to copy a dance from a source subject in the training sequence, the results still show some jittering and shakiness when the source motion is transferred onto the target.

Since the normalized poses used for transfer are often similar to those seen in training, the authors attribute this observation to an underlying difference in how the target and transfer subjects move, given their unique body structures. In this sense, they believe that motion is tied to identity, and that this identity is still present in the pose detections.

Although the proposed method for global pose normalization resizes the movements of any source subject reasonably to match the scale and location of the target person as seen in training, the simple scale-and-translate solution does not account for different limb lengths and camera positions or angles. These discrepancies also contribute to a wider gap between the motion seen in training and at test time.
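To make the scale-and-translate idea concrete, here is a deliberately simplified sketch. It assumes a single reference body height and a single reference ankle position per subject, both supplied by the caller; the paper's actual normalization procedure is more detailed, and the function name `normalize_pose` and its parameters are hypothetical.

```python
import numpy as np


def normalize_pose(source_kp, source_height, source_ankle,
                   target_height, target_ankle):
    """Scale and translate source keypoints into the target's frame.

    source_kp: (num_joints, 2) array of (x, y) pixel coordinates.
    *_height:  reference body height in pixels for each subject (assumed given).
    *_ankle:   reference ankle position (x, y) for each subject (assumed given).
    """
    scale = target_height / source_height
    source_ankle = np.asarray(source_ankle, dtype=float)
    target_ankle = np.asarray(target_ankle, dtype=float)
    # Scale about the source ankle, then move that anchor onto the target ankle.
    return (np.asarray(source_kp, dtype=float) - source_ankle) * scale + target_ankle
```

Note that one scale factor is applied to every joint, so relative limb proportions pass through unchanged. That is the limitation described above: a uniform scale and shift cannot correct for different limb lengths or a different camera angle.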

Additionally, working with 2D coordinates and missing detections constrains the number of ways motion can be retargeted between subjects, whereas retargeting methods often work in 3D with perfect joint locations and temporally coherent motions. To address these issues, more work is needed on temporally coherent video generation and on representations of human motion.

Despite these challenges, the method is able to produce compelling videos when given a variety of inputs.

Link To The PDF: Click Here

Everybody Dance Now:

Video Source: Caroline Chan