The core idea is a self-supervised approach to learning representations of the relationships between humans and their environment, including object interactions, attributes, and body pose, entirely from unlabeled videos recorded from multiple viewpoints.
An embedding is trained with a triplet loss that contrasts a pair of simultaneous frames from different viewpoints with temporally adjacent, visually similar frames.
The model is called a Time-Contrastive Network (TCN). The contrastive signal encourages it to discover meaningful dimensions and attributes that explain the changing state of objects and the world across visually similar frames, while also learning invariance to viewpoint, motion blur, occlusions, background, and lighting.
In other words, as the video shows, robots can learn new tasks by watching a single third-person demonstration by a human, together with an unstructured, unlabeled collection of videos. Beyond that single demonstration, no supervision is provided to the system. The approach is demonstrated on a diverse set of tasks with real and simulated robots:
A pouring task
A dish replacement task
A pose imitation task
The approach has two basic steps:
The First Step: Learning Representations:
The first step is to learn representations from video, using time as a supervision signal. Multiple synchronized viewpoints of the same scene serve as a rich signal for discovering different attributes of the world. As mentioned above, the embedding is trained on a collection of unstructured, unlabeled videos.
These videos contain positive demonstrations of the tasks, but also random interactions, in order to cover a general set of possible states of the world. The model shown in the video is trained with a triplet loss on multi-viewpoint observations: co-occurring frames from multiple viewpoints are attracted to each other in the embedding space, while visually similar frames from nearby time steps in the same video are pushed apart.
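The attract-and-repel objective described above can be sketched as a hinge-style triplet loss. This is only a minimal illustration, not the authors' implementation: the function name `triplet_loss`, the use of squared Euclidean distance, and the margin value of 0.2 are assumptions made for the example. The anchor and positive would be embeddings of simultaneous frames from two viewpoints, and the negative an embedding of a nearby frame from the same video as the anchor.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on squared Euclidean distances.

    Each argument is an array of shape (batch, embedding_dim).
    The margin value is an illustrative assumption.
    """
    anchor = np.asarray(anchor, dtype=float)
    positive = np.asarray(positive, dtype=float)
    negative = np.asarray(negative, dtype=float)

    # Distance from the anchor to the simultaneous frame from another viewpoint.
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    # Distance from the anchor to the temporally nearby frame from the same video.
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)

    # The loss is zero once the negative is farther than the positive by the margin.
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))
```

A triplet contributes no loss once its negative lies at least `margin` farther from the anchor than its positive does, so training effort goes to the triplets that still violate that ordering.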
Training this way encourages the embedding to be invariant to viewpoint but sensitive to semantic cues that indicate time, such as whether liquid is pouring into a cup. A time-contrastive model can also be trained on a single view. In that case, the positive frame is randomly selected within a certain range of the anchor, and a margin range is then computed from the positive range. Negatives are randomly chosen outside the margin range, and the model is trained as before.
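The single-view sampling scheme can be sketched as follows. The text does not say how the margin range is computed from the positive range, so this example assumes a simple multiple of it. The function name `sample_single_view_triplet` and the parameters `positive_range` and `margin_multiplier` are hypothetical.

```python
import random


def sample_single_view_triplet(num_frames, positive_range=3,
                               margin_multiplier=2, rng=random):
    """Pick (anchor, positive, negative) frame indices from one video.

    Assumption: the margin range is positive_range * margin_multiplier.
    """
    margin_range = positive_range * margin_multiplier
    if positive_range < 1 or num_frames < 2 * margin_range + 2:
        raise ValueError("video too short for the requested ranges")

    anchor = rng.randrange(num_frames)

    # Positives: frames within positive_range of the anchor (anchor excluded).
    low = max(0, anchor - positive_range)
    high = min(num_frames - 1, anchor + positive_range)
    positives = [t for t in range(low, high + 1) if t != anchor]

    # Negatives: frames strictly outside the margin range around the anchor.
    negatives = [t for t in range(num_frames) if abs(t - anchor) > margin_range]

    return anchor, rng.choice(positives), rng.choice(negatives)
```

The resulting indices would select the frames whose embeddings are fed to the same triplet loss used in the multi-view case.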
The Second Step: Learning Policies:
The second step is to use reinforcement learning to learn policies on top of the TCN embedding. Given a single third-person demonstration, a reward function is constructed that rewards following the progression of the video at the semantic level. The robot arm initially tries random motions, then learns to reuse the controls yielding the highest rewards, and finally converges to reproducing the demonstrated task.
The model converges after only 9 iterations, which is about 15 minutes of real-world training time. Similarly, in the dish-moving task the robot initially tries random motions and then learns to pick up and move a plate successfully, in particular learning to open and close the gripper at the appropriate times.
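One way to picture a reward that follows the progression of the demonstration is to compare, step by step, the embedding of the robot's camera frame with the embedding of the corresponding demonstration frame. The sketch below assumes the reward is simply the negative squared distance between the two. This is a simplification chosen for illustration, and the function name `imitation_rewards` is hypothetical.

```python
import numpy as np


def imitation_rewards(demo_embeddings, robot_embeddings):
    """Per-time-step reward: negative squared distance in embedding space.

    Both inputs have shape (time_steps, embedding_dim), aligned in time.
    Assumption: a plain squared-distance penalty stands in for the reward.
    """
    demo = np.asarray(demo_embeddings, dtype=float)
    robot = np.asarray(robot_embeddings, dtype=float)
    if demo.shape != robot.shape:
        raise ValueError("demonstration and robot sequences must be aligned")

    # The reward is highest (zero) when the robot's frame embeds exactly
    # where the demonstration frame for the same time step does.
    return -np.sum((robot - demo) ** 2, axis=-1)
```

Because the embedding is trained to be viewpoint-invariant, a reward of this kind can compare a third-person human demonstration with the robot's own camera view.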
We also show another approach to robot control, for the task of human pose imitation. As before, the robot learns an invariant TCN embedding by observing humans and itself, without any correspondence labels. The robot then learns to control its own body by training to output its internal state when given an image of itself.
Because the TCN embedding is invariant between robots and humans, the robot can then imitate humans instead of imitating itself. The resulting imitation is shown on a real robot, and the robot was able to discover the mapping between its own body and a human body entirely on its own, using TCN and self-regression.
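The self-regression idea can be sketched as fitting a decoder from the embedding of the robot's own image to its known joint state, then applying that decoder to the embedding of a human image. The example below uses a linear least-squares fit as a stand-in for whatever regressor is actually trained. The function names `fit_joint_decoder` and `predict_joints` are hypothetical.

```python
import numpy as np


def _with_bias(embeddings):
    """Append a constant column so the linear decoder can learn an offset."""
    x = np.atleast_2d(np.asarray(embeddings, dtype=float))
    return np.hstack([x, np.ones((x.shape[0], 1))])


def fit_joint_decoder(robot_embeddings, robot_joints):
    """Fit embedding -> joint state on the robot's own images.

    Assumption: a linear least-squares map stands in for the learned decoder.
    """
    targets = np.asarray(robot_joints, dtype=float)
    weights, *_ = np.linalg.lstsq(_with_bias(robot_embeddings), targets, rcond=None)
    return weights


def predict_joints(weights, embeddings):
    """Decode joint states from embeddings of robot or human images."""
    return _with_bias(embeddings) @ weights
```

The decoder is fit only on the robot's own data, where the joint state is known. It can be applied to human frames only because the embedding places matching human and robot poses close together.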
Time-Contrastive Networks: Self-Supervised Learning from Video:
Video Source: Pierre Sermanet