A Scalable Meta-Learning Algorithm Released by OpenAI: "Reptile" (Includes an Interactive Tool to Test It On-Site)
This paper considers meta-learning problems, where there is a distribution of tasks and we wish to obtain an agent that learns quickly when presented with a previously unseen task sampled from that distribution. To that end, the authors present Reptile: a remarkably simple meta-learning algorithm that learns a parameter initialization which can be fine-tuned quickly on a new task.
Reptile works by repeatedly sampling a task, performing stochastic gradient descent on it, and updating the initial parameters towards the final parameters learned on that task. The method performs as well as MAML, a broadly applicable meta-learning algorithm, while being more computationally efficient and simpler to implement.
But What Is Meta-Learning?
Meta-learning is the process of learning how to learn. A meta-learning algorithm takes in a distribution of tasks, where each task is a learning problem, and produces a quick learner that can generalize well from even a small number of examples.
The paper shows that Reptile performs well on some well-established benchmarks for few-shot classification, and provides a theoretical analysis aimed at understanding why Reptile works.
How Does Reptile Work?
Reptile works by repeatedly sampling a task, training on it, and moving the initialization towards the trained weights on that task. Unlike MAML, which also learns an initialization, Reptile does not differentiate through the optimization process: it simply performs stochastic gradient descent (SGD) on each task in the standard way, without unrolling a computation graph or calculating any second derivatives. This makes Reptile more suitable for optimization problems that require many update steps, and it takes less computation and memory than MAML.
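To make that loop concrete, here is a minimal NumPy sketch of the outer update described above. The `sample_task` interface and the hyperparameter values are illustrative assumptions, not the names used in OpenAI's release.

```python
import numpy as np

# A minimal sketch of the Reptile meta-update, assuming a `sample_task()`
# callable that returns an object exposing `grad(w)`, the loss gradient
# for that task. These names are illustrative, not from OpenAI's code.

def inner_sgd(phi, task, inner_steps=5, inner_lr=0.02):
    """Plain SGD on one task, starting from the initialization phi."""
    w = phi.copy()
    for _ in range(inner_steps):
        w -= inner_lr * task.grad(w)   # ordinary first-order step; no unrolling
    return w

def reptile(sample_task, dim, meta_steps=1000, meta_lr=0.1):
    phi = np.zeros(dim)                # the meta-learned initialization
    for _ in range(meta_steps):
        task = sample_task()
        w = inner_sgd(phi, task)       # train normally on the sampled task
        phi += meta_lr * (w - phi)     # move initialization toward trained weights
    return phi
```

Note that the outer update never touches the inner optimizer's internals; it only needs the weights before and after task training, which is why no second derivatives are required.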
To analyze why Reptile works, the update is approximated using a Taylor series. This analysis shows that the Reptile update maximizes the inner product between gradients of different mini-batches from the same task, which corresponds to improved generalization on that task. Outside of the meta-learning setting, this finding may have implications for explaining the generalization properties of SGD. The analysis also suggests that Reptile and MAML perform a very similar update, containing the same two leading terms with different weights.
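As a rough sketch of that two-term structure, for two inner SGD steps with step size α: one term (call it AvgGrad) minimizes the expected task loss, and the other (AvgGradInner) maximizes the inner product between mini-batch gradients. The coefficients below reflect one reading of the paper's expansion and should be checked against the original:

```latex
\mathbb{E}\left[g_{\mathrm{MAML}}\right]    \approx \mathrm{AvgGrad} - 2\alpha\,\mathrm{AvgGradInner}, \qquad
\mathbb{E}\left[g_{\mathrm{Reptile}}\right] \approx 2\,\mathrm{AvgGrad} - \alpha\,\mathrm{AvgGradInner}.
```

The key point is that both expected updates are weighted combinations of the same two terms, so both methods push the initialization in qualitatively similar directions.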
Experiments Conducted:
The experiments show that Reptile and MAML yield similar performance for few-shot classification on the Omniglot and Mini-ImageNet benchmarks. Reptile also converges to the solution faster, since its update has lower variance.
Implementations:
Their implementation of Reptile is available on GitHub via the link below. It uses TensorFlow for the computations involved, and includes code for replicating the experiments on Omniglot and Mini-ImageNet. A smaller JavaScript implementation, which fine-tunes a model pre-trained with TensorFlow, will also be released soon.
Discussion:
In meta-learning problems, it is assumed that we have access to a training set of tasks, which is used to train a fast learner. The authors describe a surprisingly simple approach to meta-learning, which works by repeatedly optimizing on a single task and moving the parameter vector towards the parameters learned on that task. This algorithm performs similarly to MAML while being significantly simpler to implement.
The paper presents two theoretical explanations for why Reptile works:
- First, by approximating the update with a Taylor series, it is shown that the key leading-order term matches the gradient from MAML [FAL17]. This term adjusts the initial weights to maximize the dot product between the gradients of different mini-batches on the same task, i.e., it encourages the gradients to generalize between mini-batches of the same task.
- Second, there is a more informal argument: Reptile finds a point that is close (in Euclidean distance) to all of the optimal solution manifolds of the training tasks. A toy sketch of this intuition follows below.
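As a toy illustration of that Euclidean-distance intuition (a constructed example, not an experiment from the paper), consider tasks whose losses are quadratic bowls, so each task's "solution manifold" degenerates to a single point. Reptile should then pull the initialization towards the point with minimal average distance to all task optima, i.e. the mean of the targets:

```python
import numpy as np

# Toy demo: each task's loss is 0.5 * ||w - t||^2 for a task-specific
# optimum t, so the optimal-solution manifold of a task is the point t.
# Reptile should drive the initialization phi toward the mean of the
# targets, the point closest (on average) to all task optima.

rng = np.random.default_rng(0)
targets = rng.normal(size=(10, 2))           # one optimum per training task

phi = np.zeros(2)
inner_lr, inner_steps, total = 0.25, 5, 4000
for i in range(total):
    meta_lr = 0.1 * (1 - i / total)          # linearly annealed outer step
    t = targets[rng.integers(len(targets))]  # sample a training task
    w = phi.copy()
    for _ in range(inner_steps):
        w -= inner_lr * (w - t)              # SGD on 0.5 * ||w - t||^2
    phi += meta_lr * (w - phi)               # Reptile outer update

print("Reptile initialization:", phi)
print("Mean of task optima:   ", targets.mean(axis=0))  # should roughly agree
```

In the general case treated by the paper, each inner-loop run moves towards a whole manifold of optimal solutions rather than a single point, but the same averaging intuition applies.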
For More Information (Implementation): GitHub
Link To The PDF: Click Here