Horovod: An Open Source Distributed Deep Learning Framework for TensorFlow by Uber
We all know the tremendous progress that Deep Learning has made over the past few years. Whether it is image processing, forecasting or speech recognition, it has worked its magic everywhere. And Uber? It applies this knowledge across its business, in areas that range from self-driving research to trip forecasting and fraud prevention. Deep learning has helped its engineers and data scientists create better experiences for all their users.
TensorFlow, an open source software library for machine intelligence, is becoming an increasingly preferred deep learning library at Uber for a number of reasons. To begin with, it is one of the most widely used open source frameworks for deep learning, which makes it easy for new users to get started. It combines high performance with access to low-level model details. In addition, TensorFlow offers end-to-end support for a wide variety of deep learning use cases, from conducting research to deploying models in production on cloud servers, or even in self-driving vehicles.
In September, Uber Engineering pulled back the curtain on Michelangelo, an internal machine-learning-as-a-service platform that aims to make machine learning accessible and to make these systems easier to build and deploy.
Now comes Horovod, an open source distributed training framework for TensorFlow. It is a component of Michelangelo's deep learning toolkit, and its goal is to make distributed deep learning fast and easy to use. How did it come about? After trying out various methods to resolve the challenges they were facing with distributed training, Uber's engineers decided to work on an implementation of their own that addressed Uber's needs. They adopted Baidu's draft implementation of the TensorFlow ring-allreduce algorithm and converted the code into a stand-alone Python package named Horovod. The name comes from a Russian folk dance in which the dancers perform with their arms linked in a circle, much like the way distributed TensorFlow processes use Horovod to communicate with each other.
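To make the ring-allreduce idea concrete, here is a minimal single-process simulation in plain Python. It is only a sketch of the general algorithm, not Baidu's or Horovod's code: the function name ring_allreduce and the use of Python lists in place of GPU buffers are assumptions made for illustration. Each simulated worker passes one chunk of its gradient vector to its neighbour in the ring per step, first to sum the chunks and then to share the finished sums.

```python
def ring_allreduce(worker_grads):
    """Simulate a sum ring-allreduce over equal-length gradient vectors.

    worker_grads: one list of numbers per simulated worker.
    Returns a new list of vectors; every worker ends with the elementwise sum.
    """
    n = len(worker_grads)
    length = len(worker_grads[0])
    # Split every vector into n contiguous chunks (some may be empty).
    bounds = [(i * length // n, (i + 1) * length // n) for i in range(n)]
    data = [list(g) for g in worker_grads]  # copy so the input is not mutated

    # Phase 1: scatter-reduce. In step s, worker r sends chunk (r - s) % n
    # to worker (r + 1) % n, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        # Snapshot all outgoing messages first: sends in a step are simultaneous.
        messages = []
        for rank in range(n):
            chunk = (rank - step) % n
            lo, hi = bounds[chunk]
            messages.append((chunk, data[rank][lo:hi]))
        for rank, (chunk, payload) in enumerate(messages):
            dst = (rank + 1) % n
            lo, _ = bounds[chunk]
            for i, value in enumerate(payload):
                data[dst][lo + i] += value
    # Now worker r holds the fully summed chunk (r + 1) % n.

    # Phase 2: allgather. In step s, worker r sends chunk (r + 1 - s) % n
    # to worker (r + 1) % n, which overwrites its copy with the finished sum.
    for step in range(n - 1):
        messages = []
        for rank in range(n):
            chunk = (rank + 1 - step) % n
            lo, hi = bounds[chunk]
            messages.append((chunk, data[rank][lo:hi]))
        for rank, (chunk, payload) in enumerate(messages):
            dst = (rank + 1) % n
            lo, hi = bounds[chunk]
            data[dst][lo:hi] = payload
    return data


# Three simulated workers, each with its own local gradient vector.
summed = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(summed[0])
```

Each worker sends and receives 2 x (n - 1) chunks in total, and every chunk is roughly 1/n of the vector, so the data each worker moves stays close to constant as more workers join the ring.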
They then replaced Baidu's ring-allreduce implementation with NCCL (the NVIDIA Collective Communications Library), which provides a highly optimized version of ring-allreduce. Soon after, NCCL 2, with its ability to run ring-allreduce across multiple machines, let them take advantage of its many performance optimizations. They also added support for models that fit inside a single server, potentially on multiple GPUs. Finally, they made several improvements to the API, inspired by feedback from their initial users. The new API brought the number of operations that a user has to introduce into a single-GPU program down to four.
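The article does not list the four operations, so the following is a hedged sketch based on Horovod's TensorFlow 1.x-style API: initialize Horovod, pin one GPU per process, wrap the optimizer, and broadcast the initial variables from rank 0. The toy linear model, the learning rate, the step count and the function name train_with_horovod are all hypothetical and only there to give the four calls something to act on.

```python
def train_with_horovod(num_steps=100):
    """Toy training loop showing the four Horovod additions (TF1-style API)."""
    # Imported inside the function so the sketch loads without these packages;
    # running it needs TensorFlow and Horovod installed.
    import tensorflow.compat.v1 as tf
    import horovod.tensorflow as hvd

    tf.disable_v2_behavior()

    # 1. Initialize Horovod.
    hvd.init()

    # 2. Pin each process to a single GPU, chosen by its local rank.
    config = tf.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())

    # Hypothetical toy model: fit y = 3x + 1 with one weight and one bias.
    x = tf.random_normal([32, 1])
    y = 3.0 * x + 1.0
    w = tf.get_variable("w", shape=[1, 1])
    b = tf.get_variable("b", shape=[1], initializer=tf.zeros_initializer())
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w) + b - y))

    # 3. Wrap the optimizer so gradients are averaged across workers with
    #    allreduce. Scaling the learning rate by hvd.size() is a common
    #    convention, assumed here rather than taken from the article.
    opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
    opt = hvd.DistributedOptimizer(opt)
    global_step = tf.train.get_or_create_global_step()
    train_op = opt.minimize(loss, global_step=global_step)

    # 4. Broadcast initial variable values from rank 0 to every other process
    #    so that all workers start from the same state.
    hooks = [
        hvd.BroadcastGlobalVariablesHook(0),
        tf.train.StopAtStepHook(last_step=num_steps),
    ]

    with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
        while not sess.should_stop():
            sess.run(train_op)
```

A script that calls train_with_horovod() would typically be started through Horovod's launcher, for example horovodrun -np 4 python train.py, where train.py is a hypothetical file name and 4 is the number of processes.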
Uber uses this software to train deep learning models on hundreds of GPUs, for research that ultimately guides self-driving cars, fraud detection and image classification. Training deep learning models involves feedback among nodes, and hence requires all the nodes to be operational at the same time. When a developer uses Horovod, a library call is included within the program and, at run time, a software agent launches the number of copies required to run the application.
One of Horovod's distinctive properties is its ability to interleave communication and computation, coupled with the ability to batch small allreduce operations, which results in improved performance. This batching feature is called Tensor Fusion.
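The batching idea behind Tensor Fusion can be sketched in a few lines of plain Python. This is an illustration of the concept only, not Horovod's implementation: the function names, the buffer_capacity parameter and the use of lists of numbers in place of real tensors are all assumptions. Small tensors are packed into one buffer until it is full, the whole buffer is reduced in a single call, and the result is unpacked again.

```python
def plan_fusion(tensors, buffer_capacity):
    """Group (name, values) pairs into batches that fit buffer_capacity elements."""
    batches, current, used = [], [], 0
    for name, values in tensors:
        # Start a new batch when the next tensor would overflow the buffer.
        if current and used + len(values) > buffer_capacity:
            batches.append(current)
            current, used = [], 0
        current.append((name, values))
        used += len(values)
    if current:
        batches.append(current)
    return batches


def fused_allreduce(per_worker_tensors, buffer_capacity):
    """Sum same-named tensors across workers, one reduction call per fused batch.

    per_worker_tensors: for each worker, a list of (name, values) pairs in the
    same order and with the same lengths on every worker.
    Returns (dict of summed tensors, number of reduction calls made).
    """
    reduced, calls = {}, 0
    for batch in plan_fusion(per_worker_tensors[0], buffer_capacity):
        names = [name for name, _ in batch]
        # Pack: copy every tensor of this batch into one flat buffer per worker.
        buffers = []
        for worker in per_worker_tensors:
            lookup = dict(worker)
            flat = []
            for name in names:
                flat.extend(lookup[name])
            buffers.append(flat)
        # One reduction for the whole buffer (a plain sum stands in for allreduce).
        summed = [sum(column) for column in zip(*buffers)]
        calls += 1
        # Unpack: slice the summed buffer back into the individual tensors.
        position = 0
        for name, values in batch:
            reduced[name] = summed[position:position + len(values)]
            position += len(values)
    return reduced, calls


# Two simulated workers with three small gradient tensors each.
worker_a = [("a", [1, 2]), ("b", [3, 4, 5]), ("c", [6, 7, 8, 9])]
worker_b = [("a", [10, 20]), ("b", [30, 40, 50]), ("c", [60, 70, 80, 90])]
totals, num_calls = fused_allreduce([worker_a, worker_b], buffer_capacity=5)
print(totals, num_calls)
```

With a capacity of five elements, tensors a and b share one buffer and c gets its own, so three tensors cost two reduction calls instead of three. The saving grows when a model has many small tensors, because each call carries a fixed overhead.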
And what's more? You can put Horovod to work on your own team's machine learning use cases, too!
GitHub: Horovod
Image Source: Uber Engineering Blog