This paper presents the design of Glow (Graph-Lowering), a machine learning compiler for heterogeneous hardware. Its approach to compilation enables the generation of highly optimized code for multiple targets.
How Does Glow Work?
Glow lowers the traditional neural network dataflow graph into a two-phase strongly-typed intermediate representation. The high-level intermediate representation allows the optimizer to perform domain-specific optimizations. The lower-level instruction-based address-only intermediate representation allows the compiler to perform memory-related optimizations. At the lowest level, the optimizer performs machine-specific code generation to take advantage of specialized hardware features.
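To make the two levels more concrete, here is a minimal Python sketch. It is not Glow's actual C++ implementation: the `Node` class, the `irgen` function and the textual instruction format are hypothetical names chosen for illustration, and the sketch assumes the nodes are already listed in execution order. It turns value-based graph nodes into instructions that only refer to named buffers, which is what makes buffer lifetimes visible to memory-related optimizations.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """High-level graph node that consumes and produces values (hypothetical)."""
    name: str
    op: str
    inputs: List[str]  # names of producer nodes or graph inputs


def irgen(nodes: List[Node], graph_inputs: List[str], graph_output: str) -> List[str]:
    """Emit address-only instructions for nodes given in execution order."""
    # Record the index of the last node that reads each value.
    last_use = {}
    for i, node in enumerate(nodes):
        for src in node.inputs:
            last_use[src] = i

    program = []
    for i, node in enumerate(nodes):
        # Every result gets an explicit buffer, so its lifetime is visible.
        program.append(f"%{node.name} = alloc")
        operands = ", ".join(f"@in %{src}" for src in node.inputs)
        program.append(f"{node.op} @out %{node.name}, {operands}")
        # Free intermediate buffers right after their last reader.
        for src in dict.fromkeys(node.inputs):  # de-duplicate, keep order
            is_intermediate = src not in graph_inputs and src != graph_output
            if last_use[src] == i and is_intermediate:
                program.append(f"dealloc %{src}")
    return program


nodes = [
    Node("conv", "convolution", ["image", "filter"]),
    Node("relu", "relu", ["conv"]),
    Node("pool", "maxpool", ["relu"]),
]
program = irgen(nodes, graph_inputs=["image", "filter"], graph_output="pool")
for line in program:
    print(line)
```

Because every buffer has an explicit allocation and release point in the instruction stream, a later pass could, for example, reuse the `conv` buffer for `pool`. That pass is left out of this sketch.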
Glow features a lowering phase that enables the compiler to support a large number of input operators as well as a large number of hardware targets by eliminating the need to implement all operators on all targets. The lowering phase is designed to reduce the input space and allow new hardware backends to focus on a small number of linear algebra primitives.
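The idea behind lowering can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Glow's API: the `Node` class, the operator names and the `BACKEND_PRIMITIVES` set are hypothetical, and the sketch assumes a fully connected layer is decomposed into a matrix multiplication followed by a broadcast add.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Node:
    op: str
    inputs: Tuple[str, ...]
    output: str


# Hypothetical set of primitives that a new backend chooses to implement.
BACKEND_PRIMITIVES = {"matmul", "broadcast_add", "relu"}


def lower(nodes: List[Node]) -> List[Node]:
    """Rewrite high-level operators into simpler linear algebra primitives."""
    lowered = []
    for node in nodes:
        if node.op == "fully_connected":
            x, weights, bias = node.inputs
            tmp = node.output + ".matmul"
            # fully_connected(x, W, b) becomes matmul(x, W) + b
            lowered.append(Node("matmul", (x, weights), tmp))
            lowered.append(Node("broadcast_add", (tmp, bias), node.output))
        else:
            # Anything else is passed through unchanged in this sketch.
            lowered.append(node)
    return lowered


graph = [
    Node("fully_connected", ("x", "W", "b"), "fc"),
    Node("relu", ("fc",), "out"),
]
lowered_graph = lower(graph)
print([node.op for node in lowered_graph])
```

After lowering, every operator in the graph belongs to the small primitive set, so the backend never needs its own `fully_connected` implementation.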
The end of power saving due to Moore’s Law, combined with the increased demand for compute power driven by machine learning, has led to a wave of innovation in computer architecture.
Hennessy and Patterson present five principles that guide the design of machine-learning domain specific architectures (DSA):
dedicated local memories,
large numbers of arithmetic units,
simple forms of parallelism,
reduced bit-widths,
domain-specific programming models.
Compilers need to perform advanced whole-graph optimizations in order to execute neural networks efficiently on DSAs. This paper describes some of these techniques.
Traditional machine learning frameworks iterate over the nodes in the graph and execute them one by one. Unfortunately, the node-visitor method of execution is inefficient, even on traditional processors. As a result, machine learning frameworks have started to hand over the graph to compilers that execute code more efficiently.
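The difference between node-by-node execution and compiled execution can be sketched as follows. This is a toy Python illustration, not how any particular framework is implemented: the three-node graph and the function names are hypothetical. It assumes that one source of overhead is that every node makes a full pass over the data and materializes its own intermediate buffer, whereas a compiler that sees the whole graph can emit a single fused loop.

```python
from typing import Callable, List, Tuple

Kernel = Callable[[List[float]], List[float]]

# A tiny graph in execution order: (output name, kernel, input name).
graph: List[Tuple[str, Kernel, str]] = [
    ("scaled", lambda xs: [2.0 * x for x in xs], "input"),
    ("shifted", lambda xs: [x + 1.0 for x in xs], "scaled"),
    ("result", lambda xs: [max(x, 0.0) for x in xs], "shifted"),
]


def run_node_visitor(data: List[float]) -> Tuple[List[float], int]:
    """Visit the nodes one by one; each visit allocates a new buffer."""
    values = {"input": data}
    for output, kernel, source in graph:
        values[output] = kernel(values[source])
    buffers_created = len(values) - 1  # everything except the input
    return values["result"], buffers_created


def run_fused(data: List[float]) -> List[float]:
    """What a compiler could emit: one pass, no intermediate buffers."""
    return [max(2.0 * x + 1.0, 0.0) for x in data]


data = [-1.0, 0.0, 1.5]
visited, buffers = run_node_visitor(data)
fused = run_fused(data)
print(visited, buffers, fused)
```

Both versions compute the same result, but the node visitor creates three buffers where the fused loop creates one.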
Based on the increasing importance of neural networks, the need for energy efficiency in data centres and mobile devices, and the design principles of domain-specific architectures, the authors believe that the machine learning frameworks of the future will focus on providing attractive programming models on top of a layer that integrates compilers for many different targets.
In the Glow project, the focus is on the lower parts of the software stack. The goal is to provide PyTorch and other frameworks with a low-level graph and a code generator for neural networks. The name Glow is an abbreviation for Graph-Lowering, which is the main technique that the compiler uses for generating efficient code.
The Glow low-level graph will not replace the machine learning high-level graph, in the same way that the low-level intermediate representation in compilers does not replace the abstract syntax tree. The aim is to provide a useful compiler toolkit that will allow hardware developers to focus on implementing efficient acceleration hardware, each of which likely differs in capabilities, and use Glow for automating compilation tasks such as instruction selection, memory allocation and graph scheduling.
Summary: The Lifetime of a Glow Instruction
This section of the paper summarizes how instructions travel from the beginning of the compilation pipeline, through the different levels of IR, and on to the backends. This is a high-level overview of the compilation process:
The graph is either loaded via the graph loader (from ONNX or Caffe2 format), or constructed via the C++ interface.
The graph is differentiated if needed.
The graph is optimized.
Linear algebra node lowering takes place.
Additional rounds of optimizations occur, both target independent and target specific.
The graph is scheduled into a linear sequence of nodes that minimizes memory usage.
IRGen converts the low-level graph into instructions.
Low-level IR optimizations are performed.
Backend-specific optimizations and code generation are performed.
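The scheduling step in the list above can be illustrated with a small Python sketch. The five-node graph, its buffer sizes and the brute-force search are all hypothetical and only meant to show why the order of nodes affects memory usage; the scheduler Glow actually uses is not described here. The sketch assumes a node's output buffer is freed as soon as its last consumer has run.

```python
from itertools import permutations

# Hypothetical graph: node -> (output size, list of input nodes)
GRAPH = {
    "A": (4, []),
    "B": (1, ["A"]),
    "C": (4, []),
    "D": (1, ["C"]),
    "E": (1, ["B", "D"]),
}


def is_valid(order):
    """A schedule is valid if every node runs after all of its inputs."""
    seen = set()
    for node in order:
        if any(src not in seen for src in GRAPH[node][1]):
            return False
        seen.add(node)
    return True


def peak_memory(order):
    """Peak sum of live buffer sizes when nodes run in the given order."""
    # Number of consumers of each node that have not run yet.
    remaining = {node: 0 for node in GRAPH}
    for _, inputs in GRAPH.values():
        for src in inputs:
            remaining[src] += 1

    live = peak = 0
    for node in order:
        size, inputs = GRAPH[node]
        live += size  # allocate the output buffer
        peak = max(peak, live)
        for src in inputs:
            remaining[src] -= 1
            if remaining[src] == 0:  # last consumer has run, free it
                live -= GRAPH[src][0]
    return peak


interleaved = ("A", "C", "B", "D", "E")
best = min((o for o in permutations(GRAPH) if is_valid(o)), key=peak_memory)
print(peak_memory(interleaved), best, peak_memory(best))
```

Finishing one branch before starting the other keeps only one large buffer alive at a time. Trying every permutation is only feasible for tiny graphs like this one, so a real scheduler would need a heuristic.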
The performance of Glow and TensorFlow 1.7 is compared on three popular convolutional neural networks listed in the paper. The benchmarks were executed on a Kaby Lake Intel® Core i7-7567U (which does not support AVX-512) running on a single CPU core. Both TensorFlow and Glow were compiled to support the native architecture. TensorFlow was compiled with XLA enabled. The authors used the Keras library to supply and run pre-trained models for TensorFlow. The benchmarks used a batch size of 8.
Performance (in frames per second) did not depend on the batch size, i.e. total execution time scaled linearly with batch size. Glow turned out to be up to 2.5x faster than TensorFlow.
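The relationship between batch size and frames per second can be checked with a couple of lines of Python. The per-frame time of 0.25 seconds is a made-up number for illustration, not a measurement from the paper.

```python
def frames_per_second(batch_size: int, seconds_per_frame: float) -> float:
    # Assumption from the text: total time grows linearly with batch size.
    total_time = batch_size * seconds_per_frame
    return batch_size / total_time


# Hypothetical per-frame cost; the batch size cancels out of the ratio.
fps_batch_1 = frames_per_second(1, 0.25)
fps_batch_8 = frames_per_second(8, 0.25)
print(fps_batch_1, fps_batch_8)
```

Since the batch size appears in both the numerator and the denominator, the throughput is the same for any batch size under this assumption.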