Machine Learning for Apache Spark Provides a Whole New Range of Tools

Oct. 19, 2017, 2:56 p.m. By: Vishakha Jha

Machine Learning

When I think about mankind and innovation, I believe the yearning for more is what transforms innovation into invention. The aim is always to offer better, more reliable, and more efficient products at minimum cost to the consumer. With that same thought process, Microsoft stepped forward and launched a library for distributed deep learning on Spark, an enhancement to the existing Spark ML that is expected to lead us down a better road. Apache Spark is an open-source processing framework that provides high-performance querying on big data and powers large-scale data analytics applications. But there is always a glitch that points the way to improvement: Spark ML earned high customer satisfaction for its features but still struggled with a low-level API. To turn the tables, Microsoft introduced the Machine Learning API for Spark built on the DataFrame-based API.

Microsoft Machine Learning for Apache Spark (MMLSpark) lets you carry out a number of tasks with greater ease, including common model-building tasks in PySpark, which enhances your productivity. It speeds up experimentation with a broader set of machine learning techniques. MMLSpark also provides a whole new range of data science and deep learning tools for Apache Spark. These features give users the opportunity to develop scalable, powerful predictive models through the seamless integration of Spark Machine Learning pipelines with the Microsoft Cognitive Toolkit (CNTK) and OpenCV.

To featurize and train a model on some data with vanilla SparkML, one has to go through a number of steps involving a great deal of code that is not even modular, because of the various requirements and choices you make along the way. In MMLSpark, by contrast, all you need to do is pass your collected data to the model; the rest is done by the library itself. MMLSpark uses DataFrames because of their rich datatypes and compatibility with the Python APIs, which enables a higher level of modularity. Through these APIs one can build image analysis and computer vision pipelines that use cutting-edge DNN algorithms. MMLSpark requires Spark 2.1+, Scala 2.11, and either Python 2.7 or Python 3.5+.

Key features provided by MMLSpark include:

  • Easily accessing images from HDFS into a Spark DataFrame

  • Pre-processing image data using transforms from OpenCV

  • Using pre-trained bidirectional LSTMs from Keras for medical entity extraction

  • Training on a GPU node by feeding data from Spark worker nodes to the GPU VM, for scalable scoring

  • Building scalable image-processing pipelines through OpenCV to read and prepare your data

  • Featurizing free-form text data through suitable APIs, along with primitives in SparkML

  • Implicit featurization of data, enabling easy training of classification and regression models

To make MMLSpark more approachable and available to customers, Microsoft has released it as an open-source project on GitHub, which allows users to explore and access it easily and turns out to be something great for all tech geeks.
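Getting started does not require a separate build: MMLSpark was published as a Spark package, so it can be pulled in when launching PySpark. The coordinate below follows the form used in the project's README at the time; the version number is illustrative, so check the GitHub repository for the current one.

```shell
# Launch PySpark with the MMLSpark package resolved automatically
# (version number is illustrative -- see the GitHub README).
pyspark --packages Azure:mmlspark:0.9
```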