Auto-tuning data science: New research streamlines machine learning

Dec. 25, 2017, 3:04 a.m. By: Kirti Bakshi


ATM (Auto Tune Models) is an open-source software library developed under the Human Data Interaction Project at MIT. It is a distributed, scalable AutoML system designed with ease of use in mind. ATM takes in data with pre-extracted feature vectors and labels (the target column) in a simple CSV file format and attempts to learn several classifiers (machine learning models that predict the label) in parallel. At the end, ATM returns a number of classifiers, along with the best classifier and its specified set of hyperparameters.
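The input and output described above can be pictured with a small, standard-library-only sketch. It is not ATM's code or API: the CSV contents, the column name "class", the two toy classifiers, and the `tune` function are all hypothetical, and the candidates are trained one after another rather than in parallel.

```python
import csv
import io
from collections import Counter

# Hypothetical toy data: two feature columns and a label column named "class".
CSV_TEXT = """f1,f2,class
0.1,0.2,a
0.2,0.1,a
0.0,0.3,a
0.9,0.8,b
0.8,0.9,b
1.0,0.7,b
0.15,0.25,a
0.85,0.75,b
"""

def load_csv(text, label_column):
    """Split rows into numeric feature vectors and string labels."""
    rows = list(csv.DictReader(io.StringIO(text)))
    features = [[float(v) for k, v in row.items() if k != label_column] for row in rows]
    labels = [row[label_column] for row in rows]
    return features, labels

def majority_classifier(train_x, train_y):
    """Baseline: always predict the most common training label."""
    most_common = Counter(train_y).most_common(1)[0][0]
    return lambda x: most_common

def nearest_neighbor_classifier(train_x, train_y):
    """1-nearest-neighbor using squared Euclidean distance."""
    def predict(x):
        distances = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in train_x]
        return train_y[distances.index(min(distances))]
    return predict

def accuracy(predict, test_x, test_y):
    """Fraction of held-out rows whose label is predicted correctly."""
    return sum(predict(x) == y for x, y in zip(test_x, test_y)) / len(test_y)

def tune(text, label_column, holdout=2):
    """Fit every candidate, score it on the last `holdout` rows, return all scores and the best name."""
    features, labels = load_csv(text, label_column)
    train_x, test_x = features[:-holdout], features[-holdout:]
    train_y, test_y = labels[:-holdout], labels[-holdout:]
    candidates = {"majority": majority_classifier, "1-nn": nearest_neighbor_classifier}
    scores = {name: accuracy(fit(train_x, train_y), test_x, test_y)
              for name, fit in candidates.items()}
    best = max(scores, key=scores.get)
    return scores, best

scores, best = tune(CSV_TEXT, "class")
print(scores, "best:", best)
```

The point of returning `scores` alongside `best` is that the caller sees every candidate, not only the winner.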

How It Works:

ATM uses on-demand cloud computing to generate and compare hundreds or even thousands of models overnight. To search through modelling techniques, the researchers use an intelligent selection mechanism: the system tests many models in parallel, evaluates each one, and allocates more computational resources to the techniques that show promise. Poor solutions fall by the wayside while the best options rise to the top.
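One common way to "allocate more resources to the techniques that show promise" is a multi-armed bandit. The sketch below uses the UCB1 rule purely as an illustration; the article does not say which selection rule ATM uses, and the method names and simulated score ranges are hypothetical stand-ins for training a real model.

```python
import math
import random

def ucb1_select(stats):
    """Pick the method with the highest upper confidence bound on its mean score."""
    # Try every method once before comparing bounds.
    for name, (count, _) in stats.items():
        if count == 0:
            return name
    total = sum(count for count, _ in stats.values())

    def bound(item):
        count, mean = item[1]
        return mean + math.sqrt(2 * math.log(total) / count)

    return max(stats.items(), key=bound)[0]

def run_search(methods, budget, seed=0):
    """Spend `budget` trials, giving more of them to methods that score well."""
    rng = random.Random(seed)
    stats = {name: (0, 0.0) for name in methods}  # name -> (trials, mean score)
    for _ in range(budget):
        name = ucb1_select(stats)
        score = methods[name](rng)  # stand-in for "train one model, return its accuracy"
        count, mean = stats[name]
        stats[name] = (count + 1, mean + (score - mean) / (count + 1))
    return stats

# Hypothetical methods: each call simulates one trained model's accuracy.
METHODS = {
    "svm": lambda rng: rng.uniform(0.80, 0.95),
    "random_forest": lambda rng: rng.uniform(0.60, 0.75),
    "knn": lambda rng: rng.uniform(0.30, 0.45),
}

stats = run_search(METHODS, budget=200)
for name, (count, mean) in stats.items():
    print(f"{name:14s} trials={count:3d} mean={mean:.3f}")
```

Weaker methods still receive some trials, so a technique that starts badly is not discarded outright, but most of the budget flows to the strongest one.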

Rather than blindly choosing the “best” option and handing it back to the user, ATM displays the results as a distribution, allowing different methods to be compared side by side. In this way, ATM speeds up the process of testing and comparing different modelling approaches without automating away human intuition, which remains a vital part of the data science process.
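To see why a distribution can be more informative than a single winner, consider this minimal sketch. The accuracy values and method names are made up for illustration, and `summarize` is a hypothetical helper, not part of ATM.

```python
import statistics

def summarize(scores_by_method):
    """Return (min, median, max) per method, sorted by median score, best first."""
    summary = {
        name: (min(s), statistics.median(s), max(s))
        for name, s in scores_by_method.items()
    }
    return sorted(summary.items(), key=lambda item: item[1][1], reverse=True)

# Hypothetical cross-validated accuracies, one value per tried hyperparameter setting.
SCORES = {
    "decision_tree": [0.71, 0.74, 0.69, 0.80],
    "svm": [0.55, 0.88, 0.91, 0.60],
    "knn": [0.78, 0.79, 0.77, 0.80],
}

ranking = summarize(SCORES)
for name, (low, median, high) in ranking:
    print(f"{name:14s} min={low:.2f} median={median:.2f} max={high:.2f}")
```

In this made-up data the single best model is an SVM, yet the SVM scores vary widely while k-nearest neighbors is consistently good. That is the kind of trade-off a human can weigh only when the whole spread is shown.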

Testing And Results:

The researchers then tested the system against humans using the collaborative crowdsourcing platform OpenML, on which data scientists work together to solve problems. ATM analyzed about 371 datasets from the platform, and the researchers found that the system delivered a solution better than the one humans had developed 30% of the time.

ATM also worked much more quickly than the humans: it took OpenML users an average of 200 days to deliver a solution, while ATM took less than a day to create a better-performing model.

"ATM can, therefore, augment the work of data scientists, and offer them more peace of mind when they are selecting the right model," Arun Ross, a senior author on the paper and a professor in the Computer Science and Engineering department at Michigan State University, told MIT News.

Current Status Of The System:

atm and its accompanying library, btb, are under active development as they transition from an older system to a new one. Over the next couple of weeks, the developers expect to update the documentation and the testing infrastructure, provide APIs, and establish a framework through which the community can contribute.

ATM currently has the following features:

  • It can run simultaneously on multiple datasets.

  • It can run on AWS or on a compute cluster.

  • It uses a variety of AutoML approaches for tuning and selection, which are available in the accompanying library btb.

  • It stores models, metrics and cross-validated accuracy information about each classifier it has learned.

There are a number of ways to use the system, and most of them are controlled through the three YAML files in config/templates/. The provided documentation is said to cover all of these scenarios and settings.

The researchers have open-sourced ATM, making it available to enterprises that wish to use it. They have also included provisions that allow researchers to integrate new model selection techniques and thus continuously improve the platform.

ATM can run on local computing clusters, on on-demand clusters, or on a single machine in the cloud, and it can work with multiple datasets and multiple users simultaneously.

"A small- to medium-sized data science team can set up and start the production of models with just a few steps," said Kalyan Veeramachaneni, a co-author of the paper. And none of those steps is a "what-if."

For More Information: GitHub