Pomegranate v0.9.0 now released: Flexible probabilistic modelling in Python

Jan. 10, 2018, 9:42 a.m. By: Kirti Bakshi

Pomegranate

The Python ecosystem has become increasingly popular for both processing and analysing data.

Pomegranate is an open-source machine learning package for probabilistic and graphical models in Python, implemented in Cython for speed. It grew out of the YAHMM package, whose components could be rearranged to build many other kinds of models.

Pomegranate aims to provide fast, efficient, and extremely flexible probabilistic models, and it was designed to be easy to use without sacrificing computational efficiency. Models can either be specified by writing out each component individually when the parameters are known beforehand, or learned directly from data when they are not.
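
As a rough illustration of the two routes (specifying a model versus learning it from data), here is a minimal sketch. It deliberately uses `statistics.NormalDist` from the Python standard library as a stand-in, not pomegranate's own classes, and the sample values are made up for the example.

```python
from statistics import NormalDist

# Route 1: the parameters are known beforehand, so write them out directly.
specified = NormalDist(mu=5.0, sigma=2.0)

# Route 2: the parameters are unknown, so estimate them from data.
# These samples are invented purely for illustration.
samples = [2.0, 4.0, 4.0, 5.0, 5.0, 7.0, 8.0]
learned = NormalDist.from_samples(samples)

# Either way, the result is the same kind of object and is used the same way.
print(specified.pdf(5.0))
print(learned.mean, learned.stdev)
```

Both routes end with an object that answers the same questions, which is the property the package is described as offering for its own models.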

At its most basic level, a probabilistic model is a simple probability distribution. If we are modelling language, for example, this could be a simple distribution over the frequency of all possible words.
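
To make the language example concrete, the following sketch builds such a distribution from a short string using only the standard library. The function name `word_distribution` and the sample sentence are hypothetical and chosen for this illustration; this is not how pomegranate represents distributions internally.

```python
from collections import Counter


def word_distribution(text):
    """Return a dict mapping each word to its relative frequency in text."""
    words = text.lower().split()
    counts = Counter(words)
    total = sum(counts.values())
    # Dividing each count by the total makes the values sum to 1.
    return {word: count / total for word, count in counts.items()}


dist = word_distribution("the cat sat on the mat")
print(dist["the"])  # "the" accounts for 2 of the 6 words
```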

Pomegranate also supports parallelized model fitting and model prediction, both in a data-parallel manner, and this is one of the key features of the package. Because the backend is written in Cython, the global interpreter lock (GIL) can be released and multi-threaded training can be supported via joblib. This means that when parallelization is used, time is not spent piping data from one process to another, nor are multiple copies of the model made.
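
The data-parallel idea can be sketched as follows: split the data into chunks, let each worker thread summarize its chunk, and combine the summaries into one set of parameters. This is a minimal pure-Python sketch, assuming a one-dimensional Gaussian and using the standard library's `ThreadPoolExecutor` in place of joblib; the helper names are hypothetical. Pure Python threads do not release the GIL, so this shows the structure only and not the speed-up that a Cython backend can obtain.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def summarize(chunk):
    """Summarize one chunk as (count, sum, sum of squares)."""
    return len(chunk), sum(chunk), sum(x * x for x in chunk)


def fit_gaussian_parallel(data, n_jobs=2):
    """Estimate a Gaussian's mean and standard deviation chunk by chunk."""
    if not data:
        raise ValueError("data must not be empty")
    # Split the data into at most n_jobs roughly equal chunks.
    chunk_size = math.ceil(len(data) / n_jobs)
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Threads share memory, so no data is copied between processes.
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        stats = list(pool.map(summarize, chunks))
    # The per-chunk summaries are simply added together.
    n = sum(s[0] for s in stats)
    total = sum(s[1] for s in stats)
    total_sq = sum(s[2] for s in stats)
    mean = total / n
    # Maximum-likelihood (population) variance, clamped against rounding error.
    variance = max(total_sq / n - mean ** 2, 0.0)
    return mean, math.sqrt(variance)


mean, std = fit_gaussian_parallel([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], n_jobs=3)
print(mean, std)
```

Because the summaries are additive, the result does not depend on how the data is split across workers.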

Pomegranate Version v0.9.0:

This new version of pomegranate currently supports:

  • Probability Distributions

  • General Mixture Models

  • Hidden Markov Models

  • Naive Bayes

  • Bayes Classifiers

  • Markov Chains

  • Discrete Bayesian Networks

To support the above models, it also has efficient implementations of the following:

  • K-means

  • Factor Graphs

One important thing to note is that pomegranate does not yet work with networkx 2.0. If you run into problems, downgrade networkx and try again.

Dependencies:

The requirements of pomegranate are:

  • Cython (only if building from source)

  • NumPy

  • SciPy

  • NetworkX

  • joblib

To run the tests, nose must also be installed.

Because the package is modular, the new missing value support can be used in conjunction with any of its other features. For example, one can add multi-threading to speed up models, do out-of-core learning with incomplete data sets, or even combine missing data with missing labels to do semi-supervised learning on incomplete data.

Pomegranate can be installed either by cloning the GitHub repo or with pip install pomegranate. Wheels should be built for all platforms soon, so that Cython is not needed at all, but some issues have delayed that for now.

Highlights of this version:

  • Missing value support has been added for all models except factor graphs. Missing values are marked with the string nan in string datasets or numpy.nan in numeric datasets. Both model fitting and inference are supported with missing values for all of these models. The basic technique is to skip missing values when collecting sufficient statistics, rather than to impute them.

  • The unit testing suite has been greatly expanded, from around 140 tests to around 370 tests.
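
The missing-value technique from the first highlight can be illustrated with a small sketch: NaN entries are left out when the statistics are collected, and no value is filled in for them. This is a simplified pure-Python illustration for a one-dimensional Gaussian, not pomegranate's actual implementation, and the function name is hypothetical.

```python
import math


def fit_gaussian_ignoring_missing(values):
    """Estimate mean and standard deviation from the observed values only."""
    # Missing entries (NaN) are skipped; nothing is imputed in their place.
    observed = [x for x in values if not math.isnan(x)]
    if not observed:
        raise ValueError("no observed values to fit on")
    n = len(observed)
    mean = sum(observed) / n
    # Maximum-likelihood (population) variance over the observed values.
    variance = sum((x - mean) ** 2 for x in observed) / n
    return mean, math.sqrt(variance)


nan = float("nan")
mu, sigma = fit_gaussian_ignoring_missing([1.0, nan, 3.0, 5.0, nan])
print(mu, sigma)
```

The estimate is the same as the one obtained from the three observed values alone, so the missing entries neither bias the fit nor need a placeholder.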

The link to the preprint of the pomegranate paper: Click Here