Industry's Fastest Inference Implementation: Presenting the New Version of the CatBoost Gradient Boosting Library

Feb. 12, 2018, 2:27 p.m. By: Kirti Bakshi

CatBoost

CatBoost is an open-source library for gradient boosting over decision trees. It supports categorical features out of the box and is available for Python and R.

Now, let's look at the main advantages of CatBoost:

  • Superior quality when compared with other GBDT libraries.

  • Best-in-class inference speed.

  • Support for both numerical and categorical features.

  • Fast GPU and multi-GPU (on one node) support for training.

  • Data visualization tools included.

More Insight into CatBoost:

Version 0.6 of CatBoost brings a lot of speedups and improvements. The most valuable improvement at the moment is the release of the industry's fastest inference implementation.

With this release, CatBoost offers best-in-class inference along with a ton of speedups.

CatBoost uses oblivious trees as base predictors. In an oblivious tree, each leaf index can be encoded as a binary vector whose length is equal to the depth of the tree. The CatBoost model evaluator makes wide use of this fact:

  • We first binarize all used float features, statistics and one-hot encoded features.

  • Then we use these binary features to calculate model predictions.

These vectors can be built in a data-parallel manner with SSE intrinsics.
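The two evaluator steps above can be sketched in plain Python. This is only an illustration of the idea, not CatBoost's actual evaluator, which is vectorized with SSE intrinsics. The function names, the bit order (level i maps to bit i) and the toy tree are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# In an oblivious tree every node on a level shares the same split, so a tree
# of depth d is fully described by d (feature index, border) pairs plus
# 2**d leaf values.
Split = Tuple[int, float]
Tree = Tuple[Sequence[Split], Sequence[float]]


def binarize(row: Sequence[float], splits: Sequence[Split]) -> List[int]:
    """Step 1: turn float features into 0/1 answers, one per tree level."""
    return [1 if row[feature] > border else 0 for feature, border in splits]


def leaf_index(bits: Sequence[int]) -> int:
    """Step 2: read the binary vector as an integer (level i becomes bit i)."""
    index = 0
    for level, bit in enumerate(bits):
        index |= bit << level
    return index


def predict(row: Sequence[float], trees: Sequence[Tree]) -> float:
    """Sum the leaf value selected in each tree of the ensemble."""
    total = 0.0
    for splits, leaf_values in trees:
        total += leaf_values[leaf_index(binarize(row, splits))]
    return total


# Hypothetical depth-2 tree: level 0 tests feature 0 > 0.5,
# level 1 tests feature 1 > 10.0. Four leaves, one per 2-bit vector.
toy_tree: Tree = ([(0, 0.5), (1, 10.0)], [0.1, 0.2, 0.3, 0.4])
toy_model = [toy_tree]

sample = [0.7, 3.0]
bits = binarize(sample, toy_tree[0])
prediction = predict(sample, toy_model)
```

Because the leaf index is just the binary vector read as a number, no per-node branching is needed once the features are binarized, which is what makes a data-parallel implementation possible.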

Speedups in the Open-Source Library:

The CatBoost team has put a lot of effort into speeding up different parts of the library. The current list is below:

  • A 43% speedup for training on large datasets.

  • A 15% speedup for QueryRMSE and the calculation of query-wise metrics.

  • Large speedups when using binary categorical features.

  • Significant speedup (x200 with 5k trees on a 50k-line dataset) for plot and stage predict calculations in cmdline.

  • Compilation time speedup.

Major Features And Improvements of CatBoost:

  • The industry's fastest applier implementation.

  • A new parameter, boosting-type, to switch between the dynamic boosting scheme and the standard boosting scheme, as described in the paper "Dynamic boosting".

  • New bootstrap types, set through the parameters bootstrap_type and subsample. Using the Bernoulli bootstrap type with subsample < 1 might increase the training speed.

  • Better logging for cross-validation, plus the parameters logging_level and metric_period (should be set in training parameters) for cv.

  • A separate train function that receives the parameters and returns a trained model.

  • Ranking mode QueryRMSE now supports default settings for dynamic boosting.

  • Pre-built R-package binaries are now included in the release.
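As a rough sketch of how the parameters listed above might be combined in the Python package: the parameter names come from the list, while the values, the helper name fit_with_train_function and its arguments are assumptions for illustration. The accepted values for boosting_type have varied between CatBoost versions, so check the documentation of the installed version.

```python
# Parameter names follow the feature list above; the values are
# illustrative assumptions, not recommended settings.
params = {
    "loss_function": "Logloss",
    "iterations": 200,
    "boosting_type": "Plain",       # standard scheme; the name of the other
                                    # scheme's value depends on the version
    "bootstrap_type": "Bernoulli",  # sample each object independently
    "subsample": 0.8,               # < 1 might increase the training speed
    "logging_level": "Silent",
    "metric_period": 50,            # compute metrics every 50 iterations
}


def fit_with_train_function(features, labels, cat_feature_indices):
    """Hypothetical helper: train via the standalone train function."""
    # Imported lazily so the parameter dict above can be inspected
    # even where CatBoost is not installed.
    from catboost import Pool, train

    pool = Pool(features, label=labels, cat_features=cat_feature_indices)
    return train(pool=pool, params=params)
```

Calling fit_with_train_function with a feature table, a label list and the indices of the categorical columns would return a trained model, assuming CatBoost is installed.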

The team has also added many synonyms for parameter names, so it is now more convenient for new users who are used to some other library to try CatBoost.

Bug Fixes and Other Changes:

  • Fix for CPU QueryRMSE with weights.

  • Addition of several missing parameters into wrappers.

  • Fix for data split in querywise modes.

  • Better logging.

  • From this release, users are provided with pre-built R binaries.

  • More parallelisation.

  • Memory usage improvements.

  • Some other bug fixes.

For More Information: CatBoost Release 0.6

GitHub: CatBoost