XGBoost: A Scalable Tree Boosting System.

July 9, 2018, 3:19 p.m. By: Kirti Bakshi


Tree boosting is a highly effective and widely used machine learning method. In this paper, the authors describe XGBoost, a scalable end-to-end tree boosting system that data scientists use widely to achieve state-of-the-art results on many machine learning challenges. A novel sparsity-aware algorithm is proposed for sparse data, as well as a weighted quantile sketch for approximate tree learning. More importantly, to build a scalable tree boosting system, they provide insights on cache access patterns, data compression, and sharding. By combining these insights, XGBoost scales beyond billions of examples while using far fewer resources than existing systems.
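To make the tree boosting idea concrete, here is a toy sketch in plain NumPy (not the paper's algorithm, and far simpler than XGBoost's optimized implementation): gradient boosting for squared error with depth-1 regression trees ("stumps"), where each round fits a stump to the current residuals:

```python
import numpy as np

def fit_stump(X, r):
    """Fit a depth-1 regression tree (stump) to residuals r by exhaustive split search."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue  # skip degenerate splits
            pred = np.where(left, r[left].mean(), r[~left].mean())
            sse = ((r - pred) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, t, r[left].mean(), r[~left].mean())
    return best[1:]  # (feature, threshold, left value, right value)

def boost(X, y, rounds=50, eta=0.3):
    """Gradient boosting for squared error: each new stump fits the current residuals."""
    pred = np.zeros_like(y, dtype=float)
    stumps = []
    for _ in range(rounds):
        j, t, wl, wr = fit_stump(X, y - pred)
        pred += eta * np.where(X[:, j] <= t, wl, wr)
        stumps.append((j, t, wl, wr))
    return stumps, pred

# Toy data: the target depends only on feature 0.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
stumps, pred = boost(X, y)
```

Each round shrinks the residual by a factor controlled by the learning rate eta; XGBoost grows deeper trees using second-order gradient statistics and the regularized objective described in the paper.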


In this paper, XGBoost, a scalable machine learning system for tree boosting, is described. The impact of the system has been widely recognized in a number of machine learning and data mining challenges; take, for example, the challenges hosted by the machine learning competition site Kaggle. Among the 29 challenge-winning solutions published on Kaggle's blog in 2015, 17 used XGBoost. Of these, eight used only XGBoost to train the model, while most of the others combined XGBoost with neural nets in ensembles.

Examples of the problems tackled in these winning solutions include ad click-through rate prediction, web text classification, prediction of store sales, customer behaviour prediction, high-energy physics event classification, and motion detection.

While domain-dependent data analysis and feature engineering play an important role in these solutions, the fact that XGBoost is the consensus choice of learner shows the impact and importance of both the system and tree boosting itself.
Scalability in all scenarios, achieved through several important systems and algorithmic optimizations, turns out to be the most important factor behind the success of XGBoost. The system runs more than ten times faster than existing popular solutions on a single machine and scales to billions of examples in distributed or memory-limited settings.

These Innovations Include The Following:

  • A novel tree learning algorithm for the handling of sparse data.

  • A theoretically justified weighted quantile sketch procedure that enables the handling of instance weights in approximate tree learning.

  • Parallel and distributed computing, which makes learning faster and thereby enables quicker model exploration.
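The sparsity-aware idea can be sketched as follows: at each split, the learner also chooses a default direction for missing values by scoring both options with the gradient statistics. The sketch below is a simplified illustration (the function name and toy data are invented here; in the paper this choice happens inside the split-finding loop):

```python
import numpy as np

def best_direction_for_missing(x, g, h, threshold, lam=1.0):
    """Score sending missing values left vs. right at a given split.

    x: one feature column, with NaN marking missing values
    g, h: first- and second-order gradient statistics per instance
    """
    def score(gs, hs):
        # structure score of a leaf with summed stats (gs, hs), L2 penalty lam
        return gs * gs / (hs + lam)

    present = ~np.isnan(x)
    left = present & (x <= threshold)
    right = present & (x > threshold)
    GL, HL = g[left].sum(), h[left].sum()
    GR, HR = g[right].sum(), h[right].sum()
    Gm, Hm = g[~present].sum(), h[~present].sum()

    gain_left = score(GL + Gm, HL + Hm) + score(GR, HR)    # missing go left
    gain_right = score(GL, HL) + score(GR + Gm, HR + Hm)   # missing go right
    return 'left' if gain_left >= gain_right else 'right'

# Toy example: the missing instance's gradient resembles the left branch.
x = np.array([0.0, 1.0, np.nan, 5.0])
g = np.array([-1.0, -1.0, -1.0, 2.0])
h = np.ones(4)
direction = best_direction_for_missing(x, g, h, threshold=2.0)
```

Because only the non-missing entries need to be scanned, the split search runs in time proportional to the number of present values, which is what makes the algorithm effective on sparse data.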

The Major Contributions Of This Paper Are Listed As Follows:

  • The design and build of an end-to-end tree boosting system that is highly scalable.

  • Proposal of a theoretically justified weighted quantile sketch for efficient proposal calculation.

  • Introduction of a novel sparsity-aware algorithm for parallel tree learning.

  • Proposal for an effective cache-aware block structure for out-of-core tree learning.
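To illustrate what the weighted quantile sketch computes: candidate split points chosen so that the total instance weight between consecutive candidates is bounded. The sketch below computes exact weighted quantiles with NumPy (the function name is invented for this illustration); the paper's contribution is doing this approximately in a streaming, mergeable fashion with provable error bounds:

```python
import numpy as np

def weighted_quantile_candidates(x, w, eps=0.25):
    """Exact weighted quantiles as split candidates: pick points so that the
    weighted rank between consecutive candidates is at most eps."""
    order = np.argsort(x)
    xs, ws = x[order], w[order]
    cum = np.cumsum(ws) / ws.sum()        # weighted rank in (0, 1]
    targets = np.arange(eps, 1.0, eps)    # desired rank levels
    idx = np.searchsorted(cum, targets)   # first point reaching each level
    return np.unique(xs[idx])

# With uniform weights this reduces to ordinary quantiles.
cands = weighted_quantile_candidates(np.arange(10.0), np.ones(10))
```

In XGBoost the weight of each instance is its second-order gradient statistic, which is why an ordinary (unweighted) quantile sketch is not sufficient.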

While there are some existing works on parallel tree boosting, directions such as out-of-core computation and sparsity-aware learning had not been explored. More importantly, an end-to-end system that combines all of these aspects gives a novel solution for real-world use-cases. This enables researchers as well as data scientists to build powerful variants of tree boosting algorithms.

The Remainder Of The Paper Is Organized As Follows:

  • First, they review tree boosting and introduce a regularized objective in Sec. 2.

  • They then describe the split finding methods in Sec. 3 as well as the system design in Sec. 4, including experimental results, where relevant, to provide quantitative support for each optimization described.

  • Related work is discussed in Sec. 5.

  • Detailed end-to-end evaluations are included in Sec. 6.

  • Finally, they conclude the paper in Sec. 7.
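For reference, the regularized objective reviewed in Sec. 2 of the paper, for an ensemble of K trees f_k, predictions ŷ_i, and a differentiable convex loss l, is

\[ \mathcal{L}(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k=1}^{K} \Omega(f_k), \qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2, \]

where T is the number of leaves in a tree, w its vector of leaf weights, and γ and λ control the complexity penalty that helps avoid over-fitting.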


In this paper, the lessons learnt while building XGBoost, a scalable tree boosting system, were described. XGBoost is widely used by data scientists and provides state-of-the-art results on many problems. A novel sparsity-aware algorithm was proposed for handling sparse data, along with a theoretically justified weighted quantile sketch for approximate learning.
Their experience shows that cache access patterns, data compression, and sharding are essential elements for building a scalable end-to-end tree boosting system. These lessons can be applied to other machine learning systems as well. By combining these insights, XGBoost is able to solve real-world-scale problems using a minimal amount of resources.

For More Information: GitHub

Link To The PDF: Click Here