# Predicting customer churn for a fictional TELCO company

In an earlier post, I focused on an in-depth visit with CHAID (Chi-square automatic interaction detection). Quoting myself, I said: "As the name implies it is fundamentally based on the venerable Chi-square test – and while not the most powerful (in terms of detecting the smallest possible differences) or the fastest, it really is easy to manage and more importantly to tell the story after using it."

In this post I'll spend a little time comparing CHAID with a random forest algorithm in the ranger library and with a gradient boosting algorithm via the xgboost library. I'll use the exact same data set for all three so we can draw some easy comparisons about their speed and their accuracy. I do believe CHAID is a great choice for some sets of data and some circumstances, but I'm interested in some empirical information, so off we go.

ranger and xgboost are available from CRAN and are straightforward to install. CHAID isn't on CRAN, but I have provided the commented-out install command below; if you've never used CHAID before you may also not have partykit. When you load the libraries you'll also get a variety of messages, none of which is relevant to this example, so I've suppressed them.

```r
# install.packages("CHAID", repos = "http://R-Forge.R-project.org")
library(ggplot2)      # theme_set() below comes from ggplot2
theme_set(theme_bw()) # set theme for ggplot2
require(kableExtra)   # just to make the output nicer
```

We're going to use a dataset that comes to us from the IBM Watson Project. It's a very practical example and an understandable dataset, and a great use case for the algorithms we'll be using. Imagine yourself in a fictional company faced with the task of trying to predict which customers are going to leave your business for another provider, a.k.a. churn. Obviously we'd like to be able to predict this phenomenon and potentially target these customers for retention, or just better project our revenue. Being able to predict churn even a little bit better could save us lots of money, especially if we can identify the key indicators and influence them. In the original posting I spent a great deal of time explaining the mechanics of loading and prepping the data.

One other thing worth comparing is what the models tell us about variable importance. The feature importance calculation in both cases (Random Forest and XGBoost) is the same: given a tree, go over all the nodes of the tree and do the following (from The Elements of Statistical Learning, p. 368, freely available here):

> At each such node $t$, one of the input variables $X_{v(t)}$ is used to partition the region associated with that node into two subregions; within each a separate constant is fit to the response values. The particular variable chosen is the one that gives maximal estimated improvement $\hat{\imath}_t^2$ in squared-error risk over that for a constant fit over the entire region. The squared relative importance of variable $X_\ell$ is the sum of such squared improvements over all internal nodes for which it was chosen as the splitting variable.

Thus, both Random Forest and XGBoost generalize this method: for each tree, do the method above. Then, average across all the trees used in each ensemble. That is the best concise theoretical explanation that I can do.
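In symbols (my own notation for the ESL passage above, not anything from the original post): for a single tree $T$ with $J$ terminal nodes, write $v(t)$ for the variable used to split internal node $t$ and $\hat{\imath}_t^2$ for the estimated improvement there. Then the squared relative importance of predictor $X_\ell$ is

$$\mathcal{I}_\ell^2(T) = \sum_{t=1}^{J-1} \hat{\imath}_t^2 \, I\big(v(t) = \ell\big),$$

and for an ensemble of $M$ trees both algorithms simply average the per-tree values:

$$\mathcal{I}_\ell^2 = \frac{1}{M} \sum_{m=1}^{M} \mathcal{I}_\ell^2(T_m).$$

The indicator $I(\cdot)$ keeps only the internal nodes that actually split on $X_\ell$, which is why a variable that is never chosen gets an importance of zero.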
A more practical one can come from the docs of the respective libraries. For Random Forest, I recommend you read this cool post from Jeremy and Terence explaining the perils of this technique and why they prefer another mechanism, permutation importance. There they quote this post on Stack Overflow explaining how the above mechanism is implemented in scikit-learn:

> It is sometimes called "gini importance" or "mean decrease impurity" and is defined as the total decrease in node impurity (weighted by the probability of reaching that node (which is approximated by the proportion of samples reaching that node)) averaged over all trees of the ensemble.

For XGBoost, the plot_importance method gives the following options for how the importance is calculated: either "weight", "gain", or "cover".

- "weight" is the number of times a feature appears in a tree
- "gain" is the average gain of splits which use the feature
- "cover" is the average coverage of splits which use the feature, where coverage is defined as the number of samples affected by the split

Note that "gain" would be the most similar to what I said before.
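To make that concrete, here is a minimal sketch (not the post's actual pipeline) of pulling these importance measures in R. It assumes a prepped data frame named `churn` with a Yes/No factor outcome `Churn`; both the name and the prep are placeholders of mine. ranger's impurity importance is the "gini importance" / mean decrease in impurity described in the quote above, and the Gain, Cover, and Frequency columns returned by `xgb.importance()` line up with the "gain", "cover", and "weight" options.

```r
library(ranger)
library(xgboost)

# ranger: ask for impurity-based importance at fit time
rf_fit <- ranger(Churn ~ ., data = churn, importance = "impurity")
sort(importance(rf_fit), decreasing = TRUE)  # named vector, biggest first

# xgboost wants a numeric matrix, so one-hot encode the predictors
X <- model.matrix(Churn ~ . - 1, data = churn)
y <- as.integer(churn$Churn == "Yes")
xgb_fit <- xgboost(data = X, label = y, nrounds = 50,
                   objective = "binary:logistic", verbose = 0)

# Gain, Cover, and Frequency correspond to "gain", "cover", and "weight"
imp <- xgb.importance(model = xgb_fit)
head(imp)
xgb.plot.importance(imp, top_n = 10)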