Ensembles of Decision Trees
Ensembles are a type of machine learning technique that combines multiple models to create a more powerful, accurate, and robust model. Two common ensemble approaches are bagging and boosting.
Bagging
Bagging (or bootstrap aggregation) is an ensemble technique used to reduce the variance of predictions by combining the results of multiple models trained on different subsets of the same data. Each model is given equal weight, and each model is built independently. The three most common bagging algorithms are BaggingClassifier/Regressor, Random Forest, and ExtraTrees.
BaggingClassifier/Regressor
- Create random subsamples of the data with replacement
- Train a decision tree on each subsample
- Given a new sample, average the predictions from each model (or take a majority vote for classification)
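As a quick illustration, here is a minimal scikit-learn sketch of these three steps; the synthetic dataset and hyperparameter values are illustrative assumptions, not part of the description above. BaggingClassifier uses a decision tree as its default base estimator.

```python
# Bagging sketch: 100 trees, each fit on a bootstrap sample drawn with replacement;
# the ensemble prediction is a majority vote over the trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bagging = BaggingClassifier(
    n_estimators=100,   # number of trees
    bootstrap=True,     # sample rows with replacement
    random_state=42,
)
bagging.fit(X_train, y_train)
print("Test accuracy:", bagging.score(X_test, y_test))
```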
Random Forest
This works similarly to the previous approach; the difference lies in the features used for splitting at each node. In a random forest, only a random subset of features (typically of size sqrt(number of features)) is considered at each node, and the split with the best impurity reduction (e.g., Gini) within that subset is chosen, instead of searching the entire feature set. Out-of-bag (OOB) samples are the samples that never appeared in a tree's subsample and can be used for evaluating the model.
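A minimal Random Forest sketch along the same lines (the synthetic data and parameter values are assumptions): max_features="sqrt" restricts each split to a random subset of sqrt(num_features) features, and oob_score=True evaluates the model on the out-of-bag samples described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",  # consider only sqrt(num_features) features at each split
    oob_score=True,       # evaluate on samples left out of each tree's bootstrap sample
    random_state=0,
)
rf.fit(X, y)
print("OOB accuracy:", rf.oob_score_)
```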
ExtraTrees
There are two differences between Random Forest and ExtraTrees:
- In ExtraTrees, a random subset of features is used (just like Random Forest), but thresholds are drawn at random for each candidate feature and the best of these randomly generated thresholds is picked as the splitting rule (in Random Forest, the threshold that gives the highest gain is chosen for each candidate feature).
- ExtraTrees uses the whole original sample, while Random Forest uses bootstrap replicas.
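The sketch below contrasts the two on the same assumed synthetic dataset; note that ExtraTreesClassifier uses bootstrap=False by default (whole sample), while RandomForestClassifier bootstraps.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)                 # bootstrap replicas, best thresholds
et = ExtraTreesClassifier(n_estimators=200, bootstrap=False, random_state=0)  # whole sample, random thresholds

print("Random Forest CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())
print("ExtraTrees CV accuracy:", cross_val_score(et, X, y, cv=5).mean())
```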
A Few Important Things to Note
- Sklearn implementations don’t handle missing values
- Bagging ensembles, being based on decision trees, are not much affected by outliers, since splits depend on the ordering of feature values rather than their magnitude
- Categorical features need to be explicitly handled (one-hot encoding etc.); see the sketch after this list
- Deep, unpruned trees are relatively prone to overfitting; bagging reduces this but does not eliminate it
- Scaling is not typically required in tree-based models
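For the categorical-feature point above, a common pattern is to one-hot encode inside a pipeline. The toy DataFrame and column names below are hypothetical, used only to show the shape of the preprocessing.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical and two numeric columns.
df = pd.DataFrame({
    "city": ["NY", "SF", "LA", "NY", "SF", "LA"] * 10,
    "age": list(range(20, 80)),
    "income": [50 + i for i in range(60)],
})
y = [0, 1] * 30

preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), ["city"])],
    remainder="passthrough",  # numeric columns pass through unscaled; trees don't need scaling
)

model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
])
model.fit(df, y)
print(model.predict(df.head()))
```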
Boosting
Boosting reduces the bias of the predictions of the decision trees. Boosting algorithms are among the most used algorithms in industry because of their great performance on tabular datasets. In this blog, we will learn about AdaBoost (one of the earliest boosting algorithms), the original gradient boosting algorithm, and then the most famous variations: XGBoost, LightGBM, and CatBoost.
What is boosting? Boosting is a supervised machine learning strategy that combines the predictions of multiple weak models (base models) to generate a powerful ensemble model. Unlike classic ensemble approaches such as bagging or averaging, boosting trains the base models sequentially in a way that emphasizes samples misclassified in prior iterations. The goal is to prioritize samples that were incorrectly classified in previous iterations, allowing the model to learn from its mistakes and improve its performance iteratively.
AdaBoost
It generally uses only stumps (trees with a single split node and two leaves).
- All samples get an equal weight (1/num_samples)
- Each feature is evaluated for splitting the dataset into two buckets; the feature whose split gives the lowest weighted Gini impurity is chosen for the stump
- The amount of say the stump gets is based on the error/misclassification it makes: amount of say = 0.5 * log((1 - Total Error) / Total Error)
- For the new sample weights, increase the weight of wrongly classified samples to sample_weight * e^(amount of say) and reduce the weight of correctly classified samples to sample_weight * e^(-amount of say)
- Normalize the weights by dividing each new weight by the total weight
- Create a new dataset by randomly selecting rows with replacement, with the probability of picking a row proportional to its sample weight
- Reinitialize the sample weights of the new dataset to 1/num_samples
- Build the next stump on this new dataset and run the same process again
For prediction, run the sample through all the stumps and group the stumps by the class they predict (0 or 1). Add up the amount of say in each group; the class with the higher total say is chosen for the sample.
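A minimal AdaBoost sketch with scikit-learn (the synthetic data and parameter values are assumptions): AdaBoostClassifier's default weak learner is a depth-1 decision tree, i.e. a stump, and estimator_weights_ exposes each stump's amount of say.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Default base estimator is a depth-1 decision tree (a stump).
ada = AdaBoostClassifier(n_estimators=100, learning_rate=1.0, random_state=1)
ada.fit(X_train, y_train)
print("Test accuracy:", ada.score(X_test, y_test))
print("First few stump weights (amount of say):", ada.estimator_weights_[:5])
```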
What is gradient boosting? Gradient boosting is a machine learning technique based on boosting in a functional space, where the target is the pseudo-residuals.
Original Gradient Boosting Algorithm
It is essential to understand the original gradient boosting algorithm even if you already know the popular algorithms listed above. Let's look at the overall algorithm first and then go into the details.
A) For regression:
We give an initial prediction of the label for all samples, commonly the mean of all the labels. (Why so, you may ask? It comes from the paper that introduced gradient boosting, which in simple terms says that we should initialize with the constant value that minimizes the loss function, mean squared error in most cases; working through that minimization shows that the mean of all the labels is the constant that makes the loss smallest.)
Next we calculate the (pseudo) residuals (actuals minus current predictions) for all the samples. (Why so, you may ask again? Again, it comes from the paper 😂. The pseudo-residuals are essentially the negative derivative of the loss function with respect to the predicted values, which for squared-error loss works out to actuals minus predictions.) This is where the "Gradient" in Gradient Boosting comes from.
Now the first decision tree uses the features to predict the residuals.
The residuals in each leaf node are replaced with their average. The predictions are then updated to: initial prediction + learning rate * average residual of the leaf in which the sample lands. (Why the average again? We need to choose the value for the leaf that minimizes the loss function. Here we also take into account the previous prediction F_(m-1), and the minimization considers only the samples in that leaf rather than all samples. As in step 1, for squared-error loss this works out to the average of the residuals in the leaf.)
New prediction = initial prediction + lr * predicted residual from the 1st tree + lr * predicted residual from the 2nd tree + …. Here the learning rate (ν) is generally between 0 and 1.
This is continued until the desired number of trees is built (typically 100), and each new tree is fit to the residuals of the predictions computed in the previous step.
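To make the loop concrete, here is a from-scratch sketch for squared-error loss, assuming synthetic data and standard scikit-learn regression trees; the function names are made up for this illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def gradient_boost_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    # Step 1: initial prediction F0 = mean of the labels (the constant minimizing squared error).
    f0 = np.mean(y)
    predictions = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        # Step 2: pseudo-residuals = negative gradient of squared-error loss = actuals - predictions.
        residuals = y - predictions
        # Step 3: fit a regression tree to the residuals; each leaf predicts the
        # average residual of the samples that land in it.
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        # Step 4: update the predictions, shrunk by the learning rate.
        predictions += learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees


def gradient_boost_predict(X, f0, trees, learning_rate=0.1):
    # Final prediction = initial prediction + lr * tree1(x) + lr * tree2(x) + ...
    return f0 + learning_rate * sum(tree.predict(X) for tree in trees)


# Illustrative usage on assumed synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)
f0, trees = gradient_boost_fit(X, y)
print(gradient_boost_predict(X[:5], f0, trees))
```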
XGBoost (eXtreme Gradient Boosting)
A) For regression, the following steps are followed:
1. The first step is to make an initial prediction of the y variable (0.5 by default).
2. Club all the residuals into one node. Calculate the similarity score = (sum of residuals)^2 / (number of residuals + lambda), where lambda is the regularizer.
3. Check all possible splits and compute the gain = SimScoreLeft + SimScoreRight - SimScoreRoot.
4. The split that gives the largest gain is used.
5. This is continued till the required depth is reached.
6. Then the tree is pruned based on a threshold gamma: if the gain of a node is less than gamma, the node is removed, and pruning continues up the tree until a node with gain > gamma is reached.
7. Output value = sum of residuals / (number of residuals + lambda) is calculated for each leaf node.
8. New prediction = initial prediction + lr * output value of tree 1.
9. The new residuals are smaller than before. This is continued for the chosen number of trees, and the final prediction = initial prediction + lr * output value of tree 1 + … + lr * output value of tree N.
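In practice, the steps above are exposed through a few hyperparameters of the xgboost library. In this sketch (with assumed synthetic data and values), reg_lambda is the lambda in the similarity score, gamma is the pruning threshold, and base_score is the initial prediction.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 2 * X[:, 0] + rng.normal(scale=0.5, size=500)

model = xgb.XGBRegressor(
    n_estimators=100,   # number of trees
    learning_rate=0.3,  # lr in the prediction update
    max_depth=6,        # depth to which each tree is grown before pruning
    reg_lambda=1.0,     # lambda in the similarity score denominator
    gamma=0.0,          # minimum gain required to keep a split
    base_score=0.5,     # initial prediction
)
model.fit(X, y)
print(model.predict(X[:3]))
```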
For classification, a very similar approach is taken.
Similarity score = (sum of residuals)^2 / (sum of (previous probability * (1 - previous probability)) + lambda).
Output value = sum of residuals / (sum of (previous probability * (1 - previous probability)) + lambda).
Calculate the new log(odds) = previous log(odds) + lr * output value, and convert it to a probability using the sigmoid function.
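A tiny numeric sketch of these classification formulas, with made-up residuals and previous probabilities for the samples in a single leaf:

```python
import numpy as np

residuals = np.array([0.5, -0.3, 0.4])  # actual label minus previous probability
prev_prob = np.array([0.5, 0.3, 0.6])   # previous predicted probabilities
lam = 1.0                               # regularization lambda
lr = 0.3                                # learning rate

denominator = np.sum(prev_prob * (1 - prev_prob)) + lam
similarity = np.sum(residuals) ** 2 / denominator
output_value = np.sum(residuals) / denominator

prev_log_odds = 0.0                     # log(odds) of an initial prediction of 0.5
new_log_odds = prev_log_odds + lr * output_value
new_prob = 1 / (1 + np.exp(-new_log_odds))  # sigmoid back to a probability
print(similarity, output_value, new_prob)
```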
LightGBM
LightGBM uses Gradient-based One-Side Sampling (GOSS) to filter out data instances when searching for a split value. LightGBM provides direct support for categorical features as long as they are encoded as integers. When searching for the optimal split on a categorical feature, it looks for the best way of partitioning the categories into two subsets; instead of checking all combinations, the categories are sorted according to the training objective and then split. LightGBM grows trees leaf-wise rather than level-wise.
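A sketch of LightGBM's native categorical handling (the DataFrame and columns are assumptions): columns stored with the pandas category dtype are picked up automatically, so no one-hot encoding is needed.

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "num_feature": rng.normal(size=1000),
    "cat_feature": rng.integers(0, 10, size=1000),
})
df["cat_feature"] = df["cat_feature"].astype("category")  # integer-encoded categorical
y = (df["num_feature"] + 0.1 * df["cat_feature"].cat.codes > 0.5).astype(int)

# Leaf-wise boosting; categorical splits partition the categories into two subsets
# after sorting them by the training objective.
model = lgb.LGBMClassifier(n_estimators=100)
model.fit(df, y)
print(model.predict(df.head()))
```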
CatBoost
- CatBoost builds symmetric trees: the same feature and split condition are used at every node of a given level. This helps because it gives stable results under changes in hyperparameters (it gives good results even with the default parameters).
- It handles categorical variables inherently. For low-cardinality categorical variables, it does one-hot encoding within the algorithm (no preprocessing needed), which improves speed and quality. For high-cardinality categorical variables, it does ordered target encoding (at prediction time, it uses the averages computed on the training data).
- Ordered boosting to reduce overfitting: it is an alternative to classical boosting where, instead of calculating each sample's gradient with a model that has already seen that sample, the gradients are calculated using a random permutation of the data that the tree has not been fit on.
- It supports missing values by treating them as 'Min' or 'Max' values of a feature and supports handling imbalanced data through the scale_pos_weight parameter.
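A CatBoost sketch tying these points together (the DataFrame, column names, and parameter values are illustrative assumptions): categorical columns are passed via cat_features with no manual encoding, and scale_pos_weight can be raised for imbalanced targets.

```python
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "city": rng.choice(["NY", "SF", "LA"], size=1000),  # categorical, handled internally
    "age": rng.integers(18, 70, size=1000),
})
y = (df["age"] > 40).astype(int)

model = CatBoostClassifier(
    iterations=200,
    learning_rate=0.1,
    cat_features=["city"],   # no one-hot or target encoding done by hand
    scale_pos_weight=1.0,    # increase for an imbalanced positive class
    verbose=0,
)
model.fit(df, y)
print(model.predict(df.head()))
```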
Hope this helps in understanding how boosted trees work. Please do provide your feedback in the form of responses and claps :). Also do follow for more such content on ML and DL.
References
[1] StatQuest with Josh Starmer, YouTube playlists: https://www.youtube.com/@statquest/playlists
[2] Jerome H. Friedman, Stochastic Gradient Boosting: https://jerryfriedman.su.domains/ftp/stobst.pdf