Random Forest

Next, we will have a look at the Random Forest. A Random Forest is a classification method that consists of several uncorrelated decision trees, each of which is grown under a certain type of randomization during the learning process (Ho, 1995). For a classification, every tree in the forest casts a vote, and the class with the most votes determines the final classification. A single decision tree suffers from high variance: if one tree is trained on one half of the training data and another tree on the other half, the two trees can look very different. In ML this is generally described as overfitting (Ho, 1998). Among other things, it has a negative effect on how well the predictions of a decision tree generalize. Random Forest, in contrast, uses feature bagging (Breiman, 2001), so that each tree considers only a randomly selected subset of the available features. The results of the individual trees are then averaged, which smooths the output and minimizes deviations. Bagging is used to obtain a more stable prediction: it means using an ensemble of parallel estimators, each of which overfits its portion of the data, and averaging their results. In this case, the trees are trained on samples of the training data generated by bootstrapping and are combined via a mean value. Random Forest therefore relies on aggregating the results of an ensemble of simple estimators (Geron, 2017). The whole is greater than its parts, as the majority vote of many estimators can end up being better than the vote of any single estimator.
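To make the bagging idea concrete, the following is a minimal sketch in Scikit-Learn, assuming a feature matrix X and a label vector y have already been prepared from the dataset; the split ratio and the number of trees are illustrative, not the values used in our study.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Hold out part of the data for evaluation (X and y are assumed to be loaded).
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each fully grown tree overfits its own bootstrap sample; the ensemble
# then majority-votes the individual predictions, reducing the variance.
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(),   # high-variance base learner
    n_estimators=100,           # number of bootstrapped trees (illustrative)
    bootstrap=True,             # sample the training data with replacement
    random_state=42,
)
bagged_trees.fit(X_train, y_train)
print(bagged_trees.score(X_test, y_test))

Random Forest extends this scheme by additionally randomizing the subset of features considered at each split.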

To apply the algorithm to the existing datasets, Scikit-Learn and its RandomForestClassifier class were used. The structure of a Random Forest exposes various adjustable parameters that control the tree depth, the leaves of the trees, and the branches and splits (Geron, 2017). These parameters are valuable when it comes to tuning the model for the best possible accuracy. We first looked at four parameters individually to get a general overview of how they affect accuracy. From our modelling approach, the following adjustments can be derived. First, min_samples_split should remain at the lowest allowable value, which is 2. Similarly, neither min_impurity_decrease nor min_weight_fraction_leaf benefits from being increased, so both are set to 0. Both max_leaf_nodes and max_depth should be set relatively high to capture the descriptive power of the dataset, although very large values of max_depth lead to overfitting of the training data, so we take care to avoid this high-variance situation. Finally, min_samples_leaf shows a high variance even for small changes in the parameter value; a low value close to the default of 1 is optimal. One last parameter that has a large influence on the Random Forest result is max_features (Bernard et al., 2009). In all our adjustments, increasing the number of trees improves accuracy only up to a certain point (n_estimators = 1,000), and since we took care not to overfit the model, the number of well-fitted trees was set to this value. In addition, the training time increases rapidly with the number of trees, so n_estimators is less a tuning parameter than a trade-off between training time and accuracy. As a result of these adjustments, an accuracy of 100% was achieved on the training set. We will again consider the ROC and the confusion matrix; compared with the KNN algorithm, we achieved a substantially better result.
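As a sketch of how these adjustments translate into code, the following instantiates the RandomForestClassifier with the parameter settings described above; the exact values are illustrative, and X_train, y_train are assumed to be prepared as in the previous sketch.

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=1000,             # more trees stopped helping beyond this point
    min_samples_split=2,           # lowest allowable value
    min_impurity_decrease=0.0,     # no benefit from increasing
    min_weight_fraction_leaf=0.0,  # no benefit from increasing
    max_leaf_nodes=None,           # left unrestricted, i.e. "relatively high"
    max_depth=20,                  # relatively high but bounded to limit overfitting (illustrative)
    min_samples_leaf=1,            # close to the default value
    max_features="sqrt",           # illustrative choice for the feature subsampling
    n_jobs=-1,
    random_state=42,
)
forest.fit(X_train, y_train)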

Random Forest Classifier Confusion Matrix: [[335, 96], [59, 546]]

The first row of the matrix shows 335 non-claims correctly classified (true negatives), almost 100 cases more than KNN classified correctly, and fewer cases (96) wrongly classified as claims (false positives) than we reached before.

In the second row (the positive class), the number of cases wrongly classified as non-claims (false negatives) drops to 59. Finally, 546 cases are now correctly classified as claims (true positives), an increase of 48 compared with KNN.
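The confusion matrix and the ROC discussed here can be reproduced along the following lines; this is a sketch that assumes the fitted forest model and the held-out X_test, y_test from the earlier sketches.

from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

y_pred = forest.predict(X_test)
print(confusion_matrix(y_test, y_pred))      # e.g. [[335, 96], [59, 546]]

# The ROC curve needs the predicted probability of the positive class (a claim).
y_score = forest.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_score)
print(roc_auc_score(y_test, y_score))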

There are a few more ways to adjust the ML algorithm. Boosting, for instance, is a process that provides an efficient decision rule for a classification problem by combining several simple rules. The result, that is, the accurate decision rule, is called a strong classifier. The idea of using boosting methods for classification problems is relatively new (Schapire, 1990). AdaBoost has delivered amazingly good results for many classification problems, so we also want to test this method on our Cyber Insurance dataset. In Scikit-Learn, the AdaBoost classifier can be built on top of a Decision Tree Classifier and is trained sequentially. Although the boosting results are very interesting, we found the computational cost to be very high, and the results obtained (Figures 8.4e-8.4g) show that the ROC of the Random Forest Classifier itself was better.
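A sketch of the AdaBoost experiment is shown below; the depth of the base trees, the number of estimators and the learning rate are illustrative choices, and X_train, y_train are assumed as before.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # weak base learner, i.e. a "simple rule"
    n_estimators=200,                     # rules are added sequentially, reweighting misclassified cases
    learning_rate=0.5,
    random_state=42,
)
ada.fit(X_train, y_train)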

The confusion matrix results support this conclusion, although the outcome is very close to the Gradient Boosting results, as only the false negatives and true positives differ.

AdaBoost Confusion Matrix: [[316, 115], [90, 515]]

Gradient Boosting Confusion Matrix: [[335, 96], [85, 520]]
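For comparison, a Gradient Boosting model can be fitted and evaluated in the same way; again this is only a sketch with illustrative hyperparameters, reusing the ada model and the held-out data from the sketches above.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix

gbc = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                 max_depth=3, random_state=42)
gbc.fit(X_train, y_train)

print(confusion_matrix(y_test, ada.predict(X_test)))  # e.g. [[316, 115], [90, 515]]
print(confusion_matrix(y_test, gbc.predict(X_test)))  # e.g. [[335, 96], [85, 520]]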

To further understand how to interpret the performance of the Random Forest algorithm, we apply it to previously unused customer data. For a case from these customer records, we generate the prediction of a future claim: Insurance Claim Probability array [0.109, 0.891] and Insurance Claim Prediction array [1]. This means that the record has an 11% chance of belonging to class 0 and an 89% chance of belonging to class 1.
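A sketch of how such a single prediction is obtained is given below; new_customer is a placeholder for one preprocessed customer record, shaped as a single row (a 2-D array with the same feature columns as X_train).

# new_customer: one preprocessed record as a single-row 2-D array.
proba = forest.predict_proba(new_customer)   # e.g. [[0.109, 0.891]] -> [P(class 0), P(class 1)]
label = forest.predict(new_customer)         # e.g. [1] -> a future claim is predicted
print(proba, label)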

Many model forms describe the underlying impact of features relative to each other. In Scikit-Learn, Decision Tree models and tree ensembles such as Random Forest provide a feature importance attribute once fitted. We use this attribute to rank and plot the relative importance of the features. Our results (Figure 8.4h) show how forests of trees can be used to evaluate the importance of features in this classification task. The red bars show the feature importances of the forest, together with their inter-tree variability. The plot suggests that the most important feature is CC/PII data, that most of the remaining features are roughly equally important, and that Other IT insurance is not.
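A sketch of how this importance ranking and the red-bar plot can be produced follows; feature_names is assumed to hold the column names of the dataset in the order used for training.

import numpy as np
import matplotlib.pyplot as plt

importances = forest.feature_importances_
# Inter-tree variability: spread of each feature's importance across the individual trees.
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
order = np.argsort(importances)[::-1]

plt.bar(range(len(order)), importances[order], yerr=std[order], color="r")
plt.xticks(range(len(order)), [feature_names[i] for i in order], rotation=90)
plt.title("Feature importances (Random Forest)")
plt.tight_layout()
plt.show()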

 