Model Selection


This work has analyzed several machine learning (ML) approaches for assessing companies’ cyber risk insurability.

As demonstrated throughout this chapter, the best performance was achieved with classification algorithms, in particular with the Random Forest classifier. To evaluate and compare the appropriate choice of training and test sets, we rank the classifiers based on the obtained results; from this ranking, an even better predictor for our Cyber Insurance dataset can be built by aggregating the predictions of the individual classifiers and predicting the class that receives the most votes. With Scikit-Learn, such majority (hard) voting is straightforward to perform. Even if each individual classifier is only a weak learner, the aggregate can be a strong learner. An even better result (Table 8.4) was achieved with soft voting, which averages the predicted class probabilities instead of counting votes. After reaching 100% accuracy on the training set, the Random Forest and the Voting Classifier achieve 87% and 88%, respectively, on the test dataset.
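The soft-voting ensemble described above can be sketched with Scikit-Learn's `VotingClassifier`; the synthetic dataset below is only a stand-in, since the Cyber Insurance dataset itself is not publicly available:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the Cyber Insurance dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

voting_clf = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=42)),
        ("gb", GradientBoostingClassifier(random_state=42)),
        ("ada", AdaBoostClassifier(random_state=42)),
    ],
    voting="soft",  # average predicted probabilities; "hard" = majority vote
)
voting_clf.fit(X_train, y_train)
print(f"test accuracy: {voting_clf.score(X_test, y_test):.2f}")
```

Switching `voting="soft"` to `voting="hard"` reproduces the plain majority vote; soft voting requires every base estimator to implement `predict_proba`, which all three ensemble classifiers do.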

Our learning methods Random Forest and Gradient Boosting showed similar results, with Random Forest having a slight advantage. Both are ensemble methods that train many models and combine their individual predictions, but they follow different approaches to aggregating

Table 8.4 Results after Performing Soft Voting

Gradient Boosting Classifier
Random Forest Classifier
AdaBoost Classifier
Voting Classifier

their results. Random Forest uses bootstrap aggregation (bagging), in which each tree is trained on a randomly drawn subset of the data. Gradient Boosting goes one step further by boosting certain data points so that they have a greater effect on the resulting model: whenever a data point is misclassified, subsequent iterations weight that point higher so that it receives more attention. This sequential learning gives Gradient Boosting an advantage over Random Forest in general. For forecasts on our Cyber Insurance dataset, however, the Random Forest performs somewhat better while using fewer computational resources.
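The bagging-versus-boosting contrast can be illustrated by cross-validating both ensembles side by side; again, the generated dataset is only a placeholder for the chapter's actual data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the Cyber Insurance dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    # Bagging: each tree sees an independent bootstrap sample of the data.
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # Boosting: each tree is fit sequentially to correct its predecessors.
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```

Which method wins depends on the dataset; the point of the sketch is only that both are drop-in classifiers whose aggregation strategies differ, not a definitive benchmark.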


Today’s Cyber Insurance underwriting is a human-driven processing of data. In technical underwriting this approach remains important, since technical underwriting means assessing the current risks and established countermeasures at each company and requires interaction with key stakeholders. Economic underwriting, on the other hand, has so far been based on only a few static values, even though data breaches and cyber incidents are steadily increasing. This chapter surveys various ML algorithms and introduces most of the popular ML algorithms in the context of the growing demand for Cyber Insurance. The results of the analysis on the customer dataset are encouraging overall.

It can be noted that, in the context of the available Cyber Insurance data, no meaningful results were achieved with regression algorithms. We experienced the same with Logistic Regression and the Linear SVM Classifier. These algorithms are therefore not considered any further (Table 8.5).

With the Gaussian RBF SVM, Polynomial SVM, and KNN algorithms we achieve meaningful results, and therefore further research will consider them. The same approach will be applied to the best-performing

Table 8.5 ML Algorithms - No Further Consideration

Linear Regression

Lasso Regression

Ridge Regression

Logistic Regression

Linear Kernel SVM Classifier

Table 8.6 ML Algorithms - For Further Consideration

Random Forest Classifier

Gradient Boosting Classifier

AdaBoost Classifier

Gaussian RBF SVM

Polynomial SVM

K-Nearest Neighbors

algorithms based on our research: the Random Forest Classifier, Gradient Boosting Classifier, and AdaBoost Classifier (Table 8.6).
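The comparison of the shortlisted algorithms from Table 8.6 can be sketched as a simple cross-validation ranking; the synthetic dataset and 5-fold setup below are illustrative assumptions, not the chapter's actual experimental protocol:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Placeholder data standing in for the Cyber Insurance dataset.
X, y = make_classification(n_samples=400, n_features=10, random_state=1)

# The six candidates retained for further consideration (Table 8.6).
candidates = {
    "Random Forest": RandomForestClassifier(random_state=1),
    "Gradient Boosting": GradientBoostingClassifier(random_state=1),
    "AdaBoost": AdaBoostClassifier(random_state=1),
    "Gaussian RBF SVM": SVC(kernel="rbf"),
    "Polynomial SVM": SVC(kernel="poly", degree=3),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

# Rank candidates by mean 5-fold cross-validation accuracy.
ranking = sorted(
    ((cross_val_score(model, X, y, cv=5).mean(), name)
     for name, model in candidates.items()),
    reverse=True,
)
for score, name in ranking:
    print(f"{score:.3f}  {name}")
```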

Further research will concentrate on feature engineering for these algorithms, as well as on possible approaches with multilayer perceptrons (MLP) and deep neural networks.
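A minimal starting point for that direction is Scikit-Learn's `MLPClassifier`; the architecture (two hidden layers of 32 and 16 units) and the synthetic data below are assumptions for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for the Cyber Insurance dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

# Neural networks are sensitive to feature scale, so standardize first.
clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=7),
)
clf.fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.2f}")
```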

References

Alpaydin, E. (2010). Introduction to machine learning, 2nd. ed. Cambridge: The MIT Press.

Altman, N. (1992). An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3), pp. 175-185. Doi: 10.2307/2685209.

Bartolini, D.N., Benavente-Peces, C., Ahrens, A. (2017). Risk assessment and verification of insurability. Proceedings of the 7th International Joint Conference on Pervasive and Embedded Computing and Communication Systems. (PECCS 2017). Madrid: July 24-26, pp. 105-108.

Bartolini, D.N., Zascerinska, J., Benavente-Peces, C., Ahrens, A. (2018). Instrument design for cyber risk assessment in insurability verification. Informatics, Control, Measurement in Economy and Environment Protection, 3, pp. 7-10.

Bernard, S., Heutte, L., Adam, S. (2009). Influence of Hyperparameters on Random Forest Accuracy. International Workshop on Multiple Classifier Systems (MCS). Reykjavik, Iceland: June, pp.171-180. https://hal.

Bishop, C.M. (2006). Pattern recognition and machine learning (information science and statistics). Berlin, Heidelberg: Springer-Verlag.

Boyd, S., Vandenberghe, L. (2004). Convex optimization. USA: Cambridge University Press.

Breiman, L. (2001). Random forests. Machine Learning, 45, pp. 5-32. Doi: 10.1023/A:1010933404324.

Breiman, L., Friedman, J. (1997). Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society. Series B (Methodological), 59(1), pp. 3-54. Retrieved from: stable/2345915. (Access: 19.01.2020).

Bühlmann, P., van de Geer, S. (2011). Statistics for high-dimensional data. Berlin, Heidelberg: Springer.

Burnham, K., Anderson, D.R. (2002). Model selection and multimodel inference: A practical information-theoretic approach. New York, NY: Springer-Verlag Inc.

Chang, Y.W., Hsieh, C.J., Chang, K.W. (2010). Training and testing low-degree polynomial data mappings via linear SVM. Journal of Machine Learning Research, 11, pp. 1471-1490.

Cortes, C., Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, pp. 273-297.

Domingos, P. (2012). A few useful things to know about machine learning. Communications of the ACM, 55, pp. 78-87. Doi: 10.1145/2347736. 2347755.

Fawcett, T. (2006). An Introduction to ROC analysis. Pattern Recognition Letters, 27, pp. 861-874. Doi: 10.1016/j.patrec.2005.10.010.

General Data Protection Regulation. (2016). Final Version. Retrieved from: en.pdf. (Access: 28.11.2019).

Geron, A. (2017). Hands-on machine learning with Scikit-Learn and Tensor Flow: Concepts, tools, and techniques to build intelligent systems. Sebastopol, CA: O’Reilly Media.

Goldberg, Y., Elhadad, M. (2008). SplitSVM: Fast, space-efficient, non-heuristic, polynomial kernel computation for NLP applications. Proceedings of the 46th Annual Meeting of the Association of Computational Linguistics (ACL), pp. 237-240.

Harrell, F.E. (2001). Regression modeling strategies: With applications to linear models, logistic regression, and survival analysis. New York, NY: Springer.

Helmbold, D., Sloan, R., Warmuth, M.K. (1990) Learning nested differences of intersection-closed concept classes. Machine Learning 5, pp. 165-196. Doi:10.1007/BF00116036.

Ho, T.K. (1995). Random decision forest. Proceedings of the 3rd International Conference on Document Analysis and Recognition. Montreal: August, pp. 278-282.

Ho, T.K. (1998). The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell., 20, pp. 832-844.

Muller, K.R. (2012). Active learning with model selection. In: M. Sugiyama, M. Kawanabe (eds.), Machine learning in non-stationary environments: Introduction to covariate shift adaptation. Cambridge: The MIT Press, pp. 215-224.

Muller, K.R. (2017). From measurement to machine learning: Towards analysing cognition. 5th International Winter Conference on Brain-Computer Interface, IEEE, Book Series: International Winter Workshop on Brain-Computer Interface. Sabuk, South Korea, pp. 53-54.

PCI Security Standards Council. (2018). Payment Card Industry (PCI) Data Security Standard Requirements and Security Assessment Procedures Version 3.2.1.

Raschka, S., Mirjalili, V. (2019). Python machine learning: Machine learning and deep learning with Python, Scikit-Learn, and Tensorflow 2, 3rd ed. Birmingham: Packt.

Rud, O. (2009). Business intelligence success factors: Tools for aligning your business in the global economy. Hoboken, New Jersey: John Wiley & Sons Inc.

Samworth, R.J. (2012). Optimal weighted nearest neighbour classifiers. The Annals of Statistics, 40(5), pp. 2733-2763. Doi: 10.1214/12-AOS1049.

Santosa, F., Symes, W.W. (1989). An analysis of least-squares velocity inversion. Geophysical Monograph Series. Tulsa: Society of Exploration Geophysicists.

Schapire, R.E. (1990). The strength of weak learnability. Machine Learning, 5(2), pp. 197-227.

Scikit-Learn. (2019). Documentation.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (Series B), 58, pp. 267-288.

Tikhonov, A.N. (1963). Solution of incorrectly formulated problems and the regularization method. Soviet Mathematics Doklady, 4, pp. 1035-1038.

Yan, X. (2009). Linear regression analysis: Theory and computing. Singapore: World Scientific Publishing Company Pte. Ltd.
