K-Nearest Neighbors Classifier
K-Nearest Neighbors is a supervised learning algorithm in which the model is “trained” on data points labeled with their classification. To predict the class of a new point, it considers the “K” nearest labeled points and assigns the class that is most common among them. A typical dataset for this type of algorithm is made up of several descriptive attributes and a single target attribute (also called the class).
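The voting rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in this work: the function name `knn_predict`, the Euclidean distance, and the toy points are all assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair the Euclidean distance from each training point to the query with its label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours (sorted is stable, nearest first).
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbours' labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data invented for illustration: two descriptive attributes and one class.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["low", "low", "low", "high", "high", "high"]

print(knn_predict(points, labels, query=(2.0, 2.0), k=3))  # low
```

The three points closest to (2.0, 2.0) all carry the label "low", so the vote is unanimous in this case.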
In our problem, we show how it is applied to the Turnover attribute, whose probability distribution function was depicted in Figure 9.1a. The goal is to build a classifier that can predict the class of unknown cases. To do this, we select a specific type of classifier, K-Nearest Neighbors. From the set of available data, we arrange them in the following way:
The classification results show that the best accuracy is 0.07307692307692308, which is obtained with K = 2.
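One common way to arrive at a "best K" is to score several candidate values on held-out data and keep the most accurate one. The sketch below assumes that procedure; the source does not state how K was selected, and the synthetic two-class data, the 70/30 split, and the range of K values are hypothetical choices for the example only.

```python
import math
import random
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training rows closest to `query`."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, test, k):
    """Fraction of test rows whose predicted class equals the true class."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

# Synthetic two-class data (hypothetical; not the dataset used in this chapter).
random.seed(0)
data = [((random.gauss(c, 1.0), random.gauss(c, 1.0)), label)
        for c, label in [(0.0, "A"), (3.0, "B")] for _ in range(50)]
random.shuffle(data)
train, test = data[:70], data[70:]

# Score each candidate K on the held-out rows and keep the most accurate one.
scores = {k: accuracy(train, test, k) for k in range(1, 11)}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

With an even K such as 2, the vote can tie. This sketch resolves a tie in favor of the label of the closest neighbor, which is one arbitrary choice among several.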
Decision Trees are a non-parametric supervised learning method used for classification and regression. The objective is to create a model that predicts the value of an objective variable by learning simple decision rules inferred from the characteristics of the data.
For example, a decision tree learning from the data can approximate a sinusoidal curve with a set of if-then-else decision rules. The deeper the tree is, the more complex the decision rules become and the more closely the model fits the training data.
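The sinusoidal example can be reproduced with a small one-dimensional regression tree written from scratch. This is a sketch under stated assumptions: the greedy squared-error split, the function names, and the 80 sample points are choices made for illustration, not the configuration used in this work.

```python
import math

def build_tree(xs, ys, depth):
    """Fit a 1-D regression tree on points sorted by x.
    A leaf is a float (the mean target); an internal node is
    a tuple (threshold, left_subtree, right_subtree)."""
    mean = sum(ys) / len(ys)
    if depth == 0 or len(xs) < 2:
        return mean
    best_sse, best_i = None, None
    # Try every split position and keep the one with the lowest squared error.
    for i in range(1, len(xs)):
        left, right = ys[:i], ys[i:]
        mean_l, mean_r = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - mean_l) ** 2 for y in left)
               + sum((y - mean_r) ** 2 for y in right))
        if best_sse is None or sse < best_sse:
            best_sse, best_i = sse, i
    threshold = (xs[best_i - 1] + xs[best_i]) / 2
    return (threshold,
            build_tree(xs[:best_i], ys[:best_i], depth - 1),
            build_tree(xs[best_i:], ys[best_i:], depth - 1))

def predict(tree, x):
    """Follow the if-then-else rules from the root down to a leaf."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree

def mse(tree, xs, ys):
    return sum((predict(tree, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# One period of a sine curve, sampled at 80 evenly spaced points.
xs = [i * 2 * math.pi / 79 for i in range(80)]
ys = [math.sin(x) for x in xs]

errors = {depth: mse(build_tree(xs, ys, depth), xs, ys) for depth in (1, 2, 5)}
for depth, err in errors.items():
    print(depth, round(err, 4))
```

The training error shrinks as the depth grows, but this measures fit to the training points only. A very deep tree can also fit noise, so a lower training error does not by itself mean a better model.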
Decision trees have a first node called the root. The remaining input attributes are then split into two branches (there could be more, but we do not cover that here) by posing a condition that may be true or false. Each node forks in two, and the branches are subdivided again until they reach the leaves, which are the final nodes and correspond to the answers: Yes/No, Buy/Sell, or whatever we are classifying.
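The root, branch, and leaf structure can be written down directly as nested data. In the sketch below, the attribute names, the thresholds, and the Yes/No labels are invented purely for illustration and are not taken from the tree trained in this work.

```python
# Internal node: (attribute, threshold, subtree if value <= threshold, subtree otherwise).
# Leaf: a class label string. All names and thresholds here are hypothetical.
TREE = ("cyber_investment", 50_000,
        ("cc_pii_records", 1_000, "Yes", "No"),
        "Yes")

def classify(node, record):
    """Start at the root and follow true/false branches until a leaf is reached."""
    while not isinstance(node, str):
        attribute, threshold, if_true, if_false = node
        node = if_true if record[attribute] <= threshold else if_false
    return node

print(classify(TREE, {"cyber_investment": 10_000, "cc_pii_records": 5_000}))  # No
print(classify(TREE, {"cyber_investment": 80_000, "cc_pii_records": 5_000}))  # Yes
```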
Some of the advantages of decision trees are as follows:
Disadvantages of decision trees include the following:
In this case, we focus on the assessment result, the Rating, based on the following input parameters: Turnover, which gives an idea of how exposed a company is to a cyber-attack; Other IT insurances that the company has contracted; CC/PII data hosted by the company; the Cyber Investment that the company has made to improve its information and communications technology (ICT) infrastructure and security and so reduce cyber risk; KRITIS; and the Insurance claims the company has faced due to successful cyber-attacks. The Train set is composed of 907 input-target pairs, while the Test set is composed of 389 input-target pairs.
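The set sizes are consistent with a 70/30 split of 1,296 records, although the source does not state how the split was made. The sketch below assumes a shuffled split with a rounded-up test fraction; the helper `train_test_split` and the placeholder records are hypothetical.

```python
import math
import random

def train_test_split(records, test_fraction, seed=0):
    """Shuffle a copy of `records` and split it into (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Round the test size up, so 0.3 * 1296 = 388.8 becomes 389.
    n_test = math.ceil(test_fraction * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder integers standing in for the 907 + 389 input-target pairs.
records = list(range(907 + 389))
train, test = train_test_split(records, test_fraction=0.3)
print(len(train), len(test))  # 907 389
```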
In the approach shown in this work, we used the given Rating parameter as the target. This ultimately leads to the contract/do-not-contract decision based on the achieved Rating. In future investigations and developments, we will first categorize and map the Rating score into a YES/NO decision, which is the first step toward automating the process.
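The mapping from a Rating score to a YES/NO decision could be as simple as a threshold rule. The sketch below is one hypothetical form of it: the threshold value, the direction of the scale (`higher_is_better`), and the sample ratings are assumptions, since the source does not define them.

```python
def rating_to_decision(rating, threshold, higher_is_better=True):
    """Map a numeric Rating score to a YES/NO contract decision."""
    if higher_is_better:
        acceptable = rating >= threshold
    else:
        acceptable = rating <= threshold
    return "YES" if acceptable else "NO"

# Hypothetical ratings and threshold, chosen only to exercise the rule.
ratings = [1.5, 2.9, 3.0, 4.2]
decisions = [rating_to_decision(r, threshold=3.0) for r in ratings]
print(decisions)  # ['NO', 'NO', 'YES', 'YES']
```

Once the target is categorical in this way, the Decision Tree Classifier becomes applicable in place of the regressor.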
To optimize the process, further analysis of the different parameters used to build the decision tree is needed to find out features that lead to better and optimal solutions.
Figure 9.4 shows the resulting decision tree. Some remarks are needed to fully understand the result. When analyzing the problem and implementing a solution, we faced two possibilities: using either a Decision Tree Classifier or a Decision Tree Regressor. The first approach requires the target to be clearly organized into categories or, for example, binary decisions (YES/NO). Since the intention of this work is to use the raw data provided by companies, that idea was left for further investigation, and we use the raw Rating data. We therefore focus on the second approach, using the Decision Tree Regressor, to avoid preprocessing the data and obtain an open solution.
This decision involves some advantages and drawbacks, as follows:
Figure 9.4 Decision tree resulting from training and test.
Support Vector Machines
Support vector machines (SVMs) have their origin in work on statistical learning theory and were introduced in the 1990s by Vapnik and his collaborators (Boser et al., 1992; Cortes and Vapnik, 1995). Although SVMs were originally intended to solve binary classification problems, they are currently used to solve other types of problems (regression, clustering, and multiclass classification). They have also been used successfully in diverse fields, such as computer vision, character recognition, hypertext categorization, protein classification, natural language processing, and time series analysis. In fact, since their introduction, they have been earning deserved recognition, thanks to their solid theoretical foundations.
SVM is a machine learning technique that finds the best possible separation between classes. In two dimensions, this separation is a line and is easy to visualize. Machine learning problems normally have many dimensions, so instead of finding the optimal line, the SVM finds the hyperplane that maximizes the margin of separation between classes.
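To make the idea of a maximum-margin hyperplane concrete, the sketch below fits a linear SVM on six toy points by sub-gradient descent on the regularized hinge loss. This is a simplified stand-in for a full SVM solver, not the method used in this work; the learning rate, regularization strength, number of epochs, and data are all assumptions.

```python
import math

def train_linear_svm(points, labels, lr=0.01, lam=0.01, epochs=2000):
    """Fit w, b for the rule sign(w . x + b) by sub-gradient descent on the
    L2-regularised hinge loss. Labels must be +1 or -1."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score < 1:
                # Point is misclassified or inside the margin: move the plane away from it.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Point is outside the margin: apply only the regulariser,
                # which shrinks w and so widens the margin 2 / ||w||.
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane x falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Linearly separable toy data, invented for illustration.
points = [(1, 1), (2, 1), (1, 2), (4, 4), (5, 4), (4, 5)]
labels = [-1, -1, -1, 1, 1, 1]

w, b = train_linear_svm(points, labels)
margin = 2 / math.sqrt(sum(wi * wi for wi in w))
print(w, b, margin)
print([predict(w, b, p) for p in points])
```

With more than two attributes, the same code applies unchanged: `w` simply gains one weight per dimension, and the separating line becomes a hyperplane.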
The dataset used in this investigation consists of sample records concerning several hundred cyber risk assessment reports, each of which contains the values of a set of features related to cyber risk. The fields in each record are the ones described above, that is, Turnover, Other IT Insurance, CC/PII, Rating, KRITIS, Cyber Invest.
In the data analytics applied to this dataset, we first analyze how Cyber Investment and CC/PII affect the Insurance claims registered by companies that were affected in some way.
Figure 9.5a shows how the insurance claims due to damage caused by successful cyber-attacks relate to the company's investment in cyber security and to the risk of holding credit card and personal data, which are a usual target of cyber crime. As expected, the greater the investment in cyber security, the lower the probability of a successful attack.
Further, continuing with the analysis, Figure 9.5b shows the scores that the algorithm obtains.
Finally, Figure 9.5c shows the distribution of the confusion matrix based on the values of the Cyber Investment of the company and the CC/PII. It is worth noting that the similarity score is 0.5423076923076923 when comparing the estimated values with the test dataset.
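A confusion matrix and a match-based score can be computed from the true and predicted labels as sketched below. We assume here that the similarity score is the fraction of test cases whose predicted label equals the true label; the source does not define it, and the labels in the example are invented.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def match_score(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels: 1 = claim registered, 0 = no claim.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred, classes=[0, 1]))  # [[3, 1], [1, 3]]
print(match_score(y_true, y_pred))  # 0.75
```

The diagonal of the matrix counts the correct predictions, so the score equals the sum of the diagonal divided by the number of test cases.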