Linear Regression: Identifying Linear Dependencies

The least squares method consists in calculating the sum of the squared distances between the observed points and the points on the line estimated from the variables introduced in the model; the best estimate is the one that minimizes this sum. To decide which linear regression model is best suited to the available data, the partial results obtained from each of the constructed regression models are compared. If we use any of the variable selection techniques described previously, this coefficient is recalculated each time a variable is eliminated or introduced, because each such step estimates a new regression model. In all cases, the statistical package performs the operation automatically, except when we use the technique of forcing all variables to enter, in which case we estimate all possible models manually and then make the selection.

Figure 9.1 Probability distribution function for the different parameters.
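The least squares criterion described above can be illustrated for the simplest case of one predictor. The following is a minimal sketch, not the procedure used by any particular statistical package; the function names `fit_line` and `sum_squared_residuals` are hypothetical and introduced here only for illustration.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
```

For any data set, the line returned by `fit_line` gives a value of `sum_squared_residuals` that no other straight line can undercut, which is the sense in which it is the "best estimate."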

For a linear regression model to be appropriate, the response values (y) must be independent of each other and the relationship between the variables must be linear.
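In standard textbook notation, a linear relationship between the response and the predictors can be written as shown here. The symbols are assumptions introduced for illustration and are not taken from the source: y_i is the i-th response, x_ij the value of predictor j for observation i, beta_j the regression coefficients, and epsilon_i an error term assumed independent across observations.

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i,
\qquad i = 1, \dots, n
```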

Figure 9.2a shows an example of a linear dependency: the relation between Turnover and Cyber Invest. The dependency appears in steps, so the model is somewhat more complex than a single line.

Figure 9.2b shows the relationship between Turnover and CC/PII. In this case, a simple linear relationship between the two variables is not observed, so we must look for other models that are somewhat more complex.

Figure 9.2 Dependencies among the most relevant parameters.

Logistic Regression: When Managing Logical Parameters

The best logistic regression model is identified by comparing models with the likelihood ratio, which indicates, from the sample data, how much more likely one model is than the other. The likelihood ratio statistic, that is, the difference between the two models in -2 times the log-likelihood, approximately follows a Chi-square distribution with degrees of freedom equal to the difference in the number of variables between the two models. If this statistic does not show that one model is better than the other, the simpler model is considered the most appropriate. The linear regression model produces a quantitative output variable; if the output variable is qualitative, linear regression cannot be applied directly.
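The likelihood ratio comparison described above can be written compactly for two nested models. The notation is an assumption introduced here: L_0 and L_1 are the maximized likelihoods of the simpler and the larger model, and k_0 and k_1 are their numbers of variables.

```latex
\Lambda = -2\left(\ln L_0 - \ln L_1\right) \;\sim\; \chi^2_{\,k_1 - k_0}
\quad \text{(approximately, when the simpler model is adequate)}
```

A large value of the statistic relative to the Chi-square distribution favors the larger model; otherwise the simpler model is retained.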

For example, consider a qualitative variable called “Other IT Insurance,” which indicates whether additional insurance has been contracted: “YES” or “NO.” Two groups are then defined:

  • 0 = NO. They have not contracted additional insurance.
  • 1 = YES. They have contracted additional insurance.

In this case, linear regression cannot be applied to solve the problem, because the output takes only two values and we cannot draw a cloud of points as in the case of linear regression. To solve the problem, we must transform the output variable with the logistic operator.

This mathematical operator converts the group label, 0 or 1, into the probability that additional insurance has been contracted. In this way, we transform the qualitative variable into a number, namely a probability. Afterward, we can use the same structure as in linear regression: we are simply transforming the qualitative response variable into a quantitative one.
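The transformation described above is usually implemented with the logistic (sigmoid) function, which maps a linear combination of the predictors to a value between 0 and 1. The following is a minimal sketch; the function names and the parameters `intercept` and `coefficients` are hypothetical, chosen here for illustration.

```python
import math


def logistic(z):
    """Map any real number z to a probability strictly between 0 and 1."""
    # Two branches keep exp() from overflowing for large |z|
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_probability(x, intercept, coefficients):
    """Probability of class 1 (YES) for one observation x (a list of predictors)."""
    # Same linear structure as in linear regression, passed through the logistic function
    z = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    return logistic(z)
```

A predicted probability can then be turned back into a group label, for example by assigning 1 (YES) when the probability is at least 0.5; that threshold is a common convention rather than something fixed by the model.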

We applied logistic regression to our variable “Other IT Insurance” and obtained the results depicted in Figure 9.3a and Figure 9.3b. These figures show the main features of the logistic regression for Other IT Insurance. Among those features, we highlight the precision and recall for the micro average, macro average, and weighted average. Precision measures accuracy given that a class label has been predicted. It is defined as Precision = TP/(TP + FP), where TP stands for true positive and FP stands for false positive. Recall is the true positive rate. It is defined as Recall = TP/(TP + FN), where FN stands for false negative.
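The two definitions above can be computed directly from the actual and predicted labels. This is a minimal sketch with hypothetical function names; the labels in the usage note are made up for illustration and are not the data behind Figure 9.3.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true positives, false positives and false negatives."""
    tp = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
    return tp, fp, fn


def precision(tp, fp):
    """Precision = TP / (TP + FP); returns 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Recall = TP / (TP + FN); returns 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0
```

With the made-up labels `actual = [1, 1, 0, 0, 0, 1]` and `predicted = [1, 1, 1, 1, 0, 1]`, the counts are TP = 3, FP = 2, FN = 0, so recall is 1.0 while precision is only 0.6.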

Figure 9.3 Features of the logistic regression.

Recall takes the value 1.0, so there are no false negative predictions. On the other hand, a considerable number of false positives are produced, revealing that the prediction accuracy is low.

To interpret the result of the logistic regression model, we must resort to the concept of “odds,” one of the measures available to quantify risk. The odds are defined as the quotient of the probability of presenting a characteristic and the probability of not presenting it, that is, the ratio of the number of cases showing the characteristic to the number of cases that do not.
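Both forms of the definition above, from a probability and from case counts, give the same number. A minimal sketch with hypothetical function names:

```python
def odds_from_probability(p):
    """Odds = p / (1 - p), for a probability p with 0 <= p < 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1.0 - p)


def odds_from_counts(cases_with, cases_without):
    """Odds as the ratio of cases showing the characteristic to cases that do not."""
    if cases_without <= 0:
        raise ValueError("cases_without must be positive")
    return cases_with / cases_without
```

For instance, if 30 of 40 cases show the characteristic, the probability is 0.75 and the odds are 3 to 1, whichever function is used.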

The performance of a given model is explained through evaluation metrics. At the most basic level, assessing a model means comparing the actual values with the predicted values; the difference between them determines the accuracy of the regression model. Hence, evaluation metrics play a relevant role in model development: they provide insight into how the accuracy can be improved.

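Comparing actual and predicted values can be made concrete with two common error measures. The source does not name specific metrics, so the choice of mean absolute error and root mean squared error here is an assumption, and the function names are hypothetical.

```python
import math


def _check_lengths(actual, predicted):
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and of equal length")


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    _check_lengths(actual, predicted)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference; penalizes large errors more."""
    _check_lengths(actual, predicted)
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)
```

Lower values of either measure indicate predictions closer to the actual values; comparing them across candidate models is one way to see whether a change improved the fit.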