
# Modeling, Correlation, and Regression

Return to the Philippines scenario introduced at the beginning of the chapter. The researcher actually began with linear regression. For example, in 2008, an analysis of the literacy index per region versus the number of significant acts of violence (by terrorists or insurgents) produced the linear model

with correlation coefficient ρ = −0.3742, so R² = 0.14 (R² is the square of the correlation coefficient). The model is plotted in Figure 6.2.

FIGURE 6.2: Philippines: Literacy v. Significant Acts of Violence, 2008

Within all the data, Figure 6.2 represents one of the “best” of the resulting overall models. Note that R² is only 0.14, which indicates a really poor linear fit, or a very weak linear relationship. We will return to this scenario later in the chapter, but suffice it to say, linear regression is not useful to explain, predict, or answer the questions the Philippine government had posed.

# Linear, Nonlinear, and Multiple Regression

Many applications are not linear. For example, a pendulum is modeled by the differential equation θ″ + ω²·sin(θ) = 0; this equation can be “linearized” by replacing sin(θ) with θ, a good linear approximation for small values of θ. Two widely used rules of thumb for relating correlation to the likelihood of a linear relationship come from Devore [D2012] for math, science, and engineering data, shown in Table 6.1, and a more liberal interpretation from Johnson [J2012] for non-math, non-science, and non-engineering data, given in Table 6.2.

TABLE 6.1: Linearity Likelihood for Math, Science, and Engineering Data; Devore

| Correlation ρ | Relationship |
|---------------|--------------|
| 0.8 < ρ < 1.0 | Strong linear relationship |
| 0.5 < ρ < 0.8 | Moderate linear relationship |
| 0.0 < ρ < 0.5 | Weak linear relationship |

TABLE 6.2: Linearity Likelihood for Non-Math, Non-Science, and Non-Engineering Data; Johnson

| Correlation ρ | Relationship |
|---------------|--------------|
| 0.5 < ρ < 1.0 | Strong linear relationship |
| 0.3 < ρ < 0.5 | Moderate linear relationship |
| 0.1 < ρ < 0.3 | Weak linear relationship |
| 0.0 < ρ < 0.1 | No linear relationship |

In our modeling efforts, we emphasize the interpretation of ρ ≈ 0. A very small correlation coefficient can be interpreted as indicating either no linear relationship or the existence of a nonlinear relationship. Many neophyte analysts fail to recognize the importance of potential nonlinear relationships in their interpretation.
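To see why ρ ≈ 0 does not rule out a relationship, consider a quick sketch (illustrative Python, not from the text; numpy assumed): a response that is completely determined by x can still have zero linear correlation.

```python
import numpy as np

# y is completely determined by x, yet the *linear* correlation is zero
# because positive and negative x contributions cancel symmetrically
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = x ** 2
rho = np.corrcoef(x, y)[0, 1]
print(rho)  # essentially 0: "no linear relationship" despite y = x^2
```

A scatterplot of these points immediately reveals the parabola that the correlation coefficient cannot see.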

Let’s look at an example where we attempt to fit a linear model to data that may not be linear.

Example 6.1. Exponential Decay.

Build a mathematical model to predict the degree of recovery after discharge for orthopedic surgical patients. There are two variables: time t in days in the hospital, and a medical prognostic index for recovery y, with larger values indicating a better prognosis. The data in Table 6.3 come from Neter et al. [NKNW1996, p. 469].

TABLE 6.3: Hospital Stay vs. Recovery Index Data

| t | 2 | 5 | 7 | 10 | 14 | 19 | 26 | 31 | 34 | 38 | 45 | 52 | 53 | 60 | 65 |
|---|---|---|---|----|----|----|----|----|----|----|----|----|----|----|----|
| y | 54 | 50 | 45 | 37 | 35 | 25 | 20 | 16 | 18 | 13 | 8 | 11 | 8 | 4 | 6 |

First, a scatterplot of the data.

The scatterplot shows a clear negative trend.

Check the correlation ρ.

Using either rule of thumb, the correlation coefficient ρ = −0.94 indicates a strong linear relationship. With this value in hand, looking at the scatterplot leads us to think we will have an excellent linear regression model.

Let’s determine the linear model, calling it RI.
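The text carries out these computations in Maple. As an illustrative stand-in (a Python sketch with numpy, not the text's code), the correlation and the least squares line can be computed as follows:

```python
import numpy as np

# Table 6.3: hospital stay t (days) and prognostic recovery index y
t = np.array([2, 5, 7, 10, 14, 19, 26, 31, 34, 38, 45, 52, 53, 60, 65], dtype=float)
y = np.array([54, 50, 45, 37, 35, 25, 20, 16, 18, 13, 8, 11, 8, 4, 6], dtype=float)

rho = np.corrcoef(t, y)[0, 1]          # correlation coefficient
b, a = np.polyfit(t, y, 1)             # least squares line y = a + b*t
sse = np.sum((y - (a + b * t)) ** 2)   # sum of squared errors
print(round(rho, 4), round(a, 4), round(b, 4), round(sse, 4))
```

The fitted intercept, slope, correlation, and SSE match the values reported below.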

It’s time for diagnostics. Calculate the residuals and percent relative errors, then plot the residuals.

Look at the worst percent relative error values.
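As a sketch of the diagnostic computation (again illustrative Python rather than the text's Maple), the residuals and percent relative errors of the linear fit are:

```python
import numpy as np

t = np.array([2, 5, 7, 10, 14, 19, 26, 31, 34, 38, 45, 52, 53, 60, 65], dtype=float)
y = np.array([54, 50, 45, 37, 35, 25, 20, 16, 18, 13, 8, 11, 8, 4, 6], dtype=float)

b, a = np.polyfit(t, y, 1)
resid = y - (a + b * t)           # residuals: observed minus fitted
pre = 100 * resid / y             # percent relative errors
worst = int(np.sum(np.abs(pre) > 20))  # how many exceed 20% in magnitude
print(worst, round(pre.max(), 1), round(pre.min(), 1))
```

Plotting `resid` against `t` shows the curved pattern discussed below.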

We’ve obtained the model y = f(t) = 46.4601 − 0.7525t. The sum of squared errors is 451.1945, the correlation is −0.94, and R², the coefficient of determination, which is the correlation coefficient squared, is 0.886. These are all indicators of a “good” linear model.

But...

Although some percent relative errors are small, others are quite large, with eight of the fifteen over 20%. The largest two positive errors are ≈ 67% and ≈ 140%, and the largest two negative errors are ≈ −45% and ≈ −57%. How much confidence would you have in making predictions with this model? The residual plot clearly shows a curved pattern. The residuals and percent relative errors show that we do not have an adequate model. Advanced courses in statistical regression show how to attempt to correct this inadequacy.

Further, suppose we need to predict the index when the time is 100 days.
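Extrapolating the fitted line to t = 100 (an illustrative Python sketch of the computation) makes the problem obvious:

```python
import numpy as np

t = np.array([2, 5, 7, 10, 14, 19, 26, 31, 34, 38, 45, 52, 53, 60, 65], dtype=float)
y = np.array([54, 50, 45, 37, 35, 25, 20, 16, 18, 13, 8, 11, 8, 4, 6], dtype=float)

b, a = np.polyfit(t, y, 1)
pred = a + b * 100               # extrapolate the linear model to t = 100
print(round(pred, 2))            # about -28.79: a negative recovery index
```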

A negative value is clearly unacceptable and makes no sense in the context of our problem since the index is always positive. This model does not pass the common sense test.

With a strong correlation of −0.94, what went wrong? The residual plot diagnostic shows a curved pattern. In many regression analysis books, the suggested first-attempt cure for a curved residual plot is adding a nonlinear term that is missing from the model. Since we only have (t, y) data points, we’ll add a nonlinear term a₂t² to the model.

Example 6.2. Multiple Linear Regression.

Fit a parabolic model f(x) = a₀ + a₁x + a₂x² to the Hospital Stay vs. Recovery Index Data of Table 6.3.

This model, although a parabola, is linear in the “variables” a₀, a₁, and a₂.

Let’s find the fit with Maple.
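The text does this with Maple; a comparable Python sketch (numpy assumed, with `qf` standing in for the text's QF) fits the quadratic and computes the diagnostics, including R² = 1 − SSE/SST:

```python
import numpy as np

t = np.array([2, 5, 7, 10, 14, 19, 26, 31, 34, 38, 45, 52, 53, 60, 65], dtype=float)
y = np.array([54, 50, 45, 37, 35, 25, 20, 16, 18, 13, 8, 11, 8, 4, 6], dtype=float)

a2, a1, a0 = np.polyfit(t, y, 2)      # quadratic least squares fit
qf = np.poly1d([a2, a1, a0])          # qf(x) = a0 + a1*x + a2*x^2
sse = np.sum((y - qf(t)) ** 2)        # sum of squared errors
sst = np.sum((y - y.mean()) ** 2)     # total sum of squares about the mean
r2 = 1 - sse / sst                    # coefficient of determination
print(round(r2, 3), round(qf(100), 2))
```

Note the positive leading coefficient `a2`: the fitted parabola eventually turns upward, which is the failure discussed below.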

Use the formula R² = 1 − SSE/SST to compute R², the coefficient of determination, for the quadratic fit.

These diagnostics look quite good. The sum of squared errors is much smaller, the coefficient of determination R² is larger, and the correlation has increased to 0.99.

Let’s plot the new residuals to see if there is still a curved pattern.

Use the model to predict the index at 100 days. We find QF(100) = 32.92. The answer is now positive, but again does not pass the common-sense test. The quadratic function is now curving upward toward positive infinity. Graph it! An upward curve is an unacceptable outcome; an increasing curve for the index would indicate that staying in the hospital for an extended time leads to a better prognosis. We certainly cannot use the model to predict values of the index for t beyond 60.

We must try a different model. Looking at the original scatterplot leads us to attempt fitting an exponential decay curve.

Example 6.3. Nonlinear Regression: Exponential Decay.

Fit an exponential decay model f(x) = a₀e^(a₁x) to the Hospital Stay vs. Recovery Index Data of Table 6.3.[1]

Maple’s NonlinearFit command from the Statistics package does not do well unless we specify reasonable estimates for the parameters of our model, here a₀ and a₁.

Test the fit at t = 100.

A ridiculous value!

Care must be taken with the selection of the initial values for the unknown parameters (see, e.g., Fox [Fox2012]). Maple yields good models when the calculations are based upon good input parameters. How can we get reasonable approximations to a₀ and a₁? Let’s use a standard technique: transform the model to make it linear. Apply logs to our model.

Taking logs gives ln(y) = ln(a₀) + a₁x. Relabel ln(y) = Y and ln(a₀) = A₀ to obtain the linear model Y = A₀ + a₁x. Now use Maple to fit the transformed model.
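The transformed fit reduces to ordinary linear regression on (t, ln y). As an illustrative Python sketch of the Maple step (numpy assumed):

```python
import numpy as np

t = np.array([2, 5, 7, 10, 14, 19, 26, 31, 34, 38, 45, 52, 53, 60, 65], dtype=float)
y = np.array([54, 50, 45, 37, 35, 25, 20, 16, 18, 13, 8, 11, 8, 4, 6], dtype=float)

Y = np.log(y)                    # transformed response Y = ln(y)
a1, A0 = np.polyfit(t, Y, 1)     # linear fit Y = A0 + a1*t
a0 = np.exp(A0)                  # undo the relabeling: a0 = exp(A0)
print(round(a0, 2), round(a1, 4))
```

The resulting a₀ and a₁ serve as the starting values for the nonlinear fit.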

Good estimates to use for the parameters are

This prediction is much closer to what we expect. A common technique used to test the result is to recompute the model with different, but close, initial values. Try it!

Note how close the nonlinear-fitted parameter values are to the estimates. The difference comes from where least squares is applied in the computation: to the logs of the data points rather than to the points themselves. Write out the general least squares formulas for the model and the transformed model to see the difference. Also, compare the model from Maple’s ExponentialFit to the one we have found.

For a nonlinear model, linear correlation and R² have no meaning. Computing the sum of squared errors gives SSE = 49.459. This value is substantially smaller than the SSE = 451.1945 obtained by the linear model. The nonlinear model appears reasonable. Now check the nonlinear model’s residual plot for patterns.
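The text uses Maple's NonlinearFit; an equivalent sketch in Python (scipy assumed, with the log-transform estimates as starting values) produces the fit and its SSE:

```python
import numpy as np
from scipy.optimize import curve_fit

t = np.array([2, 5, 7, 10, 14, 19, 26, 31, 34, 38, 45, 52, 53, 60, 65], dtype=float)
y = np.array([54, 50, 45, 37, 35, 25, 20, 16, 18, 13, 8, 11, 8, 4, 6], dtype=float)

def model(x, a0, a1):
    return a0 * np.exp(a1 * x)   # exponential decay model f(x) = a0*e^(a1*x)

# start from the log-transform estimates so the solver converges well
(a0, a1), _ = curve_fit(model, t, y, p0=(56.7, -0.038))
sse = np.sum((y - model(t, a0, a1)) ** 2)
print(round(a0, 4), round(a1, 5), round(sse, 3))
```

Note that least squares is now applied to the data points themselves, not to their logs, so the fitted parameters differ slightly from the starting estimates.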

Figure 6.3 shows the exponential decay model overlaid on the data.

FIGURE 6.3: Exponential Decay Fit to Hospital Recovery Data

The percent relative errors are also much improved—only four are larger than 20%, and none are beyond 36.3%. Verify this!
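The verification can be sketched in Python (scipy assumed, not the text's Maple), including the prediction at t = 100 discussed next:

```python
import numpy as np
from scipy.optimize import curve_fit

t = np.array([2, 5, 7, 10, 14, 19, 26, 31, 34, 38, 45, 52, 53, 60, 65], dtype=float)
y = np.array([54, 50, 45, 37, 35, 25, 20, 16, 18, 13, 8, 11, 8, 4, 6], dtype=float)

def model(x, a0, a1):
    return a0 * np.exp(a1 * x)

(a0, a1), _ = curve_fit(model, t, y, p0=(56.7, -0.038))
pre = 100 * (y - model(t, a0, a1)) / y      # percent relative errors
over20 = int(np.sum(np.abs(pre) > 20))      # count of errors beyond 20%
print(over20, round(np.abs(pre).max(), 1), round(model(100, a0, a1), 2))
```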

Using this model to predict the index when t = 100 gives y = 1.12. This new result passes the common-sense test. This model would be our recommendation for the hospital’s data.

The text’s PSMv2 library contains Nonlinear Regression, a program we wrote for nonlinear regression in Maple that prints a diagnostic report, and then returns a fit of the specified model.

We see that both coefficients are statistically significant (P-values < 0.05).

Just for illustration, change the model to a + b·exp(c·x) and rerun Nonlinear Regression.

Select the model with:

The new constant term a is not statistically significant (P-value ≫ 0.05).

Exercises

In Exercises 1 through 4, compute the correlation coefficient, and then fit the data with the given models using linear regression.

1.

| x | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| y | 1 | 1 | 2 | 2 | 4 |
• (a) y = a + bx
• (b) y = ax²
2. Data from stretching a spring:

| x (×10⁻³) | 5 | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
|-----------|---|----|----|----|----|----|----|----|----|----|-----|
| y (×10⁵)  | 0 | 19 | 57 | 94 | 134 | 173 | 216 | 256 | 297 | 343 | 390 |

• (a) y = ax
• (b) y = a + bx
• (c) y = ax²
3. Data for the ponderosa pine:

| x | 17 | 19 | 20 | 22 | 23 | 25 | 28 | 31 | 32 | 33 | 36 | 37 | 39 | 42 |
|---|----|----|----|----|----|----|----|-----|-----|-----|-----|-----|-----|-----|
| y | 19 | 25 | 32 | 51 | 57 | 71 | 113 | 140 | 153 | 187 | 192 | 205 | 250 | 260 |

• (a) y = ax
• (b) y = a + bx
• (c) y = ax²
• (d) y = ax³
• (e) y = a₃x³ + a₂x² + a₁x + a₀
4. Fit the model y = ax^(3/2) to Kepler’s planetary data:

| Body | Period (s) | Distance from sun (m) |
|------|------------|-----------------------|
| Mercury | 7.60 × 10⁶ | 5.79 × 10¹⁰ |
| Venus | 1.94 × 10⁷ | 1.08 × 10¹¹ |
| Earth | 3.16 × 10⁷ | 1.5 × 10¹¹ |
| Mars | 5.94 × 10⁷ | 2.28 × 10¹¹ |
| Jupiter | 3.74 × 10⁸ | 7.79 × 10¹¹ |
| Saturn | 9.35 × 10⁸ | 1.43 × 10¹² |
| Uranus | 2.64 × 10⁹ | 2.87 × 10¹² |
| Neptune | 5.22 × 10⁹ | 4.5 × 10¹² |

5. Fit the following data, Crew Experience vs. Total Flight Time, with a nonlinear exponential model y = ae^(bx).

| Crew Experience (months) | Total Flight Time | Crew Experience (months) | Total Flight Time |
|--------------------------|-------------------|--------------------------|-------------------|
| 91.5 | 23.79812835 | 116.1 | 33.85966964 |
| 84 | 22.32110752 | 100.6 | 25.21871384 |
| 76.5 | 18.59206246 | 85 | 25.63021418 |
| 69 | 15.06492527 | 69.4 | 18.52772164 |
| 61.5 | 14.9960869 | 53.9 | 12.88023798 |
| 80 | 24.63492506 | 112.3 | 31.42422451 |
| 72.5 | 18.87937085 | 96.7 | 26.299381 |
| 65 | 18.60416326 | 81.1 | 19.6670982 |
| 57.5 | 13.6173644 | 65.6 | 15.21024716 |
| 50 | 15.62588379 | 50 | 11.977759 |
| 103 | 28.46073261 | 120 | 36.33722835 |
| 95.5 | 30.39886433 | 104.4 | 29.35965197 |
| 88 | 25.06898139 | 88.9 | 21.78073392 |
| 80.5 | 21.30460092 | 73.7 | 21.72030963 |
| 73 | 21.98724383 | 57.8 | 16.19643014 |

Projects

Project 6.1. Write a program without using statistical programs/software to

• (a) compute the correlation between the data, and
• (b) find the least squares fit for a general proportionality model y = axⁿ (y proportional to xⁿ).

Project 6.2. Write a program without using statistical programs/software to

• (a) compute the correlation between the data, and
• (b) find the least squares fit for a general polynomial model y = a₀ + a₁x + a₂x² + ⋯ + aₙxⁿ.

• [1] Adapted from Neter et al. [NKNW1996] and Fox [Fox2012].
