


Advanced Regression Techniques with Examples

In this section, we will consider sine regression, one-predictor logistic regression, and one-predictor Poisson regression. First, consider data that has an oscillating component.

Nonlinear Regression

Example 6.4. Model Shipping by Month. Management is asking for a model that explains the behavior of tons of material shipped over time so that predictions might be made concerning future allocation of resources. Table 6.4 shows logistical supply train information collected over 20 months. TABLE 6.4: Total Shipping Weight vs. Month
First, we find the correlation coefficient. According to our rules of thumb, 0.67 is a moderate to strong value for linear correlation. So is the model to use linear? Plot the data, looking for trends and patterns. Figure 6.4a shows the data as a scatterplot, while Figure 6.4b “connects the dots.” FIGURE 6.4: Shipping Data Graphs. Although linear regression can be used here, it will not capture the seasonal trends. There appears to be an oscillating pattern with a linear upward trend. For purposes of comparison, find a linear model. The R^2 value of 0.45 does not indicate a strong fit of the data, as we expected. Since we need to represent oscillations with a slight linear upward trend, we’ll try a sine model with a linear component. As noted before, good estimates of the parameters a_i are necessary for obtaining a good fit. Check Maple’s fit with default parameter values! Use the linear fit from above for estimating a0 and a1; use your knowledge of trigonometry to estimate the other parameters.
Use Nonlinear Regression from the PSMv2 package.
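NonlinearRegression is a Maple routine from the PSMv2 package; as an illustrative stand-in only, a comparable sine-plus-linear fit can be sketched in Python with SciPy's curve_fit. The data below are synthetic, not the Table 6.4 values, and the names a0 through a4 follow the model y = a0 + a1·x + a2·sin(a3·x + a4):

```python
import numpy as np
from scipy.optimize import curve_fit

# Model from the text: linear trend plus a seasonal oscillation.
def sine_linear(x, a0, a1, a2, a3, a4):
    return a0 + a1 * x + a2 * np.sin(a3 * x + a4)

# Hypothetical monthly data (Table 6.4 is not reproduced here): synthesize
# 20 months from known parameters plus noise so the fit can be checked.
rng = np.random.default_rng(1)
months = np.arange(1, 21)
true = (5.0, 0.4, 2.0, 2 * np.pi / 12, 0.5)   # 12-month season -> a3 = 2*pi/12
tons = sine_linear(months, *true) + rng.normal(0, 0.2, months.size)

# Good starting values matter for nonlinear fits: take the intercept and
# slope from a linear fit, then eyeball amplitude, frequency, and phase.
p0 = (4.0, 0.3, 1.5, 0.5, 0.0)
params, cov = curve_fit(sine_linear, months, tons, p0=p0)
sse = float(np.sum((tons - sine_linear(months, *params)) ** 2))
```

Note how p0 mirrors the advice above: the linear fit supplies a0 and a1, and trigonometry (a 12-month season) supplies rough values for the rest.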
The coefficients’ p-values look very good, except for the phase shift a4. Plot the model with the data. This model captures the oscillations and upward trend nicely. The sum of squared error is only SSE = 21.8, quite a bit smaller than that of the linear model. Clearly the model based on sine-plus-linear regression does a much better job of predicting the trends than a simple linear regression. Example 6.5. Modeling Casualties in Afghanistan. In a January 2010 news report, General Barry McCaffrey, USA, Retired, stated that the situation in Afghanistan would be getting much worse.^{6} General McCaffrey claimed casualties would double over the next year. The problem is to analyze the data to determine whether it supports his assertion. The data that Gen. McCaffrey used for his analysis was the available 2001-2009 figures shown in Table 6.5. The table also shows casualties for 2010 and part of 2011 that were not available at the time. TABLE 6.5: Casualties in Afghanistan by Month
First, do a quick “reasonability model.” Sum the numbers across the years that Gen. McCaffrey had data for (Table 6.6) and graph a scatterplot. TABLE 6.6: Casualties in Afghanistan by Year
The scatterplot’s shape suggests that we use a parabola as our “reasonability model.”
The model’s prediction, while much smaller than the actual 2010 value, is not a doubling. However, the model does suggest further analysis is required. We will focus on the four years before 2010, that is, 2006 to 2009, and ask: do we expect the casualties in Afghanistan to double over the next year, 2010, based on those casualty figures?
In the same fashion as before, plot both a scatterplot and a line plot of the data available to Gen. McCaffrey over that period. See Figure 6.5. The line plot may better show trends in the data, such as an upward tendency or oscillations, that are not apparent in the scatterplot. However, a line plot can be very difficult to read or interpret when there are a large number of data points connected. Good graphing is always a balancing act. After modeling the data from 2006 to 2009, we can use the 2010 values to test our model for goodness of prediction. There are two trends apparent from the graphs. First, the data oscillates seasonally. This time, however, the oscillations grow in magnitude. We will try to capture that with a term of the form a2 · x · sin(a3 x + a4), whose amplitude grows with x. Second, the data appears to have an overall upward trend. We will attempt to capture that feature with a linear component. The nonlinear model we choose is
Using the techniques described in the previous example, we fit the nonlinear model: a growing-amplitude sine plus a linear trend. We estimate the parameters from the scatterplot. FIGURE 6.5: Afghanistan Casualties Graphs. Use our NonlinearRegression program. The p-values for all the parameters, except the constant term, are quite good. Plot a graph to see the model capturing the oscillations and linear growth fairly well. Does the model also show the increase in amplitude? Considering the residuals will be our next diagnostic.
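For the growing-amplitude model, the same fitting machinery applies; only the model function changes. A sketch in Python/SciPy as a stand-in for Maple, with synthetic monthly counts rather than the Table 6.5 values:

```python
import numpy as np
from scipy.optimize import curve_fit

# Growing-amplitude model from the text: the x*sin(...) factor lets the
# oscillation widen over time; a0 + a1*x carries the upward trend.
def grow_sine(t, a0, a1, a2, a3, a4):
    return a0 + a1 * t + a2 * t * np.sin(a3 * t + a4)

# Hypothetical data: 48 months (2006-2009) built from known parameters.
rng = np.random.default_rng(7)
t = np.arange(1, 49, dtype=float)
true = (20.0, 1.5, 0.8, 2 * np.pi / 12, 1.0)
y = grow_sine(t, *true) + rng.normal(0, 2.0, t.size)

# Starting values estimated "from the scatterplot," as the text suggests.
params, _ = curve_fit(grow_sine, t, y, p0=(15.0, 1.2, 0.5, 0.52, 0.8))
residuals = y - grow_sine(t, *params)
```

Plotting `residuals` against `t` is the diagnostic step the text describes: look for leftover patterns the model failed to capture.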
Now graph the residuals, looking for patterns and warning signs. The residual plot shows no clear pattern, suggesting the model is adequate, although we note that the model did not “keep up” with the change in amplitude of the oscillations. What does the model predict for 2010 in relation to 2009?
This model does not show a doubling effect from year four. Thus, the model does not support General McCaffrey’s hypothesis. Consider the ratios of casualties for each month of 2009 to 2008 and then 2010 to 2009. How would this information affect your conclusions? Logistic Regression and Poisson Regression. Often the dependent variable has special characteristics. Here we examine two notable cases: (a) logistic regression, also known as a logit model, where the dependent variable is binary, and (b) Poisson regression, where the dependent variable measures integer counts that follow a Poisson distribution. One-Predictor Logistic Regression. We begin with three one-predictor logistic regression model examples in which the dependent variable is binary, i.e., takes values in {0,1}. The logistic regression model form that we will use is p(x) = 1/(1 + e^{-(a + bx)}). The logistic function, which approximates a unit step function, gives logistic regression its name. The most general form handles dependent variables with a finite number of states. Example 6.6. Damages versus Flight Time. After a number of hours of flight time, equipment is either damaged or not. Let the dependent variable y be a binary variable with y = 1 for damaged equipment and y = 0 otherwise,
and let t be the flight time in hours. Over a reporting period, the data of Table 6.7 has been collected. TABLE 6.7: Damage vs. Flight Time
Calculate a logistic regression for damage. Now, the fit. The analyst must decide over what intervals of x we call the y probability a 1 or a 0 using the logistic s-curve shown from the fit. We switch from times to time differentials in the next example. Example 6.7. Damages vs. Time Differentials. Replace the times in the previous example with the time differentials given in Table 6.8. TABLE 6.8: Damage vs. Time Differentials (TD)
Repeat the procedure of the previous example.
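A sketch of the one-predictor logistic fit used in these two examples, with Python/SciPy standing in for Maple and hypothetical flight-time data (Tables 6.7 and 6.8 are not reproduced here). As in the book, the fit is by nonlinear least squares on the logistic curve:

```python
import numpy as np
from scipy.optimize import curve_fit

# Logistic model from the text: P(damage) = 1 / (1 + exp(-(b0 + b1*t))).
def logistic(t, b0, b1):
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * t)))

# Hypothetical data: damage becomes more likely as flight hours accumulate.
hours  = np.array([10, 25, 40, 55, 70, 85, 100, 115, 130, 145], dtype=float)
damage = np.array([ 0,  0,  0,  0,  1,  0,   1,   1,   1,   1], dtype=float)

params, _ = curve_fit(logistic, hours, damage, p0=(-4.0, 0.05))
b0, b1 = params
threshold = -b0 / b1   # where the fitted s-curve crosses probability 0.5
```

The fitted s-curve crosses 0.5 at t = -b0/b1, one natural cutoff the analyst can use when deciding over what intervals to call the prediction a 0 or a 1.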
Once again, the analyst must decide over what intervals of x we call the y probability a 1 or a 0 using the logistic s-curve shown above. Dehumanization is not a new phenomenon in human conflict. Societies have dehumanized their adversaries since the beginnings of civilization in order to allow themselves to seize, coerce, maim, or ultimately to kill while avoiding the pain of conscience for committing these extreme, violent actions. By taking away the human traits of these opponents, adversaries are made to be objects deserving of wrath and meriting the violence as justice.^{[1]} Dehumanization still occurs today in both developed and underdeveloped societies. The next example analyzes the impact that dehumanization in its various forms has on the outcome of a state’s ability to win a conflict. Example 6.8. Conflict and Dehumanization. To examine dehumanization as a quantitative statistic, we combine a data set of 25 conflicts from Erik Melander, Magnus Oberg, and Jonathan Hall’s “Uppsala Peace and Conflict” (Table 1, pg. 25)^{[2]} with Joakim Kreutz’s “How and When Armed Conflicts End: Introducing the UCDP Conflict Termination Dataset”^{[3]} to have a designated binary “win-lose” assessment for each conflict. We will use civilian casualties as a proxy indicator of the degree of dehumanization during the conflict. The conflicts in Table 6.9 run the gamut from high- to low-intensity in the spectrum, and include both inter- and intrastate hostilities. Therefore, the data is a reasonably general representation. TABLE 6.9: Top 25 Worst Conflicts Estimated by War-Related Deaths
A dash denotes missing values. By including the ratio of civilian casualties to total casualties in Table 6.9, we are able to determine what percentage of the casualties in each conflict is civilian. This ratio provides a quantifiable variable to analyze. Binary logistic regression analysis is the first method to choose to analyze the interrelation of dehumanization’s effects (shown by proxy through higher percentages of civilian casualties) on the outcome of conflict as a win (1) or a loss (0). This type of regression model will allow us to infer whether or not the independent variable, civilian casualty percentage, has a statistically significant impact on the conflict’s outcome, win or lose. Using the data from Table 6.9, we assign the civilian casualty percentages to be the independent variable and Side A’s win/loss outcome of the conflict to be the binary dependent variable, then develop a binary logistic regression model. Use Maple to derive the logistic regression statistics from the model as follows. We derive estimates of the parameters from the data. (See, e.g., Bauldry [B1997] for simple methods.) Take a = -1.9 and b = 0.05 initially.
This result does not pass the common sense test. Ask Maple for more information by increasing infolevel. Maple’s NonlinearFit could not optimize the regression. Let’s try our NonlinearRegression program.
This logistic model result appears much better at first look. However, the coefficients’ p-values tell us to have no confidence in the model. Graph the model with the data! Analysis Interpretation: The conclusion from our analysis is that the civilian casualty percentages are not significantly correlated with whether the conflict leads to a win or a loss for Side A. Therefore, from this initial study, we can loosely conclude that dehumanization does not have a significant effect on the outcome of a state’s ability to win or lose a conflict. Further investigation will be necessary. One-Predictor Poisson Regression. According to Devore [D2012], the simple linear regression model is defined by: There exist parameters β0, β1, and σ², such that for any fixed input value of x, the dependent variable is a random variable related to x through the model equation Y = β0 + β1x + ε. The quantity ε in the model equation is the “error,” a random variable assumed to be normally distributed with mean 0 and variance σ². We expand this definition to the case where the response variable y is assumed to have a normal distribution with mean μ_y and variance σ². We found that the mean could be modeled as a function of our multiple predictor variables, x1, x2, ..., xk, using the linear function Y = β0 + β1x1 + β2x2 + ... + βkxk. The key assumptions for least squares are
If any assumption is not satisfied, the model’s adequacy is questioned. In first courses, patterns seen or not seen in residual plots are used to gain information about a model’s adequacy. (See [AA1979], [D2012].) Normality Assumption Lost. In logistic and Poisson regression, the response variable’s probability lies between 0 and 1. According to Neter [NKNW1996], this constraint costs us both the normality and the constant-variance assumptions listed above. Without these assumptions, the F and t tests cannot be used for analyzing the regression model. When this happens, transform the model and the data with a logistic transformation of the probability p, called logit p, to map the interval [0,1] to (-∞, +∞), eliminating the 0-1 constraint: logit(p) = ln(p/(1 - p)). The βs can now be interpreted as increasing or decreasing the “log odds” of an event, and exp(β) (the “odds multiplier”) can be used as the odds ratio for a unit increase or decrease in the associated explanatory variable. When the response variable is in the form of a count, we face yet another constraint. Counts are all positive integers corresponding to rare events. Thus, a Poisson distribution (rather than a normal distribution) is more appropriate, since the Poisson has a mean greater than 0 and the counts are all positive integers. Recall that the Poisson distribution gives the probability of y events occurring in time period t. Then the logarithm of the response variable is linked to a linear function of explanatory variables: ln(μ_y) = β0 + β1x1 + ... + βkxk. In other words, a Poisson regression model expresses the “log outcome rate” as a linear function of the predictors, sometimes called “exposure variables.” Assumptions in Poisson Regression. There are several key assumptions in Poisson regression that are different from those in the simple linear regression model.
These assumptions include that the logarithm of the dependent variable changes linearly with equal incremental increases in the exposure variable; i.e., the relationship between the logarithm of the dependent variable and the independent variables is linear. For example, if we measure risk in exposure per unit time with one group recorded as counts per month while another is counts per year, we can convert all exposures strictly to counts. Changes in the rate from combined effects of different exposures are multiplicative; i.e., changes in the log of the rate from combined effects of different exposures are additive. For each level of the covariates, the number of cases has variance equal to the mean, making it follow a Poisson distribution. Further, we assume the observations are independent. Here, too, we use diagnostic methods to identify violations of the assumptions. To determine whether variances are too large or too small, plot residuals versus the mean at different levels of the predictor variables. Recall that in simple linear regression, one diagnostic of the model used plots of residuals against fits (fitted values). We will look for patterns in the residual or deviation plots as our main diagnostic tool for Poisson regression. Poisson Regression Model. The basic model for Poisson regression is
The mean response of the ith case is denoted by μ_i, where μ_i can be one of many defined functions (Neter [NKNW1996]). We will only use the form μ_i = exp(β0 + β1 x_{i1} + ... + βk x_{ik}).
We assume that the Y_i are independent Poisson random variables with expected value μ_i. In order to apply regression techniques, we will use the likelihood function L (see [AA1979, D2012]) given by
Maximizing this function is intrinsically quite difficult. Instead, maximize the logarithm of the likelihood function, ln(L) = Σ_i [Y_i ln(μ_i) - μ_i - ln(Y_i!)].
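A minimal numerical sketch of maximizing ln(L) for the one-predictor model μ_i = exp(b0 + b1·x_i), in Python as a stand-in for Maple; the counts below are hypothetical. Since ln(Y_i!) does not involve the parameters, it can be dropped before optimizing:

```python
import numpy as np
from scipy.optimize import minimize

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 3.0, 6.0, 7.0, 12.0, 18.0])   # hypothetical counts

def neg_log_like(b):
    # ln L = sum(y*ln(mu) - mu) up to the constant sum(ln(y!));
    # minimizing the negative maximizes the likelihood.
    mu = np.exp(b[0] + b[1] * x)
    return float(np.sum(mu - y * np.log(mu)))

# Numerical search; as the text notes, a sensible starting point helps.
result = minimize(neg_log_like, x0=np.array([0.5, 0.3]), method="Nelder-Mead")
b0, b1 = result.x
```

The optimizer returns the coefficient estimates b0 and b1 of the Poisson model; a poor starting point can stall the search, which is exactly the convergence caveat the text raises.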
Numerical techniques are used to maximize ln(L) to obtain the best estimates for the coefficients of the model. Often, “good” starting points are required to obtain convergence to the maximum ([Fox2012]). The deviations or residuals will be used to analyze the model. In Poisson regression, the deviance is given by D = 2 Σ_i [Y_i ln(Y_i/μ̂_i) - (Y_i - μ̂_i)], where μ̂_i is the fitted value; whenever Y_i = 0, we set Y_i · ln(Y_i/μ̂_i) = 0. Diagnostic testing of the coefficients is carried out in the same fashion as for logistic regression. To estimate the variance-covariance matrix, use the Hessian matrix H(X), the matrix of second partial derivatives of the log-likelihood function ln(L) of (6.4). Then the approximated variance-covariance matrix is VC(X, B) = -H(X)^{-1} evaluated at B, the final estimates of the coefficients. The main diagonal elements of VC are estimates for the variances; the estimated standard deviations se_{b_i} are the square roots of the main diagonal elements. Then perform hypothesis tests on the coefficients using t-tests. Two examples using the Hessian follow. Example 6.9. Hessian-based Modeling. Consider the model y_i = exp(b0 + b1 x_i) for i = 1, 2, ..., n. Put this model into (6.4) to obtain the log-likelihood. The Hessian H = [h_ij] comes from its second partial derivatives,
which gives the estimate of the variance-covariance matrix VC = -H^{-1} evaluated at B. For the two-parameter model (b0 and b1), the Hessian is
Change the model slightly by adding a second independent variable with a third parameter. The model becomes y_i = exp(b0 + b1 x_{1i} + b2 x_{2i}) for i = 1, 2, ..., n. Compute the new Hessian and carefully note the similarities. The pattern in the matrix is easily extended to obtain the Hessian for a model with n independent variables. Let ŷ_i = exp(b0 + b1 x_{1i} + b2 x_{2i} + ... + bn x_{ni}). The general Poisson model Hessian follows. Replace the formulas with numerical values from the data. The resulting symmetric square matrix should be nonsingular. Compute the inverse of the negative of the Hessian matrix to find the variance-covariance matrix VC. The main diagonal entries of VC are the (approximate) variances of the estimated coefficients b_i. The square roots of the entries on the main diagonal are the estimates of se(b_i), the standard error for b_i, to be used in hypothesis testing with t* = b_i/se(b_i). We now have all the information we need to build the tables for a Poisson regression that are similar to a regression program’s output. Estimating the Regression Coefficients: Summary. The number of predictor variables plus one (for the constant term) gives the number of coefficients in the model y_i = exp(b0 + b1 x_{1i} + b2 x_{2i} + ... + bn x_{ni}). Estimates of the b_i are the final values from the numerical search method (if it converged) used to maximize the log-likelihood function ln(L) of (6.4). The values of se(b_i), the standard error estimates for b_i, are the square roots of the main diagonal of the variance-covariance matrix VC = -H(X)^{-1}. Compute the values of t* = b_i/se(b_i) and the p-value, the probability P(T > t*). In the summary table of Poisson regression analysis below, let m be the number of variables in the model, and let k be the number of data elements of y, the dependent variable. A summary appears in Table 6.10. TABLE 6.10: Poisson Regression Variables Summary
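The Hessian-to-standard-errors recipe above can be sketched numerically (Python stand-in; the x values and fitted coefficients below are hypothetical). For μ_i = exp(b0 + b1·x_i), the second partials of ln(L) give Hessian entries -Σμ_i, -Σx_iμ_i, and -Σx_i²μ_i:

```python
import numpy as np

def poisson_vc(x, b0, b1):
    """Variance-covariance estimate VC = -H^(-1) for mu_i = exp(b0 + b1*x_i)."""
    mu = np.exp(b0 + b1 * x)
    H = -np.array([[np.sum(mu),     np.sum(x * mu)],
                   [np.sum(x * mu), np.sum(x ** 2 * mu)]])
    return -np.linalg.inv(H)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
b = np.array([0.32, 0.43])          # hypothetical final coefficient estimates
VC = poisson_vc(x, *b)
se = np.sqrt(np.diag(VC))           # standard errors se(b0), se(b1)
t_stats = b / se                    # t* = b_i / se(b_i) for hypothesis tests
```

The main diagonal of VC holds the approximate variances, exactly as described; the ratios t_stats feed the t-tests in the summary table.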
Note that a prerequisite for using Poisson regression is that the dependent variable Y must be discrete counts, with large counts being rare events. We have chosen two data sets that have published solutions to be our basic examples. First, an outline of the procedure: Step 0. Enter the data for X and Y. Step 1. For Y:
If Y follows a Poisson distribution, then continue. If Y is “count data,” use Poisson regression regardless of the chi-squared test. Step 2. Compute the value of b0 in the constant model y = exp(b0) that minimizes (6.5); i.e., minimize two times the deviance. Step 3. Compute the values of b0 and b1 in the model y = exp(b0 + b1x) that minimize the deviance (6.5). Step 4. Interpret the results and the odds ratio. We’ll step through an example following the outline above. Example 6.10. Hospital Surgeries. A group of hospitals has collected data on the numbers of Caesarean surgeries vs. the total number of births (see Table 6.11).^{[5]} TABLE 6.11: Total Births vs. Caesarean Surgeries
Use the hospitals’ data set to perform a Poisson regression following the steps listed above. Step 0. Enter the data.
Step 1. Plot a histogram, and then perform a chi-square goodness-of-fit test on yhc, if appropriate. (Note: Maple’s Histogram function is in the Statistics package. There are a large number of options for binning the data; we will use frequencyscale = absolute to make the heights of the bars equal to the frequency of entries in the associated bins. Collect the bin counts with TallyInto.)
Now for the chi-squared test. First, generate the predicted values from an estimated Poisson distribution.
We are ready to use Maple’s chi-squared test, ChiSquareGoodnessOfFitTest, with a significance level of 0.05. Use the summarize = embed option, as it produces the most readable output. The command is terminated with a colon: “embedding the output” makes it unnecessary to return a result. The chi-squared test indicates that a Poisson distribution is reasonable. Step 2. Find the best constant model y = exp(b0). Let’s use Maple’s LinearFit on the function Y = ln(y) = b0. Step 3. Find the best exponential model y = exp(b0 + b1x). Let’s use Maple’s ExponentialFit to find the model. Step 4. Conclude by calculating the odds ratio. Use the odds multiplier exp(b1) as the approximate odds ratio, often called the risk ratio for Poisson regression.
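Steps 2 through 4 can also be sketched outside Maple. The following Python sketch uses hypothetical birth/Caesarean counts (not the Table 6.11 values) and fits both models by maximum likelihood; Step 1's histogram and chi-square test are omitted for brevity:

```python
import numpy as np
from scipy.optimize import minimize

births     = np.array([500, 800, 1200, 1500, 2000, 2500, 3000], dtype=float)
caesareans = np.array([ 12,  18,   25,   30,   42,   50,   61], dtype=float)

# Step 2: best constant model y = exp(b0); the MLE is exp(b0) = mean(y).
b0_const = float(np.log(caesareans.mean()))

# Step 3: best exponential model y = exp(b0 + b1*x), maximizing ln(L)
# by minimizing the negative log-likelihood (constant ln(y!) term dropped).
def neg_log_like(b):
    mu = np.exp(b[0] + b[1] * births)
    return float(np.sum(mu - caesareans * np.log(mu)))

fit = minimize(neg_log_like, x0=np.array([b0_const, 0.0]), method="Nelder-Mead")
b0, b1 = fit.x

# Step 4: the odds (risk) ratio for a one-unit increase in x.
odds_ratio = float(np.exp(b1))
```

With births measured in single units, the fitted b1 is tiny, so the odds ratio exp(b1) sits just above 1: each additional birth multiplies the expected Caesarean count by roughly that factor.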
OR represents the potential increase in the response resulting from a one-unit increase in x. (How does this concept relate to “opportunity cost” in linear programming and “marginal revenue” in economics?) Return to the Philippines example relating literacy and violence described in the opening of this chapter. Example 6.11. Violence in the Philippines. The numbers of significant acts of violence, SigActs in Table 6.12, are integer counts.^{[6]} TABLE 6.12: Literacy Rate (Lit) vs. Significant Acts of Violence (SigActs), Philippines, 2008.
The literacy data has been defined as L, the SigActs as V. Examine the histogram in Figure 6.6 to see that the data appears to follow a Poisson distribution. A goodness-of-fit test (left as an exercise) confirms this. FIGURE 6.6: Histogram of SigActs Data. Use Maple to fit the data. First, remove the three outlier data points with values well over 100, as there are other, much more significant generators of violence beyond literacy levels in those regions. We cannot use Maple’s ExponentialFit, as it attempts a log-transformation of SigActs, which fails due to 0 values. Plot the fit.
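Because a direct nonlinear fit of V = exp(b0 + b1·L) never takes a logarithm of V, zero counts cause no trouble. A sketch in Python/SciPy with hypothetical literacy/SigActs pairs (Table 6.12's actual values are not reproduced here):

```python
import numpy as np
from scipy.optimize import curve_fit

lit     = np.array([75, 80, 84, 88, 90, 93, 95, 97], dtype=float)  # % literate
sigacts = np.array([60, 45, 30, 25, 18, 10,  6,  0], dtype=float)  # counts

def expmodel(L, b0, b1):
    # mu = exp(b0 + b1*L): no log of the counts is taken, so zeros are fine.
    return np.exp(b0 + b1 * L)

params, _ = curve_fit(expmodel, lit, sigacts, p0=(8.0, -0.05))
b0, b1 = params
odds_multiplier = float(np.exp(b1))   # < 1: violence falls as literacy rises
```

A negative b1 gives an odds multiplier below 1, matching the interpretation in the text: each one-point gain in literacy shrinks the expected violence count by a fixed percentage.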
We accept that the fit looks pretty good. The odds multiplier for our fit is e^{b1} = e^{-0.0554} ≈ 0.946, which means that for every 1-unit increase in literacy we expect violence to go down by about 5.4%. This value suggests improving literacy will help ameliorate the violence. Poisson Regression with Multiple Predictor Variables in Maple. Often, there are many variables that influence the outcome under study. We’ll add a second predictor to the Hospital Births problem. Example 6.12. Hospital Births Redux. Revisit Example 6.10 with an additional predictor: the type of hospital, rural (0) or urban (1). The new data appears in Table 6.13. TABLE 6.13: Total Births vs. Caesarean Surgeries and Hospital Type
The data has been entered as B: Total, C: Special, and T: Type. After loading the Statistics package, define the model.
Collect the data and use NonlinearFit to fit the model.
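A sketch of the two-predictor fit in Python as a stand-in for Maple's NonlinearFit (the data below are hypothetical, not the Table 6.13 values). The model is μ = exp(b0 + b1·Births + b2·Type), with Type = 0 for rural and 1 for urban:

```python
import numpy as np
from scipy.optimize import minimize

births = np.array([500, 800, 1200, 1500, 2000, 2500, 3000], dtype=float)
htype  = np.array([  0,   0,    1,    0,    1,    1,    1], dtype=float)
caesar = np.array([ 12,  18,   30,   28,   48,   55,   66], dtype=float)

def neg_log_like(b):
    mu = np.exp(b[0] + b[1] * births + b[2] * htype)
    return float(np.sum(mu - caesar * np.log(mu)))

# Rough starting values: log of the mean count, a small positive slope,
# and a modest hospital-type effect.
x0 = np.array([np.log(caesar.mean()), 0.0005, 0.1])
fit = minimize(neg_log_like, x0=x0, method="Nelder-Mead")
b0, b1, b2 = fit.x
```

Exactly as in the one-predictor case, standard errors and t-statistics for b0, b1, and b2 would come from the Hessian of the log-likelihood, which is the statistical analysis left as the exercise.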
Finishing the statistical analysis of the model is left as an exercise. Exercises
3. For the data in Table 6.14 (a) plot the data and (b) state the type of regression that should be used to model the data. TABLE 6.14: Tire Tread Data
4. Assume the suspected nonlinear model for the data of Table 6.15 is as shown. If we use a log-log transformation, we obtain
Use regression techniques to estimate the parameters a, b, and c, and statistically analyze the resulting coefficients. TABLE 6.15: Nonlinear Data
(b) The following data represent change in growth, where x is body weight and y is normalized metabolic rate for 13 animals.
6. Use an appropriate multivariable model for the following ten observations of graduate school acceptance data: GRE score, high school GPA, highly selective college, and whether the student was admitted. 1 indicates “Yes” and 0 indicates “No.”
7. The data set for lung cancer in relation to cigarette smoking in Table 6.16 is from Frome, Biometrics 39, 1983, pp. 665-674. The number of person-years, in parentheses, is broken down by age and daily cigarette consumption. Find and analyze an appropriate multivariate model. TABLE 6.16: Lung Cancer Rates for Smokers and Nonsmokers
8. Model absences from class, where:
School: school 1 or school 2
Gender: female is 1, male is 2
Ethnicity: categories 1 through 6
Math Test: score
Language Test: score
Bilingual: categories 1 through 4
Projects. Project 1. Fit, analyze, and interpret your results for the nonlinear model y = a·t^b with the data provided below. Produce fit plots and residual graphs with your analysis. Project 2. Fit, analyze, and interpret your results for an appropriate model with the data provided below. Produce fit plots and residual graphs with your analysis.
Project 3. Fit, analyze, and interpret your results for the nonlinear model y = a·t^b with the data provided by executing the Maple code below. Produce fit plots and residual graphs with your analysis. Use your phone number (no dashes or parentheses) for PN.

