Advanced Regression Techniques with Examples

In this section, we will consider sine regression, one-predictor logistic regression, and one-predictor Poisson regression. First, consider data that has an oscillating component.

Nonlinear Regression

Example 6.4. Model Shipping by Month.

Management is asking for a model that explains the behavior of tons of material shipped over time so that predictions might be made concerning future allocation of resources. Table 6.4 shows logistical supply train information collected over 20 months.

TABLE 6.4: Total Shipping Weight vs. Month

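| Month | Shipped (tons) | Month | Shipped (tons) |
|---|---|---|---|
| 1 | 20 | 11 | 19 |
| 2 | 15 | 12 | 25 |
| 3 | 10 | 13 | 32 |
| 4 | 18 | 14 | 26 |
| 5 | 28 | 15 | 21 |
| 6 | 18 | 16 | 29 |
| 7 | 13 | 17 | 35 |
| 8 | 21 | 18 | 28 |
| 9 | 28 | 19 | 22 |
| 10 | 22 | 20 | 32 |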

First, we find the correlation coefficient.

According to our rules of thumb, 0.67 is a moderate to strong value for linear correlation. Should we therefore use a linear model? Plot the data, looking for trends and patterns. Figure 6.4a shows the data as a scatterplot, while Figure 6.4b “connects the dots.”
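The correlation coefficient quoted above can be checked directly from Table 6.4. The following is a minimal sketch in plain Python (not the Maple worksheet used in the text); the helper name `pearson_r` is our own.

```python
import math

# Table 6.4: tons shipped in months 1..20
months = list(range(1, 21))
tons = [20, 15, 10, 18, 28, 18, 13, 21, 28, 22,
        19, 25, 32, 26, 21, 29, 35, 28, 22, 32]

def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(months, tons)
print(f"r = {r:.2f}, r^2 = {r * r:.2f}")   # prints: r = 0.67, r^2 = 0.45
```

For a one-predictor linear fit, the square of r equals the coefficient of determination, which is why the linear model's fit statistic is about 0.45.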

FIGURE 6.4: Shipping Data Graphs

Although linear regression can be used here, it will not capture the seasonal trends. There appears to be an oscillating pattern with a linear upward trend. For purposes of comparison, find a linear model.

The $R^2$ value of 0.45 does not indicate a strong fit to the data, as we expected. Since we need to represent oscillations with a slight linear upward trend, we’ll try a sine model with a linear component:

$$y = a_0 + a_1 x + a_2 \sin(a_3 x + a_4).$$

As noted before, good estimates of the parameters $a_i$ are necessary for obtaining a good fit. Check Maple’s fit with default parameter values! Use the linear fit from above for estimating $a_0$ and $a_1$; use your knowledge of trigonometry to estimate the other parameters.

Use NonlinearRegression from the PSMv2 package.

The coefficients’ p-values look very good, save that of the phase shift $a_4$. Plot the model with the data.

This model captures the oscillations and upward trend nicely. The sum of squared error is only SSE = 21.8. The new SSE is quite a bit smaller than that of the linear model. Clearly the model based on sine+linear regression does a much better job in predicting the trends than just using a simple linear regression.
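To see why the sine-plus-linear model must do at least as well as the straight line, here is a hedged sketch in plain Python. It is not the PSMv2 NonlinearRegression routine: instead it scans a grid of frequencies (the grid range is our assumption) and, for each one, solves an ordinary least-squares problem for an intercept, a slope, and sine and cosine terms. The sine and cosine pair is equivalent to a single sine with an amplitude and a phase shift. All function names are illustrative.

```python
import math

# Table 6.4: tons shipped in months 1..20
months = [float(m) for m in range(1, 21)]
tons = [20, 15, 10, 18, 28, 18, 13, 21, 28, 22,
        19, 25, 32, 26, 21, 29, 35, 28, 22, 32]

def solve(a, b):
    """Solve the square system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z

def fit(columns, y):
    """Least-squares coefficients and SSE for y ~ sum_j c_j * columns[j]."""
    k = len(columns)
    gram = [[sum(u * v for u, v in zip(columns[i], columns[j])) for j in range(k)]
            for i in range(k)]
    rhs = [sum(u * v for u, v in zip(columns[i], y)) for i in range(k)]
    coef = solve(gram, rhs)
    fitted = [sum(c * col[i] for c, col in zip(coef, columns)) for i in range(len(y))]
    sse = sum((obs - est) ** 2 for obs, est in zip(y, fitted))
    return coef, sse

ones = [1.0] * len(months)
_, sse_linear = fit([ones, months], tons)      # straight-line model

best = None
for step in range(30, 251):                    # frequencies 0.30 .. 2.50 rad/month
    w = step / 100.0
    cols = [ones, months,
            [math.sin(w * x) for x in months],
            [math.cos(w * x) for x in months]]
    coef, sse = fit(cols, tons)
    if best is None or sse < best[0]:
        best = (sse, w, coef)

sse_sine, a3, (a0, a1, s, c) = best
a2 = math.hypot(s, c)      # amplitude of the combined sine term
a4 = math.atan2(c, s)      # phase shift: s*sin(wx) + c*cos(wx) = a2*sin(wx + a4)
print(f"linear SSE = {sse_linear:.1f}, sine+linear SSE = {sse_sine:.1f}")
print(f"a0={a0:.2f} a1={a1:.2f} a2={a2:.2f} a3={a3:.2f} a4={a4:.2f}")
```

The values found this way can serve as starting estimates for a full nonlinear fit. Because the grid is coarse, the resulting SSE need not match the figure reported by the nonlinear routine.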

Example 6.5. Modeling Casualties in Afghanistan.

In a January 2010 news report, General Barry McCaffrey, USA, Retired, stated that the situation in Afghanistan would be getting much worse. General McCaffrey claimed casualties would double over the next year. The problem is to analyze the data to determine whether it supports his assertion.

The data that Gen. McCaffrey used for his analysis was the available 2001-2009 figures shown in Table 6.5. The table also shows casualties for 2010 and part of 2011 that were not available at the time.

TABLE 6.5: Casualties in Afghanistan by Month

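| Month | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | | 12 | 10 | 25 | 6 | 7 | 21 | 19 | 83 | 199 | 308 |
| 2 | | 13 | 9 | 17 | 5 | 17 | 39 | 18 | 52 | 247 | 245 |
| 3 | | 53 | 14 | 12 | 16 | 7 | 26 | 53 | 78 | 346 | 345 |
| 4 | | 8 | 13 | 11 | 29 | 13 | 61 | 37 | 60 | 307 | 411 |
| 5 | | 2 | 8 | 31 | 34 | 39 | 87 | 117 | 156 | 443 | |
| 6 | | 3 | 4 | 34 | 60 | 68 | 100 | 167 | 213 | 583 | |
| 7 | | 6 | 10 | 25 | 38 | 59 | 100 | 151 | 394 | 667 | |
| 8 | | 2 | 13 | 22 | 72 | 56 | 103 | 167 | 493 | 631 | |
| 9 | | 5 | 19 | 34 | 47 | 70 | 88 | 122 | 390 | 674 | |
| 10 | 5 | 8 | 5 | 38 | 27 | 68 | 131 | 90 | 348 | 631 | |
| 11 | 10 | 4 | 27 | 18 | 12 | 51 | 75 | 37 | 214 | 605 | |
| 12 | 28 | 6 | 12 | 9 | 20 | 23 | 46 | 50 | 168 | 359 | |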

First, do a quick “reasonability model.” Sum the numbers across the years that Gen. McCaffrey had data for (Table 6.6) and graph a scatterplot.

TABLE 6.6: Casualties in Afghanistan by Year

| Year | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 |
|---|---|---|---|---|---|---|---|---|
| Casualties | 122 | 144 | 276 | 366 | 478 | 877 | 1028 | 2649 |

The scatterplot’s shape suggests that we use a parabola as our “reasonability model.”

The model’s prediction, while much smaller than the actual 2010 value, is not a doubling. However, the model does suggest further analysis is required.
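The quick “reasonability model” can be sketched in plain Python by fitting a least-squares parabola to the yearly totals of Table 6.6 and extrapolating one year. This is an illustration only; the variable and function names are ours, and the text’s own fit was done in Maple.

```python
# Table 6.6: yearly casualty totals available in January 2010
years = [2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009]
totals = [122, 144, 276, 366, 478, 877, 1028, 2649]

def solve(a, b):
    """Solve the square system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z

# Work in years since 2002 to keep the normal equations well scaled
x = [float(yr - 2002) for yr in years]
columns = [[1.0] * len(x), x, [v * v for v in x]]
gram = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]
rhs = [sum(u * v for u, v in zip(ci, totals)) for ci in columns]
c0, c1, c2 = solve(gram, rhs)

def parabola(year):
    """Fitted yearly total from the least-squares parabola."""
    t = year - 2002
    return c0 + c1 * t + c2 * t * t

pred_2010 = parabola(2010)
ratio = pred_2010 / totals[-1]
print(f"predicted 2010 total: {pred_2010:.0f} ({ratio:.2f} times the 2009 total)")
```

Under this sketch the extrapolated 2010 total is roughly 3,180, about 1.2 times the 2009 figure, which agrees with the observation that the parabola predicts growth but not a doubling.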

We will focus on the four years before 2010, that is 2006 to 2009, and ask, do we expect the casualties in Afghanistan to double over the next year, 2010, based on those casualty figures?

In the same fashion as before, plot both a scatterplot and a line plot of the data available to Gen. McCaffrey over that period. See Figure 6.5. The line plot may better show trends in the data, such as an upward tendency or oscillations, that are not apparent in the scatterplot. However, a line plot can be very difficult to read or interpret when a large number of data points are connected. Good graphing is always a balancing act. After modeling the data from 2006 to 2009, we can use the 2010 values to test our model for goodness of prediction.

There are two trends apparent from the graphs. First, the data oscillates seasonally. This time, however, the oscillations grow in magnitude. We will try to capture that growth with a term of the form $x \cdot \sin(\cdot)$. Second, the data appears to have an overall upward trend. We will attempt to capture that feature with a linear component. The nonlinear model we choose is

Using the techniques described in the previous example, we fit the nonlinear model: a growing-amplitude sine plus a linear trend. We estimate the parameters from the scatterplot:

FIGURE 6.5: Afghanistan Casualties Graphs

Use our NonlinearRegression program.

The p-values for all the parameters, except the constant term, are quite good. Plot a graph to see the model capturing the oscillations and linear growth fairly well. Does the model also show the increase in amplitude? Considering the residuals will be our next diagnostic.

Now graph the residuals, looking for patterns and warning signs.

The residual plot shows no clear pattern, suggesting the model is adequate, although we note that the model did not “keep up” with the change in amplitude of the oscillations.

What does the model predict for 2010 in relation to 2009?

This model does not show a doubling effect from year four. Thus, the model does not support General McCaffrey’s hypothesis.

Consider the ratios of casualties for each month of 2009 to 2008 and then 2010 to 2009. How would this information affect your conclusions?

Logistic Regression and Poisson Regression

Often the dependent variable has special characteristics. Here we examine two notable cases: (a) logistic regression, also known as a logit model, where the dependent variable is binary, and (b) Poisson regression where the dependent variable measures integer counts that follow a Poisson distribution.

One-Predictor Logistic Regression

We begin with three one-predictor logistic regression model examples in which the dependent variable is binary, i.e., {0,1}. The logistic regression model form that we will use is

The logistic function, approximating a unit step function, gave the name logistic regression. The most general form handles dependent variables with a finite number of states.

Example 6.6. Damages versus Flight Time.

After a number of hours of flight time, equipment is either damaged or not. Let the dependent variable y be a binary variable with

and let t be the flight time in hours.

Over a reporting period, the data of Table 6.7 has been collected.

TABLE 6.7: Damage vs. Flight Time

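| t | 4 | 2 | 4 | 3 | 9 | 6 | 2 | 11 | 6 | 7 | 3 | 2 | 5 | 3 | 3 | 8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| y | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |

| t | 10 | 5 | 13 | 7 | 3 | 4 | 2 | 3 | 2 | 5 | 6 | 6 | 3 | 4 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| y | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |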

Calculate a logistic regression for damage. Now, the fit.

The analyst must decide over what intervals of x (here, the flight time t) we call the y probability a 1 or a 0 using the logistic S-curve shown from the fit.
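As an illustration of how such a fit can be computed, the sketch below estimates a one-predictor logistic model for the Table 6.7 data by maximum likelihood using Newton's method. This is a swapped-in technique written in plain Python, not the Maple routine used in the text. The model form $p(t) = 1/(1 + e^{-(b_0 + b_1 t)})$ and the 0.5 cutoff are assumptions made for this sketch.

```python
import math

# Table 6.7: flight time t (hours) and damage indicator y
t = [4, 2, 4, 3, 9, 6, 2, 11, 6, 7, 3, 2, 5, 3, 3, 8,
     10, 5, 13, 7, 3, 4, 2, 3, 2, 5, 6, 6, 3, 4, 10]
y = [1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0,
     0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0]

def prob(b0, b1, x):
    """Assumed logistic form: P(y = 1 | x)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

b0, b1 = 0.0, 0.0                      # starting values
for _ in range(25):
    p = [prob(b0, b1, x) for x in t]
    # Gradient of the log-likelihood
    g0 = sum(obs - pi for obs, pi in zip(y, p))
    g1 = sum((obs - pi) * x for obs, pi, x in zip(y, p, t))
    # Negative Hessian (information matrix) entries
    w = [pi * (1.0 - pi) for pi in p]
    h00 = sum(w)
    h01 = sum(wi * x for wi, x in zip(w, t))
    h11 = sum(wi * x * x for wi, x in zip(w, t))
    det = h00 * h11 - h01 * h01
    # Newton step: solve the 2x2 system (negative Hessian) * d = gradient
    d0 = (h11 * g0 - h01 * g1) / det
    d1 = (h00 * g1 - h01 * g0) / det
    b0, b1 = b0 + d0, b1 + d1
    if abs(d0) + abs(d1) < 1e-12:
        break

p_hat = [prob(b0, b1, x) for x in t]
t_half = -b0 / b1                      # flight time at which the fitted probability is 0.5
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, p = 0.5 at t = {t_half:.2f}")
```

With a 0.5 cutoff, flight times on one side of `t_half` would be classified as 1 and the rest as 0. A different cutoff simply moves that boundary along the S-curve.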

We switch from times to time differentials in the next example.

Example 6.7. Damages vs. Time Differentials.

Replace the times in the previous example with time differentials given in Table 6.8.

TABLE 6.8: Damage vs. Time Differentials (TD)

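| TD | 19.2 | 24.1 | -7.1 | 3.9 | 4.5 | 10.6 | -3 | 16.2 |
|---|---|---|---|---|---|---|---|---|
| y | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |

| TD | 72.8 | 28.7 | 11.5 | 56.3 | -0.5 | -1.3 | 12.9 | 34.1 |
|---|---|---|---|---|---|---|---|---|
| y | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |

| TD | 6.6 | -2.5 | 24.2 | 2.3 | 36.9 | -11.7 | 2.1 | 10.4 |
|---|---|---|---|---|---|---|---|---|
| y | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |

| TD | 9.1 | 2 | 12.6 | 18 | 1.5 | 27.3 | -8.4 |
|---|---|---|---|---|---|---|---|
| y | 0 | 0 | 0 | 1 | 0 | 1 | 0 |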

Repeat the procedure of the previous example.

Once again, the analyst must decide over what intervals of x we call the y probability a 1 or a 0 using the logistic S-curve shown above.

Dehumanization is not a new phenomenon in human conflict. Societies have dehumanized their adversaries since the beginnings of civilization in order to allow them to seize, coerce, maim, or ultimately to kill while avoiding the pain of conscience for committing these extreme, violent actions. By taking away the human traits of these opponents, adversaries are made to be objects deserving of wrath and meriting the violence as justice.[1] Dehumanization still occurs today in both developed and underdeveloped societies. The next example analyzes the impact that dehumanization has in its various forms on the outcome of a state’s ability to win a conflict.

Example 6.8. Conflict and Dehumanization.

To examine dehumanization as a quantitative statistic, we combine a data set of 25 conflicts from Erik Melander, Magnus Oberg, and Jonathan Hall’s “Uppsala Peace and Conflict” (Table 1, pg. 25)[2] with Joakim Kreutz’s “How and When Armed Conflicts End: Introducing the UCDP Conflict Termination Dataset”[3] to have a designated binary “win-lose” assessment for each conflict. We will use civilian casualties as a proxy indicator of the degree of dehumanization during the conflict. The conflicts in Table 6.9 run the gamut from high- to low-intensity in the spectrum, and include both inter- and intra-state hostilities. Therefore, the data is a reasonably general representation.

TABLE 6.9: Top 25 Worst Conflicts Estimated by War-Related Deaths

| Year | Side A | Side B | Side A: Win = 1, Lose = 0 | Civilian (1,000s) | Military (1,000s) | Percentage Civilian Deaths |
|---|---|---|---|---|---|---|
| 1946-48 | India | CPI | 1 | 800 | 0 | 100.0 |
| 1949-62 | Colombia | Mil. Junta | 1 | 200 | 100 | 66.67 |
| 1950-51 | China | Taiwan | 1 | 1,000 | * | 100.0 |
| 1950-53 | Korea | South Korea | 0 | 1,000 | 1,889 | 34.60 |
| 1954-62 | Algeria/France | FLN | 0 | 82 | 18 | 82.00 |
| 1956-59 | China | Tibet | 1 | 60 | 40 | 60.00 |
| 1956-65 | Rwanda/Tutsi | Hutu | 0 | 102 | 3 | 97.14 |
| 1961-70 | Iraq | KDP | 1 | 100 | 5 | 95.24 |
| 1963-72 | Sudan | Anya Nya | 1 | 250 | 250 | 50.00 |
| 1965-66 | Indonesia | OPM | 1 | 500 | * | 100.0 |
| 1965-75 | N. Vietnam | S. Vietnam | 1 | 1,000 | 1,058 | 48.59 |
| 1966-87 | Guatemala | FAR | 1 | 100 | 38 | 72.46 |
| 1967-70 | Nigeria | Rep. Biafra | 1 | 1,000 | 1,000 | 50.00 |
| 1967-70 | Egypt | Israel | 0 | 50 | 25 | 66.67 |
| 1971-71 | Bangladesh | JSS/SB | 1 | 1,000 | 500 | 66.67 |
| 1971-78 | Uganda | Military Fact. | 1 | 300 | 0 | 100.0 |
| 1972-72 | Burundi | Military Fact. | 1 | 80 | 20 | 80.00 |
| 1974-87 | Ethiopia | OLF | 1 | 500 | 46 | 91.58 |
| 1975-90 | Lebanon | LNM | 1 | 76 | 25 | 75.25 |
| 1975-78 | Cambodia | Khmer Rouge | 0 | 1,500 | 500 | 75.00 |
| 1975-87 | Angola | FNLA | 1 | 200 | 13 | 93.90 |
| 1978-87 | Afghanistan | USSR | 1 | 50 | 50 | 50.00 |
| 1979-87 | El Salvador | FMLN | 1 | 50 | 15 | 76.92 |
| 1981-87 | Uganda | Kikosi Maalum | 1 | 100 | 2 | 98.04 |
| 1981-87 | Mozambique | Renamo | 1 | 350 | 51 | 87.28 |

\* denotes missing values.

By including the ratio of civilian casualties to total casualties in Table 6.9, we are able to determine what percentage of casualties in each conflict is civilian. This ratio provides a quantifiable variable to analyze.

Binary logistic regression analysis is the first method to choose to analyze the interrelation of dehumanization’s effects (shown by proxy through higher percentages of civilian casualties) on the outcome of conflict as a win (1) or a loss (0). This type of regression model will allow us to infer whether or not the independent variable, civilian casualties percentage, has a statistically significant impact on the conflict’s outcome, win or lose. Using the data from Table 6.9, we assign the civilian casualty percentages to be the independent variable and Side A’s win/loss outcome of the conflict to be the binary dependent variable, then develop a binary logistic regression model. Use Maple to derive the logistic regression statistics from the model as follows.

We derive estimates of the parameters from the data. (See, e.g., Bauldry [B1997] for simple methods.) Take a = −1.9 and b = 0.05 initially.

This result does not pass the common sense test. Ask Maple for more information by increasing infolevel.

Maple’s NonlinearFit could not optimize the regression. Let’s try our NonlinearRegression.

This logistic model result appears much better at first look. However, the coefficients’ p-values tell us to have no confidence in the model. Graph the model with the data!

Analysis Interpretation: The conclusion from our analysis is that the civilian casualty percentages are not significantly correlated with whether the conflict leads to a win or a loss for Side A. Therefore, from this initial study, we can loosely conclude that dehumanization does not have a significant effect on the outcome of a state’s ability to win or lose a conflict. Further investigation will be necessary.

One-Predictor Poisson Regression

According to Devore [D2012], the simple linear regression model is defined by:

There exist parameters $\beta_0$, $\beta_1$, and $\sigma^2$ such that, for any fixed input value of $x$, the dependent variable is a random variable related to $x$ through the model equation $Y = \beta_0 + \beta_1 x + \varepsilon$. The quantity $\varepsilon$ in the model equation is the “error,” a random variable assumed to be normally distributed with mean 0 and variance $\sigma^2$.

We expand this definition to the case where the response variable $y$ is assumed to have a normal distribution with mean $\mu_y$ and variance $\sigma^2$. We found that the mean could be modeled as a function of our multiple predictor variables, $x_1, x_2, \ldots, x_k$, using the linear function $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$. The key assumptions for least squares are

  • the relationship between dependent and independent variables is linear,
  • errors are independent and normally distributed, and
  • homoscedasticity[4] of the errors.

If any assumption is not satisfied, the model’s adequacy is questioned. In first courses, patterns seen or not seen in residual plots are used to gain information about a model’s adequacy. (See [AA1979], [D2012]).

Normality Assumption Lost

In logistic and Poisson regression, the response variable’s probability lies between 0 and 1. According to Neter [NKNW1996], this constraint loses both the normality and the constant variance assumptions listed above. Without these assumptions, the F and t tests cannot be used for analyzing the regression model. When this happens, transform the model and the data with a logistic transformation of the probability p, called logit p, to map the interval [0, 1] to (−∞, +∞), eliminating the 0-1 constraint:

$$\operatorname{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k.$$

The βs can now be interpreted as increasing or decreasing the “log odds” of an event, and exp(β) (the “odds multiplier”) can be used as the odds ratio for a unit increase or decrease in the associated explanatory variable.

When the response variable is in the form of a count, we face yet another constraint. Counts are all nonnegative integers, often corresponding to rare events. Thus, a Poisson distribution (rather than a normal distribution) is more appropriate, since the Poisson has a mean greater than 0 and takes only nonnegative integer values. Recall that the Poisson distribution gives the probability of y events occurring in time period t as

Then the logarithm of the response variable is linked to a linear function of explanatory variables.

Thus

In other words, a Poisson regression model expresses the “log outcome rate” as a linear function of the predictors, sometimes called “exposure variables.”

Assumptions in Poisson Regression

There are several key assumptions in Poisson regression that are different from those in the simple linear regression model. These assumptions include that the logarithm of the dependent variable changes linearly with equal incremental increases in the exposure variable; i.e., the relationship between the logarithm of the dependent variable and the independent variables is linear. For example, if we measure risk in exposure per unit time with one group as counts per month, while another is counts per year, we can convert all exposures to strictly counts. We find that changes in the rate from combined effects of different exposures are multiplicative; i.e., changes in the log of the rate from combined effects of different exposures are additive. We find for each level of the covariates, the number of cases has variance equal to the mean, making it follow a Poisson distribution. Further, we assume the observations are independent.

Here, too, we use diagnostic methods to identify violations of the assumptions. To determine whether variances are too large or too small, plot residuals versus the mean at different levels of the predictor variables. Recall that in simple linear regression, one diagnostic of the model used plots of residuals against fits (fitted values). We will look for patterns in the residual or deviation plots as our main diagnostic tool for Poisson regression.

Poisson Regression Model

The basic model for Poisson regression is

The ith case mean response is denoted by $\mu_i$, where $\mu_i$ can be one of many defined functions (Neter [NKNW1996]). We will only use the form

We assume that the $Y_i$ are independent Poisson random variables with expected value $\mu_i$.

In order to apply regression techniques, we will use the likelihood function L (see [AA1979, D2012]) given by

Maximizing this function is intrinsically quite difficult. Instead, maximize the logarithm of the likelihood function shown below.

Numerical techniques are used to maximize ln(L) to obtain the best estimates for the coefficients of the model. Often, “good” starting points are required to obtain convergence to the maximum ([Fox2012]).
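For reference, with independent Poisson responses $Y_i$ having means $\mu_i$, the standard likelihood and log-likelihood take the following form. We assume, but cannot confirm from the text alone, that the second expression is what (6.4) refers to.

```latex
L(Y; B) = \prod_{i=1}^{n} \frac{\mu_i^{Y_i}\, e^{-\mu_i}}{Y_i!},
\qquad
\ln L(Y; B) = \sum_{i=1}^{n} Y_i \ln \mu_i \;-\; \sum_{i=1}^{n} \mu_i \;-\; \sum_{i=1}^{n} \ln (Y_i!).
```

The last sum does not involve the coefficients, so it can be dropped when searching for the maximizing values.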

The deviations or residuals will be used to analyze the model. In Poisson regression, the deviance is given by

where $\hat{\mu}_i$ is the fitted model; whenever $Y_i = 0$, we set $Y_i \cdot \ln(Y_i/\hat{\mu}_i) = 0$.
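The zero-count convention is easy to get wrong in code, so here is a small sketch in plain Python. It assumes the standard Poisson deviance, $2\sum_i [\,Y_i \ln(Y_i/\hat{\mu}_i) - (Y_i - \hat{\mu}_i)\,]$, which is consistent with the convention just stated. The function name and the toy numbers are illustrative only.

```python
import math

def poisson_deviance(observed, fitted):
    """Assumed standard form: 2 * sum[ Y*ln(Y/mu) - (Y - mu) ].

    The term Y*ln(Y/mu) is taken to be 0 whenever Y = 0.
    """
    total = 0.0
    for y_i, mu_i in zip(observed, fitted):
        log_term = y_i * math.log(y_i / mu_i) if y_i > 0 else 0.0
        total += log_term - (y_i - mu_i)
    return 2.0 * total

# Toy check: a perfect fit except for one zero count fitted as 1.0
counts = [0, 2, 5]
print(poisson_deviance(counts, [1.0, 2.0, 5.0]))   # prints 2.0
```

Without the guard for zero counts, `math.log(0 / mu)` would raise an error, even though the limiting value of the term is 0.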

Diagnostic testing of the coefficients is carried out in the same fashion as for logistic regression. To estimate the variance-covariance matrix, use the Hessian matrix $H(X)$, the matrix of second partial derivatives of the log-likelihood function $\ln(L)$ of (6.4). Then the approximated variance-covariance matrix is $VC(X, B) = -H(X)^{-1}$ evaluated at $B$, the final estimates of the coefficients. The main diagonal elements of $VC$ are estimates for the variance; the estimated standard deviations $se_B$ are the square roots of the main diagonal elements. Then perform hypothesis tests on the coefficients using t-tests. Two examples using the Hessian follow.

Example 6.9. Hessian-based Modeling.

Consider the model $y_i = \exp(b_0 + b_1 x_i)$ for $i = 1, 2, \ldots, n$.

Put this model into (6.4) to obtain

The Hessian $H = [h_{ij}]$ comes from

which gives the estimate of the variance-covariance matrix $VC = -H^{-1}$, evaluated at the final estimates. For the two-parameter model ($b_0$ and $b_1$), the Hessian is
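As a worked sketch, assume (6.4) is the standard Poisson log-likelihood $\ln L = \sum_i [\,y_i \ln \mu_i - \mu_i - \ln(y_i!)\,]$ with $\mu_i = \exp(b_0 + b_1 x_i)$. Differentiating twice with respect to $b_0$ and $b_1$ then gives:

```latex
\begin{aligned}
\ln L &= \sum_{i=1}^{n} \left[\, y_i (b_0 + b_1 x_i) - e^{b_0 + b_1 x_i} - \ln(y_i!) \,\right], \\[4pt]
\frac{\partial \ln L}{\partial b_0} &= \sum_{i=1}^{n} (y_i - \mu_i),
\qquad
\frac{\partial \ln L}{\partial b_1} = \sum_{i=1}^{n} x_i (y_i - \mu_i), \\[4pt]
H &= -\begin{pmatrix}
\sum \mu_i & \sum x_i \mu_i \\
\sum x_i \mu_i & \sum x_i^2 \mu_i
\end{pmatrix},
\qquad \mu_i = e^{b_0 + b_1 x_i}.
\end{aligned}
```

Each entry is a sum of products of predictor values weighted by the fitted means. The same pattern carries over when more predictors are added.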

Change the model slightly, adding a second independent variable with a third parameter. The model becomes $y_i = \exp(b_0 + b_1 x_{1i} + b_2 x_{2i})$ for $i = 1, 2, \ldots, n$.

Compute the new Hessian and carefully note the similarities.

The pattern in the matrix is easily extended to obtain the Hessian for a model with n independent variables.

Let $y_i = \exp(b_0 + b_1 x_{1i} + b_2 x_{2i} + \cdots + b_n x_{ni})$. The general Poisson model Hessian is

Replace the formulas with numerical values from the data. The resulting symmetric square matrix should be non-singular. Compute the inverse of the negative of the Hessian matrix to find the variance-covariance matrix $VC$. The main diagonal entries of $VC$ are the (approximate) variances of the estimated coefficients $b_i$. The square roots of the entries on the main diagonal are the estimates of $se(b_i)$, the standard error for $b_i$, to be used in the hypothesis testing with $t^* = b_i/se(b_i)$.

We now have all the information we need to build the tables for a Poisson regression that are similar to a regression program’s output.

Estimating the Regression Coefficients: Summary

The number of predictor variables plus one (for the constant term) gives the number of coefficients in the model $y_i = \exp(b_0 + b_1 x_{1i} + b_2 x_{2i} + \cdots + b_n x_{ni})$.

Estimates of the $b_i$ are the final values from the numerical search method (if it converged) used to maximize the log-likelihood function $\ln(L)$ of (6.4). The values of $se(b_i)$, the standard error estimate for $b_i$, are the square roots of the main diagonal of the variance-covariance matrix $VC = -H(X)^{-1}$. The test statistics are $t^* = b_i/se(b_i)$, and the p-value is the probability $P(T > |t^*|)$. In the summary table of Poisson regression analysis below, let m be the number of variables in the model, and let k be the number of data elements of y, the dependent variable. A summary appears in Table 6.10.

TABLE 6.10: Poisson Regression Variables Summary

| | Degrees of Freedom (df) | Deviance | Mean Deviance (MDev) | Ratio |
|---|---|---|---|---|
| Regression | | | | |
| Residual | | $D_{res}$ = result from the full model with m predictors | | |
| Total | | $D_t$ = result from reduced model $y = e^{b_0}$ | | |

Note that a prerequisite for using Poisson regression is that the dependent variable Y must be discrete counts with large numbers being a rare event.

We have chosen two data sets that have published solutions to be our basic examples. First, an outline of the procedure:

Step 0. Enter the data for X and Y.

Step 1. For Y:

  • (a) generate a histogram, and
  • (b) perform a chi-squared goodness-of-fit test for a Poisson distribution.

If Y follows a Poisson distribution, then continue. If Y is “count data,” use Poisson regression regardless of the chi-squared test.

Step 2. Compute the value of $b_0$ in the constant model $y = \exp(b_0)$ that minimizes (6.5); i.e., minimize two times the deviations.

Step 3. Compute the values of $b_0$ and $b_1$ in the model $y = \exp(b_0 + b_1 x)$ that minimize the deviation (6.5).

Step 4. Interpret the results and the odds ratio.

We’ll step through an example following the outline above.

Example 6.10. Hospital Surgeries.

A group of hospitals has collected data on the numbers of Caesarean surgeries vs. the total number of births (see Table 6.11).[5]

TABLE 6.11: Total Births vs. Caesarean Surgeries

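| Total | 3246 | 2750 | 2507 | 2371 | 1904 | 1501 | 1272 | 1080 | 1027 | 970 |
|---|---|---|---|---|---|---|---|---|---|---|
| Special | 26 | 24 | 21 | 21 | 21 | 20 | 19 | 18 | 18 | 17 |

| Total | 739 | 679 | 502 | 236 | 357 | 309 | 192 | 138 | 100 | 95 |
|---|---|---|---|---|---|---|---|---|---|---|
| Special | 17 | 16 | 16 | 16 | 16 | 15 | 14 | 14 | 13 | 13 |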

Use the hospitals’ data set to perform a Poisson regression following the steps listed above.

Step 0. Enter the data.

Step 1. Plot a histogram, and then perform a Chi-square Goodness-of-fit test on yhc, if appropriate.

(Note: Maple’s Histogram function is in the Statistics package. There are a large number of options for binning the data; we will use frequencyscale = absolute to have the heights of the bars equal to the frequency of entries in the associated bin. Collect the bin counts with TallyInto.)

Now for the chi-squared test. First, generate the predicted values from an estimated Poisson distribution.

We are ready to use Maple’s chi-squared test, ChiSquareGoodnessOfFitTest, with a significance level of 0.05. Use the summarize = embed option, as it produces the most readable output. The command is terminated with a colon; “embedding the output” makes it unnecessary to return a result.

The chi-squared test indicates that a Poisson distribution is reasonable.

Step 2. Find the best constant model $y = \exp(b_0)$.

Let’s use Maple’s LinearFit on the function $Y = \ln(y) = b_0$.

Step 3. Find the best exponential model $y = \exp(b_0 + b_1 x)$. Let’s use Maple’s ExponentialFit to find the model.

Step 4. Conclude by calculating the odds-ratio.

Use the odds-multiplier $\exp(\beta_1)$ as the approximate odds-ratio, often called risk-ratio for Poisson regression.

OR represents the potential increase resulting from one unit increase in x. (How does this concept relate to “opportunity cost” in linear programming and “marginal revenue” in economics?)
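The steps above can also be carried out by maximizing the Poisson log-likelihood directly. The sketch below is plain Python, not Maple's LinearFit or ExponentialFit, and makes two assumptions of our own: births are rescaled to thousands to keep the numbers well scaled, and the maximization uses Newton's method. It then forms standard errors from the inverse of the negative Hessian, as described in the Hessian discussion earlier.

```python
import math

# Table 6.11: total births and Caesarean ("special") surgeries per hospital
births = [3246, 2750, 2507, 2371, 1904, 1501, 1272, 1080, 1027, 970,
          739, 679, 502, 236, 357, 309, 192, 138, 100, 95]
special = [26, 24, 21, 21, 21, 20, 19, 18, 18, 17,
           17, 16, 16, 16, 16, 15, 14, 14, 13, 13]

x = [b / 1000.0 for b in births]       # assumption: predictor in thousands of births

def info_matrix(b0, b1):
    """Fitted means and the entries of the negative Hessian of ln(L)."""
    mu = [math.exp(b0 + b1 * xi) for xi in x]
    a00 = sum(mu)
    a01 = sum(m * xi for m, xi in zip(mu, x))
    a11 = sum(m * xi * xi for m, xi in zip(mu, x))
    return mu, a00, a01, a11

b0 = math.log(sum(special) / len(special))   # start from the constant model
b1 = 0.0
for _ in range(25):
    mu, a00, a01, a11 = info_matrix(b0, b1)
    g0 = sum(yi - m for yi, m in zip(special, mu))                # d lnL / d b0
    g1 = sum((yi - m) * xi for yi, m, xi in zip(special, mu, x))  # d lnL / d b1
    det = a00 * a11 - a01 * a01
    d0 = (a11 * g0 - a01 * g1) / det          # Newton step
    d1 = (a00 * g1 - a01 * g0) / det
    b0, b1 = b0 + d0, b1 + d1
    if abs(d0) + abs(d1) < 1e-12:
        break

# Variance-covariance matrix: inverse of the negative Hessian at the estimates
mu, a00, a01, a11 = info_matrix(b0, b1)
det = a00 * a11 - a01 * a01
se_b0 = math.sqrt(a11 / det)
se_b1 = math.sqrt(a00 / det)
print(f"b0 = {b0:.4f} (se {se_b0:.4f}, t* = {b0 / se_b0:.2f})")
print(f"b1 = {b1:.4f} (se {se_b1:.4f}, t* = {b1 / se_b1:.2f})")
print(f"odds multiplier per 1,000 births: {math.exp(b1):.3f}")
```

Because of the rescaling, the odds multiplier printed here applies to an increase of 1,000 births rather than a single birth. The estimates will generally differ somewhat from those of a log-transformed least-squares fit, since the two methods optimize different criteria.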

Return to the Philippines example relating literacy and violence described in the opening of this chapter.

Example 6.11. Violence in the Philippines.

The number of significant acts of violence, SigActs in Table 6.12, are integer counts.[6]

TABLE 6.12: Literacy Rate (Lit) vs. Significant Acts of Violence (SigActs), Philippines, 2008.

| Province | Lit | SigActs | Province | Lit | SigActs |
|---|---|---|---|---|---|
| Basilan | 71.6 | 29 | Dinagat Islands | 85.7 | 0 |
| Lanao del Sur | 71.6 | 30 | Surigao del Norte | 85.7 | 10 |
| Maguindanao | 71.6 | 122 | Surigao del Sur | 85.7 | 31 |
| Sulu | 71.6 | 26 | Bukidnon | 85.9 | 14 |
| Tawi-Tawi | 71.6 | 1 | Camiguin | 85.9 | 0 |
| Biliran | 72.9 | 0 | Lanao del Norte | 85.9 | 57 |
| Eastern Samar | 72.9 | 11 | Misamis Occidental | 85.9 | 8 |
| Leyte | 72.9 | 2 | Misamis Oriental | 85.9 | 7 |
| Northern Samar | 72.9 | 23 | Batanes | 86.1 | 0 |
| Southern Leyte | 72.9 | 0 | Cagayan | 86.1 | 15 |
| Western Samar | 72.9 | 64 | Isabela | 86.1 | 4 |
| North Cotabato | 78.3 | 125 | Nueva Vizcaya | 86.1 | 3 |
| Sarangani | 78.3 | 23 | Quirino | 86.1 | 0 |
| South Cotabato | 78.3 | 5 | Bohol | 86.6 | 2 |
| Sultan Kudarat | 78.3 | 18 | Cebu | 86.6 | 0 |
| Zamboanga del Norte | 79.6 | 8 | Negros Oriental | 86.6 | 27 |
| Zamboanga del Sur | 79.6 | 10 | Siquijor | 86.6 | 0 |
| Zamboanga Sibugay | 79.6 | 3 | Abra | 89.2 | 11 |
| Albay | 79.9 | 35 | Apayao | 89.2 | 0 |
| Camarines Norte | 79.9 | 12 | Benguet | 89.2 | 0 |
| Camarines Sur | 79.9 | 44 | Ifugao | 89.2 | 0 |
| Catanduanes | 79.9 | 9 | Kalinga | 89.2 | 11 |
| Masbate | 79.9 | 42 | Mountain Province | 89.2 | 0 |
| Sorsogon | 79.9 | 52 | Ilocos Norte | 91.3 | 0 |
| Compostela Valley | 81.7 | 126 | Ilocos Sur | 91.3 | 2 |
| Davao del Norte | 81.7 | 35 | La Union | 91.3 | 0 |
| Davao del Sur | 81.7 | 64 | Pangasinan | 91.3 | 0 |
| Davao Oriental | 81.7 | 40 | Aurora | 92.1 | 10 |
| Aklan | 82.6 | 0 | Bataan | 92.1 | 1 |
| Antique | 82.6 | 1 | Bulacan | 92.1 | 6 |
| Capiz | 82.6 | 8 | Nueva Ecija | 92.1 | 4 |
| Guimaras | 82.6 | 0 | Pampanga | 92.1 | 3 |
| Iloilo | 82.6 | 8 | Tarlac | 92.1 | 4 |
| Negros Occidental | 82.6 | 26 | Zambales | 92.1 | 6 |
| Marinduque | 83.9 | 0 | Batangas | 93.5 | 5 |
| Occidental Mindoro | 83.9 | 5 | Cavite | 93.5 | 0 |
| Oriental Mindoro | 83.9 | 7 | Laguna | 93.5 | 4 |
| Palawan | 83.9 | 2 | Quezon | 93.5 | 28 |
| Romblon | 83.9 | 0 | Rizal | 93.5 | 3 |
| Agusan del Norte | 85.7 | 13 | Metropolitan Manila | 94 | 1 |
| Agusan del Sur | 85.7 | 33 | | | |

The literacy data has been defined as L, the SigActs as V. Examine the histogram in Figure 6.6 to see that the data appears to follow a Poisson distribution. A goodness-of-fit test (left as an exercise) confirms the data follows a Poisson distribution.

FIGURE 6.6: Histogram of SigActs Data

Use Maple to fit the data. First, remove the three outlier data points with values well over 100, as there are other much more significant generators of violence beyond literacy levels in those regions. We cannot use Maple’s ExponentialFit, as it attempts a log-transformation of SigActs which fails due to 0 values.

Plot the fit.

We accept that the fit looks pretty good.

The odds multiplier, $e^{b_1}$, for our fit is approximately 0.946 (that is, $b_1 \approx -0.055$), which means that for every 1 unit increase in literacy we expect violence to go down by about 5.4%. This value suggests improving literacy will help ameliorate the violence.
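Because the Poisson model is multiplicative, the effect of a larger change in literacy compounds rather than adds. A small sketch in plain Python, using only the odds multiplier 0.946 reported above (the 10-point increase is our own illustrative choice):

```python
import math

odds_multiplier = 0.946                     # e^{b1} reported for the fit
b1 = math.log(odds_multiplier)              # implied slope on the log scale

# Percent change in expected SigActs per 1-point increase in literacy
percent_change_per_point = (odds_multiplier - 1.0) * 100.0

# A 10-point increase multiplies the expected count by e^{10*b1} = 0.946**10
multiplier_10_points = odds_multiplier ** 10

print(f"b1 = {b1:.4f}")
print(f"change per point: {percent_change_per_point:.1f}%")
print(f"multiplier for 10 points: {multiplier_10_points:.3f}")
```

Ten successive 5.4% reductions give a multiplier of about 0.574, a drop of roughly 43%, not the 54% that simple addition would suggest.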

Poisson Regression with Multiple Predictor Variables in Maple

Often, there are many variables that influence the outcome under study. We’ll add a second predictor to the Hospital Births problem.

Example 6.12. Hospital Births Redux.

Revisit Example 6.10 with an additional predictor: the type of hospital, rural (0) or urban (1). The new data appears in Table 6.13.

TABLE 6.13: Total Births vs. Caesarean Surgeries and Hospital Type

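| Total | 3246 | 2750 | 2507 | 2371 | 1904 | 1501 | 1272 | 1080 | 1027 | 970 |
|---|---|---|---|---|---|---|---|---|---|---|
| Special | 26 | 24 | 21 | 21 | 21 | 20 | 19 | 18 | 18 | 17 |
| Type | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

| Total | 739 | 679 | 502 | 236 | 357 | 309 | 192 | 138 | 100 | 95 |
|---|---|---|---|---|---|---|---|---|---|---|
| Special | 17 | 16 | 16 | 16 | 16 | 15 | 14 | 14 | 13 | 13 |
| Type | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |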

The data has been entered as B: Total, C: Special, and T: Type. After loading the Statistics package, define the model.

Collect the data and use NonlinearFit to fit the model.
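For comparison with the Maple fit, the two-predictor model $C = \exp(b_0 + b_1 B + b_2 T)$ can also be estimated by maximizing the Poisson log-likelihood with Newton's method. The sketch below is plain Python and is not Maple's NonlinearFit, which minimizes a least-squares criterion. Rescaling births to thousands and the helper names are our assumptions.

```python
import math

# Table 6.13: total births (B), Caesarean surgeries (C), hospital type (T)
births = [3246, 2750, 2507, 2371, 1904, 1501, 1272, 1080, 1027, 970,
          739, 679, 502, 236, 357, 309, 192, 138, 100, 95]
special = [26, 24, 21, 21, 21, 20, 19, 18, 18, 17,
           17, 16, 16, 16, 16, 15, 14, 14, 13, 13]
hospital_type = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                 1, 1, 1, 1, 1, 0, 1, 0, 0, 0]

# Design rows: constant, births in thousands, hospital type
rows = [[1.0, b / 1000.0, float(h)] for b, h in zip(births, hospital_type)]

def solve(a, b):
    """Solve the square system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z

def means(beta):
    """Fitted Poisson means exp(b0 + b1*B + b2*T) for every hospital."""
    return [math.exp(sum(bj * xj for bj, xj in zip(beta, row))) for row in rows]

k = 3
beta = [math.log(sum(special) / len(special)), 0.0, 0.0]   # start at the constant model
for _ in range(25):
    mu = means(beta)
    grad = [sum((y - m) * row[j] for y, m, row in zip(special, mu, rows))
            for j in range(k)]
    neg_hess = [[sum(m * row[i] * row[j] for m, row in zip(mu, rows))
                 for j in range(k)] for i in range(k)]
    delta = solve(neg_hess, grad)          # Newton step
    beta = [b + d for b, d in zip(beta, delta)]
    if max(abs(d) for d in delta) < 1e-12:
        break

mu = means(beta)
score = [sum((y - m) * row[j] for y, m, row in zip(special, mu, rows)) for j in range(k)]
print("b0, b1 (per 1,000 births), b2:", [round(b, 4) for b in beta])
```

The `neg_hess` matrix at the final estimates is the quantity whose inverse gives the approximate variance-covariance matrix, so the remaining statistical analysis can start from it.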

Finishing the statistical analysis of the model is left as an exercise.

Exercises

1. Adjust the nonlinear model for Afghanistan casualties, Example 6.5, to increase the amplitude of the sine term more quickly. How does the conclusion change, if at all?

2. Investigate the action of parameters in the logistic function by executing the Maple statements below using the Explore command to make an interactive graph.

3. For the data in Table 6.14 (a) plot the data and (b) state the type of regression that should be used to model the data.

TABLE 6.14: Tire Tread Data

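| Number | Hours | Tread (cm) |
|---|---|---|
| 1 | 2 | 5.4 |
| 2 | 5 | 5.0 |
| 3 | 7 | 4.5 |
| 4 | 10 | 3.7 |
| 5 | 14 | 3.5 |
| 6 | 19 | 2.5 |
| 7 | 26 | 2.0 |
| 8 | 31 | 1.6 |
| 9 | 34 | 1.8 |
| 10 | 38 | 1.3 |
| 11 | 45 | 0.8 |
| 12 | 52 | 1.1 |
| 13 | 53 | 0.8 |
| 14 | 60 | 0.4 |
| 15 | 65 | 0.6 |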

4. Assume the suspected nonlinear model for the data of Table 6.15 is

If we use a log-log transformation, we obtain

Use regression techniques to estimate the parameters a, b, and c, and statistically analyze the resulting coefficients.

TABLE 6.15: Nonlinear Data

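| x | y | z |
|---|---|---|
| 101 | 15 | 0.788 |
| 73 | 3 | 304.149 |
| 122 | 5 | 98.245 |
| 56 | 20 | 0.051 |
| 107 | 20 | 0.270 |
| 77 | 5 | 30.485 |
| 140 | 15 | 1.653 |
| 66 | 16 | 0.192 |
| 109 | 5 | 159.918 |
| 103 | 14 | 1.109 |
| 93 | 3 | 699.447 |
| 98 | 4 | 281.184 |
| 76 | 14 | 0.476 |
| 83 | 5 | 54.468 |
| 113 | 12 | 2.810 |
| 167 | 6 | 144.923 |
| 82 | 5 | 79.733 |
| 85 | 6 | 21.821 |
| 103 | 20 | 0.223 |
| 86 | 11 | 1.899 |
| 67 | 8 | 5.180 |
| 104 | 13 | 1.334 |
| 114 | 5 | 110.378 |
| 118 | 21 | 0.274 |
| 94 | 5 | 81.304 |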

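5. Using the basic linear model $y = \beta_0 + \beta_1 x$, fit the following data sets. Provide the model, the analysis of variance information, the value of $R^2$, and a residual plot.

(a)

| x | 100 | 125 | 125 | 150 | 150 | 200 | 200 |
|---|---|---|---|---|---|---|---|
| y | 150 | 140 | 180 | 210 | 190 | 320 | 280 |

| x | 250 | 250 | 300 | 300 | 350 | 400 | 400 |
|---|---|---|---|---|---|---|---|
| y | 400 | 430 | 440 | 390 | 600 | 610 | 670 |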

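(b) The following data represents change in growth where x is body weight and y is normalized metabolic rate for 13 animals.

| x | 110 | 115 | 120 | 230 | 235 | 240 | 360 |
|---|---|---|---|---|---|---|---|
| y | 198 | 173 | 174 | 149 | 124 | 115 | 130 |

| x | 362 | 363 | 500 | 505 | 510 | 515 |
|---|---|---|---|---|---|---|
| y | 102 | 95 | 122 | 112 | 98 | 96 |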

6. Use an appropriate multivariable model for the following ten observations on acceptance to graduate school. The variables are high school GPA, GRE score, whether the student attended a highly selective college, and whether the student was admitted; 1 indicates “Yes” and 0 indicates “No.”

| GPA | GRE | Selective | Admitted |
|---|---|---|---|
| 3.61 | 380 | 0 | 1 |
| 3.67 | 660 | 1 | 0 |
| 4.00 | 800 | 1 | 0 |
| 3.19 | 640 | 0 | 0 |
| 2.93 | 520 | 0 | 1 |
| 3.00 | 760 | 0 | 0 |
| 2.98 | 560 | 0 | 0 |
| 3.08 | 400 | 0 | 1 |
| 3.39 | 540 | 0 | 0 |
| 3.92 | 700 | 1 | 1 |

7. The data set for lung cancer in relation to cigarette smoking in Table 6.16 is from Frome, Biometrics 39, 1983, pg. 665-674. The number of person years in parentheses is broken down by age and daily cigarette consumption. Find and analyze an appropriate multivariate model.

TABLE 6.16: Lung Cancer Rates for Smokers and Nonsmokers

| Age \ Number smoked per day | Nonsmokers | 1-9 | 10-14 | 15-19 | 20-24 | 25-34 | > 35 |
|---|---|---|---|---|---|---|---|
| 15-20 | 1 (10366) | 0 (3121) | 0 (3577) | 0 (4319) | 0 (5683) | 0 (3042) | 0 (670) |
| 20-25 | 0 (8162) | 0 (2397) | 1 (3286) | 0 (4214) | 1 (6385) | 1 (4050) | 0 (1166) |
| 25-30 | 0 (5969) | 0 (2288) | 1 (2546) | 0 (3185) | 1 (5483) | 4 (4290) | 0 (1482) |
| 30-35 | 0 (4496) | 0 (2015) | 2 (2219) | 4 (2560) | 6 (4687) | 9 (4268) | 4 (1580) |
| 35-40 | 0 (3152) | 1 (1648) | 0 (1826) | 0 (1893) | 5 (3646) | 9 (3529) | 6 (1136) |
| 40-45 | 0 (2201) | 2 (1310) | 1 (1386) | 2 (1334) | 12 (2411) | 11 (2424) | 10 (924) |
| 45-50 | 0 (1421) | 0 (927) | 2 (988) | 2 (849) | 9 (1567) | 10 (1409) | 7 (556) |
| 50-55 | 0 (1121) | 3 (710) | 4 (684) | 2 (470) | 7 (857) | 5 (663) | 4 (255) |
| >55 | 2 (826) | 0 (606) | 3 (449) | 5 (280) | 7 (416) | 3 (284) | 1 (104) |

8. Model absences from class where:

  • School: school 1 or school 2
  • Gender: female is 1, male is 2
  • Ethnicity: categories 1 through 6
  • Math Test: score
  • Language Test: score
  • Bilingual: categories 1 through 4

| School | Gender | Ethnicity | Math Score | Lang. Score | Bilingual Status | Days Absent |
|---|---|---|---|---|---|---|
| 1 | 2 | 4 | 56.98 | 42.45 | 2 | 4 |
| 1 | 2 | 4 | 37.09 | 46.82 | 2 | 4 |
| 2 | 1 | 4 | 32.37 | 43.57 | 2 | 2 |
| 1 | 1 | 4 | 29.06 | 43.57 | 2 | 3 |
| 2 | 1 | 4 | 6.75 | 27.25 | 3 | 3 |
| 1 | 1 | 4 | 61.65 | 48.41 | 0 | 13 |
| 1 | 1 | 4 | 56.99 | 40.74 | 2 | 11 |
| 2 | 2 | 4 | 10.39 | 15.36 | 2 | 7 |
| 1 | 2 | 4 | 50.52 | 51.12 | 2 | 10 |
| 1 | 2 | 6 | 49.47 | 42.45 | 0 | 9 |

Projects

Project 1. Fit, analyze, and interpret your results for the nonlinear model $y = a t^b$ with the data provided below. Produce fit plots and residual graphs with your analysis.

Project 2. Fit, analyze, and interpret your results for an appropriate model with the data provided below. Produce fit plots and residual graphs with your analysis.

| Year | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Quantity | 15 | 150 | 250 | 275 | 270 | 280 | 290 | 650 | 1200 | 1550 | 2750 |

| t | 7 | 14 | 21 | 28 | 35 | 42 |
|---|---|---|---|---|---|---|
| y | 8 | 41 | 133 | 250 | 280 | 297 |

Project 3. Fit, analyze, and interpret your results for the nonlinear model $y = a t^b$ with the data provided by executing the Maple code below. Produce fit plots and residual graphs with your analysis. Use your phone number (no dashes or parentheses) for PN.

  • [1] See David L. Smith, Less Than Human: Why We Demean, Enslave, and Exterminate Others.
  • [2] E. Melander, M. Oberg, and J. Hall, “The ‘New Wars’ Debate Revisited: An Empirical Evaluation of the Atrociousness of ‘New Wars’,” Uppsala Univ. Press, Uppsala, 2006. Available at www.pcr.uu.se/digitalAssets/654/c_654444-l_l-k_uprp_no_9.pdf.
  • [3] J. Kreutz, “How and When Armed Conflicts End: Introducing the UCDP Conflict Termination Dataset,” J. Peace Research, 47(2), 2010, 243-250.
  • [4] Homoscedasticity: All random variables have the same finite variance.
  • [5] Adapted from “Research Methods II: Multivariate Analysis,” J. Trop. Pediatrics, Online Feature, (2009), pp. 136-143. Originally at: www.oxfordjournals.org/our_journals/tropej/online/ma_chapl3.pdf.
  • [6] Data sources: National Statistics Office (Manila, Philippines) and the Archives of the Armed Forces of the Philippines.
 