
# Elementary Tasks in Frequentist Inference

This section reviews the elementary frequentist tasks of hypothesis testing and producing confidence intervals.

## Hypothesis Testing

Statisticians are often asked to choose among potential hypotheses about the mechanism generating a set of data. This choice is often phrased as between a null hypothesis, generally implying the absence of a potential effect, and an alternative hypothesis, generally the presence of a potential effect. These hypotheses in turn are expressed in terms of sets of probability distributions, or equivalently, in terms of restrictions on a summary (such as the median) of a probability distribution. A hypothesis test is a rule that takes a data set and returns “Reject the null hypothesis that $\theta = \theta^0$” or “Do not reject the null hypothesis.”

In many applications studied in this manuscript, hypothesis tests are implemented by constructing a test statistic $T$, depending on data, and a constant $t^\circ$, such that the statistician rejects the null hypothesis (1.6) if $T \ge t^\circ$, and fails to reject it otherwise. The constant $t^\circ$ is called a critical value, and the collection of data sets in

$$\{\text{data} \mid T(\text{data}) \ge t^\circ\} \tag{1.5}$$

for which the null hypothesis is rejected is called the critical region.

### One-Sided Hypothesis Tests

For example, if data represent the changes in some physiological measure after receiving some therapy, measured on subjects acting independently, then a null hypothesis might be that each of the changes in measurements comes from a distribution with median zero, and the alternative hypothesis might be that each of the changes in measurements comes from a distribution with median greater than zero. In symbols, if $\theta$ represents the median of the distribution of changes, then the null hypothesis is $\theta = 0$, and the alternative is $\theta > 0$.

Such an alternative hypothesis is typically called a one-sided hypothesis. The null hypothesis might then be thought of as the set of all possible distributions for observations with median zero, and the alternative is the set of all possible distributions for observations with positive median. If larger values of $\theta$ make larger values of $T$ more likely, and smaller values less likely, then a test rejecting the null hypothesis for data in (1.5) is reasonable.

Because null hypotheses are generally smaller in dimension than alternative hypotheses, frequencies of errors are generally easier to control for null hypotheses than for alternative hypotheses.

Tests are constructed so that, in cases in which the null hypothesis is actually true, it is rejected with no more than a fixed probability. This probability is called the test level or type I error rate, and is commonly denoted $\alpha$.

Hence the critical value in such cases is defined to be the smallest value $t^\circ$ satisfying

$$P_{\theta^0}[T \ge t^\circ] \le \alpha. \tag{1.7}$$

When the distribution of $T$ is continuous, the critical value satisfies (1.7) with $\le$ replaced by equality. Many applications in this volume feature test statistics with a discrete distribution; in this case, the $\le$ in (1.7) is generally $<$.
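As an illustration of the discrete case, the following minimal sketch searches for the smallest critical value whose upper tail probability is no larger than the level. It assumes, purely for illustration, that $T$ counts positive changes among $n$ subjects, so that $T$ is binomial with success probability 1/2 under the null hypothesis of median zero; the function names are hypothetical.

```python
from math import comb


def binom_upper_tail(t, n, p):
    """Return P[T >= t] for T ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(max(t, 0), n + 1))


def one_sided_critical_value(n, p0, alpha):
    """Smallest integer t with P_{p0}[T >= t] <= alpha."""
    # t = n + 1 always qualifies (empty tail), so the loop always returns.
    for t in range(n + 2):
        if binom_upper_tail(t, n, p0) <= alpha:
            return t


n, alpha = 20, 0.05
t_crit = one_sided_critical_value(n, 0.5, alpha)
attained_level = binom_upper_tail(t_crit, n, 0.5)
print(t_crit, attained_level)
```

With these illustrative inputs the attained level is strictly below the nominal level, which is the typical situation for a discrete statistic.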

The other type of possible error occurs when the alternative hypothesis is true, but the null hypothesis is not rejected. The probability of erroneously failing to reject the null hypothesis is called the type II error rate, and is denoted $\beta$. More commonly, the behavior of the test under an alternative hypothesis is described in terms of the probability of a correct answer, rather than of an error; this probability is called power, and equals $1 - \beta$.

One might attempt to control power as well, but, unfortunately, in the common case in which an alternative hypothesis contains probability distributions arbitrarily close to those in the null hypothesis, the type II error rate will come arbitrarily close to one minus the test level, which is quite large. Furthermore, for a fixed sample size and mechanism for generating data, once a particular distribution in the alternative hypothesis is selected, the smallest possible type II error rate is fixed, and cannot be independently controlled. Hence generally, tests are constructed primarily to control level. Under this paradigm, then, a test is constructed by specifying $\alpha$, choosing a critical value to give the test this type I error, and determining whether this test rejects the null hypothesis or fails to reject the null hypothesis.

In the one-sided alternative hypothesis formulation $\{\theta > \theta^0\}$, the investigator is, at least in principle, interested in detecting departures from the null hypothesis that vary in proximity to the null hypothesis. (The same observation will hold for two-sided tests in the next subsection.) For planning purposes, however, investigators often pick a particular value within the alternative hypothesis. This particular value might be the minimal value of practical interest, or a value that other investigators have estimated. They then calculate the power at this alternative, to ensure that it is large enough to meet their needs. A power that is too small indicates that there is a substantial chance that the investigator’s alternative hypothesis is correct, but that they will fail to demonstrate it. Powers near 80% are typical targets.

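Consider a test with a null hypothesis of form $\theta = \theta^0$ and an alternative hypothesis of form $\theta = \theta_A$, using a statistic $T$ that is approximately Gaussian, with expectation $\theta^0$ and standard deviation $\sigma_0$ under the null hypothesis, and with expectation $\theta_A$ and standard deviation $\sigma_A$ under the alternative hypothesis. Test with a level $\alpha$, and without loss of generality assume that $\theta_A > \theta^0$. In this case, the critical value is approximately $t^\circ = \theta^0 + \sigma_0 z_\alpha$. Here $z_\alpha$ is the number such that $\Phi(z_\alpha) = 1 - \alpha$, for $\Phi$ the standard Gaussian distribution function. A common level for such a one-sided test is $\alpha = 0.025$; $z_{0.025} = 1.96$. The power for such a one-sided test is

$$1 - \Phi\left(\frac{\theta^0 + \sigma_0 z_\alpha - \theta_A}{\sigma_A}\right). \tag{1.8}$$

One might plan an experiment by substituting the null hypothesis values $\theta^0$ and $\sigma_0$ and the alternative hypothesis values $\theta_A$ and $\sigma_A$ into (1.8), and verifying that this power is high enough to meet the investigator’s needs; alternatively, one might require power to be $1 - \beta$, and solve for the effect size necessary to give this power. This effect size is

$$\theta_A - \theta^0 = \sigma_0 z_\alpha + \sigma_A z_\beta.$$

One can then ask whether this effect size is plausible. More commonly, $\sigma_0$ and $\sigma_A$ both may be made to depend on a parameter representing sample size, with both decreasing to zero as sample size increases. Then (1.8) is solved for sample size.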
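The planning calculations described above can be sketched with the Python standard library. This is a minimal illustration under the Gaussian approximation, not a definitive implementation: the function names are hypothetical, and the sample size routine additionally assumes that both standard deviations shrink like a per-observation value divided by $\sqrt{n}$, which is one common way (but not the only way) for them to depend on sample size.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD = NormalDist()  # standard Gaussian


def z(a):
    """Return z_a, the number with Phi(z_a) = 1 - a."""
    return STD.inv_cdf(1 - a)


def power_one_sided(theta0, theta_a, sigma0, sigma_a, alpha):
    """Approximate power of the one-sided level-alpha test, as in (1.8)."""
    t_crit = theta0 + sigma0 * z(alpha)
    return 1 - STD.cdf((t_crit - theta_a) / sigma_a)


def detectable_alternative(theta0, sigma0, sigma_a, alpha, beta):
    """Alternative value theta_A at which the power equals 1 - beta."""
    return theta0 + sigma0 * z(alpha) + sigma_a * z(beta)


def sample_size(delta, s0, s_a, alpha, beta):
    """Smallest n giving power >= 1 - beta at effect size delta,
    ASSUMING sigma0 = s0 / sqrt(n) and sigma_a = s_a / sqrt(n)."""
    return ceil(((s0 * z(alpha) + s_a * z(beta)) / delta) ** 2)


n = sample_size(delta=0.5, s0=1.0, s_a=1.0, alpha=0.025, beta=0.2)
achieved = power_one_sided(0.0, 0.5, 1 / sqrt(n), 1 / sqrt(n), 0.025)
print(n, achieved)
```

Because the sample size is rounded up to an integer, the achieved power is slightly above the target rather than exactly equal to it.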

### Two-Sided Hypothesis Tests

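In contrast to one-sided alternatives, consider a two-sided hypothesis $\theta \ne \theta^0$. In this case, one often uses a test that rejects the null hypothesis for data sets in

$$\{\text{data} \mid T \le t_L^\circ \text{ or } T \ge t_U^\circ\}. \tag{1.10}$$

In this case, there are two critical values, chosen so that

$$\alpha \ge P_{\theta^0}[T \le t_L^\circ \text{ or } T \ge t_U^\circ] = P_{\theta^0}[T \le t_L^\circ] + P_{\theta^0}[T \ge t_U^\circ], \tag{1.11}$$

and so that the first inequality is close to equality.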

Many of the statistics constructed in this volume require an arbitrary choice by the analyst of direction of effect, and choosing the direction of effect differently typically changes the sign of $T$. In order to keep the analytical results invariant to this choice of direction, the critical values are chosen to make the two final probabilities in (1.11) equal. Then the critical values solve the equations

$$P_{\theta^0}[T \le t_L^\circ] \le \alpha/2, \qquad P_{\theta^0}[T \ge t_U^\circ] \le \alpha/2, \tag{1.12}$$

with $t_U^\circ$ chosen as small as possible, and $t_L^\circ$ chosen as large as possible, consistent with (1.12). Comparing with (1.7), the two-sided critical value is calculated exactly as is the one-sided critical value for an alternative hypothesis in the appropriate direction, and for a test level half that of the two-sided test. Hence a two-sided test of level 0.05 is constructed in the same way as two one-sided tests of level 0.025.
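The equal-tails construction can be sketched for a discrete statistic as follows. The example assumes, for illustration only, a statistic that is binomial with success probability 1/2 under the null hypothesis; the function name is hypothetical.

```python
from math import comb


def two_sided_critical_values(n, p0, alpha):
    """Return (t_lower, t_upper) for T ~ Binomial(n, p0): the largest
    t_lower with P[T <= t_lower] <= alpha/2 and the smallest t_upper
    with P[T >= t_upper] <= alpha/2."""
    pmf = [comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(n + 1)]

    def lower_tail(t):  # P[T <= t]; zero when t = -1
        return sum(pmf[: t + 1])

    def upper_tail(t):  # P[T >= t]; zero when t = n + 1
        return sum(pmf[t:])

    t_lower = max(t for t in range(-1, n + 1) if lower_tail(t) <= alpha / 2)
    t_upper = min(t for t in range(n + 2) if upper_tail(t) <= alpha / 2)
    level = lower_tail(t_lower) + upper_tail(t_upper)
    return t_lower, t_upper, level


t_lower, t_upper, level = two_sided_critical_values(20, 0.5, 0.05)
print(t_lower, t_upper, level)
```

In this example the upper critical value coincides with the one-sided critical value at level 0.025, as the comparison with (1.7) suggests.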

Often the critical region implicit in (1.11) can be represented by creating a new statistic that is large when $T$ is either large or small. That is, one might set $W = |T - \mathrm{E}_{\theta^0}[T]|$. In this case, if $t_L^\circ$ and $t_U^\circ$ are symmetric about $\mathrm{E}_{\theta^0}[T]$, then the two-sided critical region might be expressed as

$$\{W \ge w^\circ\} \tag{1.13}$$

for $w^\circ = t_U^\circ - \mathrm{E}_{\theta^0}[T]$. Alternatively, one might define $W = |T - \mathrm{E}_{\theta^0}[T]|^2$, and, under the same symmetry condition, use as the critical region (1.13) for $w^\circ = (t_U^\circ - \mathrm{E}_{\theta^0}[T])^2$. In the absence of such a symmetry condition, $w^\circ$ may be calculated from the distribution of $W$ directly, by choosing $w^\circ$ to make the probability of the set in (1.13) equal to $\alpha$.

Statistics $T$ for which (1.10) is a reasonable critical region are inherently one-sided, since the two-sided test is constructed from one-sided tests combining evidence pointing in opposite directions. Similarly, statistics $W$ for which (1.13) is a reasonable critical region for the two-sided alternative are inherently two-sided.

Power for the two-sided test is the same probability as calculated in (1.11), with $\theta_A$ substituted for $\theta^0$, and power substituted for $\alpha$. Again, assume that large values of $\theta$ make larger values of $T$ more likely. Then, for alternatives $\theta_A$ greater than $\theta^0$, the first probability added in

$$1 - \beta = P_{\theta_A}[T \le t_L^\circ] + P_{\theta_A}[T \ge t_U^\circ]$$

is quite small, and is typically ignored for power calculations. Additionally, rejection of the null hypothesis because the evidence is in the opposite direction of that anticipated will result in conclusions from the experiment not comparable to those for which the power calculation is constructed. Hence power for the two-sided test is generally approximated as the power for the one-sided test with level half that of the two-sided test, and $\alpha$ in (1.8) is often 0.025, corresponding to half of the two-sided test level.

Some tests to be constructed in this volume may be expressed as $W = \sum_{j=1}^k U_j^2$ for variables $U_j$ which are, under the null hypothesis, approximately standard Gaussian and independent; furthermore, in such cases, the critical region for such tests is often of the form $\{W \ge w^\circ\}$. In such cases, the test of level $\alpha$ rejects the null hypothesis when $W \ge \chi^2_{k,\alpha}$, for $\chi^2_{k,\alpha}$ the $1 - \alpha$ quantile of the $\chi^2_k$ distribution. If the standard Gaussian approximation for the distribution of $U_j$ is only approximately correct, then the resulting test will have level $\alpha$ approximately, but not exactly.

If, under the alternative hypothesis, the variables $U_j$ have expectations $\mu_j$ and standard deviations $\varsigma_j$, the alternative distribution of $W$ will be a complicated weighted sum of $\chi^2_1(\mu_j^2)$ variables. Usually, however, the impact of the move from the null distribution to the alternative distribution is much higher on the component expectations than on the standard deviations, and one might treat these alternative standard deviations as fixed at 1. With this simplification, the sampling distribution of $W$ under the alternative is $\chi^2_k\left(\sum_{j=1}^k \mu_j^2\right)$, the non-central chi-square distribution.

### P-values

Alternatively, one might calculate a test statistic, and determine the test level at which one transitions from rejecting to not rejecting the null hypothesis. This quantity is called a p-value. For a one-sided test with critical region of form (1.5), the p-value is given by

$$P_{\theta^0}[T \ge t_{\text{obs}}],$$

for $t_{\text{obs}}$ the observed value of the test statistic. For two-sided critical values of form (1.10), with condition (1.12), the p-value is given by

$$2 \min\left(P_{\theta^0}[T \ge t_{\text{obs}}],\; P_{\theta^0}[T \le t_{\text{obs}}]\right).$$

These p-values are interpreted as leading to rejection of the null hypothesis when they are as low as or lower than the test level specified in advance by the investigator before data collection.
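A minimal sketch of both p-value calculations follows, again assuming for illustration a statistic that is binomial under the null hypothesis. The function names are hypothetical, and capping the two-sided value at 1 is a convention adopted here for the discrete case, where doubling the smaller tail can exceed 1; it is not taken from the text.

```python
from math import comb


def binom_pmf(k, n, p):
    """Return P[T = k] for T ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def one_sided_p_value(t_obs, n, p0):
    """P_{p0}[T >= t_obs]: p-value for the critical region {T >= t}."""
    return sum(binom_pmf(k, n, p0) for k in range(t_obs, n + 1))


def two_sided_p_value(t_obs, n, p0):
    """Twice the smaller tail probability, capped at 1 (a convention)."""
    upper = sum(binom_pmf(k, n, p0) for k in range(t_obs, n + 1))
    lower = sum(binom_pmf(k, n, p0) for k in range(0, t_obs + 1))
    return min(1.0, 2 * min(upper, lower))


p_one = one_sided_p_value(15, 20, 0.5)
p_two = two_sided_p_value(15, 20, 0.5)
print(p_one, p_two)
```

With a level of 0.05 fixed in advance, both illustrative p-values would lead to rejection of the null hypothesis.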

Inferential procedures that highlight p-values are indicative of the inferential approach of Fisher (1925), while those that highlight pre-specified test levels and powers are indicative of the approach of Neyman and Pearson (1933). I refer readers to a thorough survey (Lehmann, 1993), and note here only that while I find the pre-specified test level arguments compelling, problematic examples leading to undesirable interpretations of p-values are rare using the techniques developed in this volume, and, more generally, the contrasts between techniques advocated by these schools of thought are not central to the questions investigated here.

## Confidence Intervals

A confidence interval of level $1 - \alpha$ for parameter $\theta$ is defined as a set $(L, U)$ such that $L$ and $U$ depend on data, and such that for any $\theta$,

$$P_\theta[L < \theta < U] \ge 1 - \alpha.$$

The most general method for constructing a confidence interval is test inversion. For every possible null value $\theta^0$, find a test of the null hypothesis $\theta = \theta^0$, versus the two-sided alternative, of level no larger than $\alpha$. Then the confidence set is

$$\{\theta^0 \mid \text{the null hypothesis } \theta = \theta^0 \text{ is not rejected}\}. \tag{1.16}$$

In many cases, (1.16) is an interval. In such cases, one attempts to determine the lower and upper bounds of the interval, either analytically or numerically.

Often, such tests are phrased in terms of a quantity $W(\theta)$ depending on both the data and the parameter, such that the test rejects the null hypothesis $\theta = \theta^0$ if and only if $W(\theta^0) \ge w^\circ(\theta^0)$ for some critical value $w^\circ(\theta^0)$ that might depend on the null hypothesis.

### P-value Inversion

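One might construct confidence intervals through tail probability inversion. Suppose that one can find a univariate statistic $T$ whose distribution depends on the unknown parameter $\theta$, such that potential one-sided p-values are monotonic in $\theta$ for each potential statistic value $t$. Typical applications have

$$P_\theta[T \ge t] \text{ nondecreasing in } \theta, \qquad P_\theta[T \le t] \text{ nonincreasing in } \theta, \qquad \text{for all } t, \tag{1.17}$$

with probabilities in (1.17) continuous in $\theta$. Let $t$ be the observed value of $T$. Under (1.17),

$$\{\theta \mid P_\theta[T \ge t] > \alpha/2,\; P_\theta[T \le t] > \alpha/2\}$$

is an interval, of form $(\theta_L, \theta_U)$, with endpoints satisfying

$$P_{\theta_L}[T \ge t] = \alpha/2, \qquad P_{\theta_U}[T \le t] = \alpha/2.$$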

There may be $t$ such that the equation $P_{\theta_L}[T \ge t] = \alpha/2$ has no solution, because $P_\theta[T \ge t] > \alpha/2$ for all $\theta$. In such cases, take $\theta_L$ to be the lower bound on possible values for $\theta$. For example, if $\pi \in [0, 1]$, and $T \sim \text{Bin}(n, \pi)$, then $P_\pi[T \ge 0] = 1 > \alpha/2$ for all $\pi$, $P_\pi[T \ge 0] = \alpha/2$ has no solution, and $\pi_L = 0$. Alternatively, if $\theta$ can take any real value, and $T \sim \text{Bin}(n, \exp(\theta)/(1 + \exp(\theta)))$, then $P_\theta[T \ge 0] = \alpha/2$ has no solution, and $\theta_L = -\infty$. Similarly, there may be $t$ such that the equation $P_{\theta_U}[T \le t] = \alpha/2$ has no solution, because $P_\theta[T \le t] > \alpha/2$ for all $\theta$. In such cases, take $\theta_U$ to be the upper bound on possible values for $\theta$.

Construction of intervals for the binomial proportion represents a simple example in which p-values may be inverted (Clopper and Pearson, 1934).
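The binomial case can be sketched by solving the two tail probability equations numerically. The code below is a minimal illustration using bisection, with hypothetical function names; it relies on the upper tail probability increasing in $\pi$ and the lower tail probability decreasing in $\pi$, and uses the boundary conventions for $t = 0$ and $t = n$ described above.

```python
from math import comb


def upper_tail(t, n, p):
    """P[T >= t] for T ~ Binomial(n, p); increasing in p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(t, n + 1))


def lower_tail(t, n, p):
    """P[T <= t] for T ~ Binomial(n, p); decreasing in p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(0, t + 1))


def solve_increasing(f, target, lo=0.0, hi=1.0, iterations=100):
    """Bisection for f(p) = target when f is increasing on [lo, hi]."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def tail_inversion_interval(t, n, alpha):
    """Interval for the binomial proportion by inverting tail probabilities."""
    # No solution exists at the boundaries, so use the parameter bounds.
    pi_lower = 0.0 if t == 0 else solve_increasing(
        lambda p: upper_tail(t, n, p), alpha / 2)
    pi_upper = 1.0 if t == n else solve_increasing(
        lambda p: -lower_tail(t, n, p), -alpha / 2)
    return pi_lower, pi_upper


pi_lower, pi_upper = tail_inversion_interval(3, 10, 0.05)
print(pi_lower, pi_upper)
```

When $t = 0$ the upper endpoint has the closed form $1 - (\alpha/2)^{1/n}$, which gives a convenient check on the numerical solution.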

### Test Inversion with Pivotal Statistics

Confidence interval construction is simplified when there exists a random quantity, generally involving an unknown parameter $\theta$, with a distribution that does not depend on $\theta$. Such a quantity is called a pivot. For instance, in the case of independent and identically distributed observations with average $\bar{X}$ and standard deviation $s$ from a Gaussian distribution with expectation $\theta$ and variance $\sigma^2$, the quantity $T = (\bar{X} - \theta)/(s/\sqrt{n})$ has a $t$ distribution with $n - 1$ degrees of freedom, regardless of $\theta$.

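One may construct a confidence interval using a pivot by finding quantiles $t_L^\circ$ and $t_U^\circ$ such that

$$P[t_L^\circ < T(\theta, \text{data}) < t_U^\circ] \ge 1 - \alpha.$$

Then

$$\{\theta \mid t_L^\circ < T(\theta, \text{data}) < t_U^\circ\} \tag{1.20}$$

is a confidence interval, if it is really an interval. In the case when (1.20) is an interval, and when $T(\theta, \text{data})$ is continuous in $\theta$, then the interval is of the form $(L, U)$; that is, the interval does not include the endpoints.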
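The pivot construction can be sketched in a simplified setting. The example below swaps in a Gaussian pivot with a known standard deviation, $T = (\bar{X} - \theta)/(\sigma/\sqrt{n})$, rather than the $t$ pivot, because the Python standard library provides Gaussian quantiles but not $t$ quantiles; the function name and the known-$\sigma$ assumption are illustrative only. Solving $-z_{\alpha/2} < T < z_{\alpha/2}$ for $\theta$ gives the interval endpoints.

```python
from math import sqrt
from statistics import NormalDist, fmean


def gaussian_pivot_interval(data, sigma, alpha):
    """Invert the pivot T = (mean - theta) / (sigma / sqrt(n)),
    ASSUMING sigma is known, so that T is standard Gaussian."""
    n = len(data)
    mean = fmean(data)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # upper alpha/2 quantile
    half_width = z * sigma / sqrt(n)
    # -z < (mean - theta) / (sigma / sqrt(n)) < z, solved for theta:
    return mean - half_width, mean + half_width


lower, upper = gaussian_pivot_interval([1.0, 2.0, 3.0, 4.0, 5.0], 1.0, 0.05)
print(lower, upper)
```

Replacing the Gaussian quantile by the corresponding $t$ quantile with $n - 1$ degrees of freedom, and the known standard deviation by the sample standard deviation, would give the interval based on the $t$ pivot.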

### A Problematic Example

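One should use this test inversion technique with care, as the following problematic case shows. Suppose that $X$ and $Y$ are independent Gaussian random variables, with expectations $\mu$ and $\nu$ respectively, and common known variance $\sigma^2$. Suppose that one desires a confidence interval for $\rho = \mu/\nu$ (Fieller, 1954). The quantity $T = (X - \rho Y)/(\sigma\sqrt{1 + \rho^2})$ has a standard Gaussian distribution, independent of $\rho$, and hence is pivotal. A confidence region is $\{\rho : (X - \rho Y)^2/(\sigma^2(1 + \rho^2)) < z_{\alpha/2}^2\}$. Equivalently, the region is

$$\{\rho \mid Q(\rho) < 0\}, \qquad Q(\rho) = (Y^2 - \upsilon^2)\rho^2 - 2XY\rho + X^2 - \upsilon^2, \tag{1.21}$$

for $\upsilon = \sigma z_{\alpha/2}$.

If $X^2 + Y^2 < \upsilon^2$, then $Q(\rho)$ in (1.21) has a negative coefficient for $\rho^2$, and the maximum value is at $\rho = XY/(Y^2 - \upsilon^2)$. The maximum is

$$\frac{\upsilon^2(-\upsilon^2 + X^2 + Y^2)}{\upsilon^2 - Y^2} < 0,$$

and so the inequality in (1.21) holds for all $\rho$, and the confidence interval is the entire real line.

If $X^2 + Y^2 > \upsilon^2 > Y^2$, then the quadratic form in (1.21) has a negative coefficient for $\rho^2$, and the maximum is positive. Hence values satisfying the inequality in (1.21) are very large and very small values of $\rho$; that is, the confidence region is

$$\left(-\infty,\; \frac{XY + \upsilon\sqrt{X^2 + Y^2 - \upsilon^2}}{Y^2 - \upsilon^2}\right) \cup \left(\frac{XY - \upsilon\sqrt{X^2 + Y^2 - \upsilon^2}}{Y^2 - \upsilon^2},\; \infty\right).$$

If $Y^2 > \upsilon^2$, then the quadratic form in (1.21) has a positive coefficient for $\rho^2$, and the minimum is negative. Then the values of $\rho$ satisfying the inequality in (1.21) are those near the minimizer $XY/(Y^2 - \upsilon^2)$. Hence the interval is

$$\left(\frac{XY - \upsilon\sqrt{X^2 + Y^2 - \upsilon^2}}{Y^2 - \upsilon^2},\; \frac{XY + \upsilon\sqrt{X^2 + Y^2 - \upsilon^2}}{Y^2 - \upsilon^2}\right).$$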
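The three cases can be made concrete with a short sketch. It assumes the pivot $(X - \rho Y)/(\sigma\sqrt{1 + \rho^2})$ for $\rho = \mu/\nu$, so that the region is the set where $(Y^2 - \upsilon^2)\rho^2 - 2XY\rho + X^2 - \upsilon^2 < 0$ with $\upsilon = \sigma z_{\alpha/2}$; the function name is hypothetical, and the probability-zero boundary case $Y^2 = \upsilon^2$ is not handled.

```python
from math import inf, sqrt
from statistics import NormalDist


def ratio_confidence_region(x, y, sigma, alpha):
    """Return the region {rho : a*rho^2 - 2*x*y*rho + x^2 - v^2 < 0}
    as a list of open intervals, with a = y^2 - v^2."""
    v = sigma * NormalDist().inv_cdf(1 - alpha / 2)
    a = y * y - v * v              # coefficient of rho^2
    d = x * x + y * y - v * v      # discriminant divided by 4 v^2
    if d < 0:
        # Parabola opens downward with negative maximum: whole real line.
        return [(-inf, inf)]
    # a == 0 (probability zero) would divide by zero; not handled here.
    roots = sorted(((x * y - v * sqrt(d)) / a, (x * y + v * sqrt(d)) / a))
    if a > 0:
        # Opens upward with negative minimum: bounded interval.
        return [(roots[0], roots[1])]
    # Opens downward with positive maximum: two unbounded rays.
    return [(-inf, roots[0]), (roots[1], inf)]


print(ratio_confidence_region(0.5, 0.5, 1.0, 0.05))  # whole line
print(ratio_confidence_region(3.0, 1.0, 1.0, 0.05))  # two rays
print(ratio_confidence_region(1.0, 5.0, 1.0, 0.05))  # bounded interval
```

In every case the point estimate $X/Y$ lies in the region, since the quadratic evaluated there equals $-\upsilon^2(1 + X^2/Y^2) < 0$; the region is bounded only when the denominator $Y$ is clearly separated from zero.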

# Exercises

1. Demonstrate that the moment generating function for the statistic (1.3), under (1.2), depends on $\delta_1, \ldots, \delta_k$ only through $\sum_{j=1}^k \delta_j^2$.
