


Elementary Tasks in Frequentist Inference
This section reviews the elementary frequentist tasks of hypothesis testing and producing confidence intervals.

Hypothesis Testing

Statisticians are often asked to choose among potential hypotheses about the mechanism generating a set of data. This choice is often phrased as between a null hypothesis, generally implying the absence of a potential effect, and an alternative hypothesis, generally implying the presence of a potential effect. These hypotheses in turn are expressed in terms of sets of probability distributions, or equivalently, in terms of restrictions on a summary (such as the median) of a probability distribution. A hypothesis test is a rule that takes a data set and returns either “Reject the null hypothesis that θ = θ^{0}” or “Do not reject the null hypothesis.” In many applications studied in this manuscript, hypothesis tests are implemented by constructing a test statistic T, depending on data, and a constant t°, such that the statistician rejects the null hypothesis (1.6) if T > t°, and fails to reject it otherwise. The constant t° is called a critical value, and the collection of data sets

{data such that T > t°} (1.5)
for which the null hypothesis is rejected is called the critical region.

One-Sided Hypothesis Tests

For example, if data represent the changes in some physiological measure after receiving some therapy, measured on subjects acting independently, then a null hypothesis might be that each of the changes in measurements comes from a distribution with median zero, and the alternative hypothesis might be that each of the changes in measurements comes from a distribution with median greater than zero. In symbols, if θ represents the median of the distribution of changes, then the null hypothesis is θ = 0, and the alternative is θ > 0.
Such an alternative hypothesis is typically called a one-sided hypothesis. The null hypothesis might then be thought of as the set of all possible distributions for observations with median zero, and the alternative as the set of all possible distributions for observations with positive median. If larger values of θ make larger values of T more likely, and smaller values less likely, then a test rejecting the null hypothesis for data in (1.5) is reasonable. Because null hypotheses are generally smaller in dimension than alternative hypotheses, frequencies of errors are generally easier to control for null hypotheses than for alternative hypotheses. Tests are constructed so that, in cases in which the null hypothesis is actually true, it is rejected with no more than a fixed probability. This probability is called the test level or type I error rate, and is commonly denoted α. Hence the critical value in such cases is defined to be the smallest value t° satisfying

P_{θ^{0}}[T > t°] ≤ α. (1.7)

When the distribution of T is continuous, the critical value satisfies (1.7) with ≤ replaced by equality. Many applications in this volume feature test statistics with a discrete distribution; in this case, the inequality in (1.7) is generally strict. The other type of possible error occurs when the alternative hypothesis is true, but the null hypothesis is not rejected. The probability of erroneously failing to reject the null hypothesis is called the type II error rate, and is denoted β. More commonly, the behavior of the test under an alternative hypothesis is described in terms of the probability of a correct answer, rather than of an error; this probability is called power, and power is 1 − β. One might attempt to control power as well, but, unfortunately, in the common case in which an alternative hypothesis contains probability distributions arbitrarily close to those in the null hypothesis, the type II error rate will come arbitrarily close to one minus the test level, which is quite large.
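The level and power of such a one-sided test can be checked empirically. The following sketch (not from the text) uses Monte Carlo simulation, assuming Gaussian observations with known standard deviation; the function names and the choices n = 25 and alternative θ_a = 0.5 are illustrative only. Here the test statistic is the sample mean, whose standard deviation 1/√n plays the role of σ_{0} in the discussion below.

```python
# Illustrative sketch (not from the text): Monte Carlo estimates of the
# type I error rate and the power of a one-sided test for the mean of
# Gaussian observations with known standard deviation sigma0 = 1.
import random
from statistics import NormalDist, fmean

def one_sided_test(data, theta0=0.0, sigma0=1.0, alpha=0.025):
    """Reject the null hypothesis theta = theta0 when the sample mean
    exceeds the critical value theta0 + sigma0 * z_alpha / sqrt(n)."""
    n = len(data)
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # Phi(z_alpha) = 1 - alpha
    return fmean(data) > theta0 + sigma0 * z_alpha / n ** 0.5

def rejection_rate(theta, n=25, reps=20_000, seed=1):
    """Fraction of simulated data sets of size n, drawn with mean theta,
    for which the test rejects the null hypothesis theta = 0."""
    rng = random.Random(seed)
    rejections = sum(
        one_sided_test([rng.gauss(theta, 1.0) for _ in range(n)])
        for _ in range(reps)
    )
    return rejections / reps

level = rejection_rate(0.0)  # under the null: should be near alpha = 0.025
power = rejection_rate(0.5)  # power at the alternative theta_a = 0.5
```

The estimated level should fall near 0.025, and the estimated power near 0.70, illustrating that level is controlled exactly by construction while power depends on the particular alternative chosen.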
Furthermore, for a fixed sample size and mechanism for generating data, once a particular distribution in the alternative hypothesis is selected, the smallest possible type II error rate is fixed, and cannot be independently controlled. Hence tests are generally constructed primarily to control level. Under this paradigm, then, a test is constructed by specifying α, choosing a critical value to give the test this type I error rate, and determining whether this test rejects the null hypothesis or fails to reject the null hypothesis. In the one-sided alternative hypothesis formulation {θ > θ^{0}}, the investigator is, at least in principle, interested in detecting departures from the null hypothesis that vary in proximity to the null hypothesis. (The same observation will hold for two-sided tests in the next subsection.) For planning purposes, however, investigators often pick a particular value within the alternative hypothesis. This particular value might be the minimal value of practical interest, or a value that other investigators have estimated. They then calculate the power at this alternative, to ensure that it is large enough to meet their needs. A power that is too small indicates that there is a substantial chance that the investigator’s alternative hypothesis is correct, but that they will fail to demonstrate it. Powers near 80% are typical targets. Consider a test with a null hypothesis of form θ = θ^{0} and an alternative hypothesis of form θ = θ_{a}, using a statistic T such that T ~ G(θ, σ_{0}^{2}), where G(θ, σ^{2}) denotes the Gaussian distribution with expectation θ and variance σ^{2}. Test with level α, and without loss of generality assume that θ_{a} > θ^{0}. In this case, the critical value is approximately t° = θ^{0} + σ_{0}z_{α}. Here z_{α} is the number such that Φ(z_{α}) = 1 − α. A common level for such a one-sided test is α = 0.025; z_{0.025} = 1.96. The power for such a one-sided test is

Φ((θ_{a} − θ^{0})/σ_{0} − z_{α}). (1.8)
One might plan an experiment by substituting the null hypothesis value θ^{0} and σ_{0} into (1.8), and solving for the alternative θ_{a} giving a target power. One can then ask whether this effect size is plausible. More commonly, one fixes the alternative of interest and solves for the sample size giving the desired power.

Two-Sided Hypothesis Tests

In contrast to one-sided alternatives, consider a two-sided alternative hypothesis θ ≠ θ^{0}. In this case, one often uses a test that rejects the null hypothesis for data sets in

{T > t°_{U} or T < t°_{L}}. (1.10)
In this case, there are two critical values, chosen so that

P_{θ^{0}}[T < t°_{L}] + P_{θ^{0}}[T > t°_{U}] ≤ α, (1.11)
and so that the inequality is as close to equality as possible. Many of the statistics constructed in this volume require an arbitrary choice by the analyst of the direction of effect, and choosing the direction of effect differently typically changes the sign of T. In order to keep the analytical results invariant to this choice of direction, the critical values are chosen to make the two probabilities in (1.11) equal. Then the critical values solve the equations

P_{θ^{0}}[T < t°_{L}] ≤ α/2 and P_{θ^{0}}[T > t°_{U}] ≤ α/2, (1.12)

with t°_{U} chosen as small as possible, and t°_{L} chosen as large as possible, consistent with (1.12). Comparing with (1.7), the two-sided critical value is calculated exactly as is the one-sided critical value for an alternative hypothesis in the appropriate direction, and for a test level half that of the one-sided test. Hence a two-sided test of level 0.05 is constructed in the same way as two one-sided tests of level 0.025. Often the critical region implicit in (1.11) can be represented by creating a new statistic that is large when T is either large or small. That is, one might set W = |T − E_{θ^{0}}[T]|. In this case, if t°_{L} and t°_{U} are symmetric about E_{θ^{0}}[T], the critical region may be written as

{W > w°} (1.13)
for w° = t°_{U} − E_{θ^{0}}[T]. Alternatively, one might define W = (T − E_{θ^{0}}[T])^{2}, and, under the same symmetry condition, use as the critical region (1.13) for w° = (t°_{U} − E_{θ^{0}}[T])^{2}. In the absence of such a symmetry condition, w° may be calculated from the distribution of W directly, by choosing w° to make the probability of the set in (1.13) equal to α. Statistics T for which (1.10) is a reasonable critical region are inherently one-sided, since the two-sided test is constructed from one-sided tests combining evidence pointing in opposite directions. Similarly, statistics W for which (1.13) is a reasonable critical region for the two-sided alternative are inherently two-sided. Power for the two-sided test is the same probability as calculated in (1.11), with θ_{a} substituted for θ^{0}, and power substituted for α. Again, assume that larger values of θ make larger values of T more likely. Then, for alternatives θ_{a} greater than θ^{0}, the first probability added in (1.11)
is quite small, and is typically ignored for power calculations. Additionally, rejection of the null hypothesis because the evidence is in the opposite direction of that anticipated will result in conclusions from the experiment not comparable to those for which the power calculation is constructed. Hence power for the two-sided test is generally approximated as the power for the one-sided test with level half that of the two-sided test, and α in (1.8) is often 0.025, corresponding to half of the two-sided test level. Some tests to be constructed in this volume may be expressed as W = Σ_{j=1}^{k} U_{j}^{2} for variables U_{j} which are, under the null hypothesis, approximately standard Gaussian and independent; furthermore, in such cases, the critical region for such tests is often of the form {W > w°}. In such cases, the test of level α rejects the null hypothesis when W > χ²_{k,α}, for χ²_{k,α} the 1 − α quantile of the χ²_{k} distribution. If the standard Gaussian approximation for the distribution of the U_{j} is only approximately correct, then the resulting test will have level α approximately, but not exactly. If, under the alternative hypothesis, the variables U_{j} have expectations μ_{j} and standard deviations ξ_{j}, the alternative distribution of W will be a complicated weighted sum of noncentral χ²_{1} variables. Usually, however, the impact of the move from the null distribution to the alternative distribution is much larger on the component expectations than on the standard deviations, and one might treat these alternative standard deviations as fixed at 1. With this simplification, the sampling distribution of W under the alternative is χ²_{k}(Σ_{j=1}^{k} μ_{j}^{2}), the noncentral chi-square distribution with noncentrality parameter Σ_{j=1}^{k} μ_{j}^{2}.

P-values

Alternatively, one might calculate a test statistic, and determine the test level at which one transitions from rejecting to not rejecting the null hypothesis. This quantity is called a p-value. For a one-sided test with critical region of form (1.5), the p-value is given by

P_{θ^{0}}[T ≥ t_{obs}]
for t_{obs} the observed value of the test statistic. For two-sided critical values of form (1.10), with condition (1.12), the p-value is given by

2 min(P_{θ^{0}}[T ≥ t_{obs}], P_{θ^{0}}[T ≤ t_{obs}]).
These p-values are interpreted as leading to rejection of the null hypothesis when they are as low as or lower than the test level specified in advance by the investigator before data collection. Inferential procedures that highlight p-values are indicative of the inferential approach of Fisher (1925), while those that highlight prespecified test levels and powers are indicative of the approach of Neyman and Pearson (1933). I refer readers to a thorough survey (Lehmann, 1993), and note here only that while I find the prespecified test level arguments compelling, problematic examples leading to undesirable interpretations of p-values are rare using the techniques developed in this volume, and, more generally, the contrasts between techniques advocated by these schools of thought are not central to the questions investigated here.

Confidence Intervals

A confidence interval of level 1 − α for a parameter θ is defined as a set (L, U) such that L and U depend on data, and such that, for any θ,

P_{θ}[L < θ < U] ≥ 1 − α.
The most general method for constructing a confidence interval is test inversion. For every possible null value θ^{0}, find a test of the null hypothesis θ = θ^{0}, versus the two-sided alternative, of level no larger than α. Then the confidence set is

{θ^{0} such that the hypothesis θ = θ^{0} is not rejected}. (1.16)
In many cases, (1.16) is an interval. In such cases, one attempts to determine the lower and upper bounds of the interval, either analytically or numerically. Often, such tests are phrased in terms of a quantity W(θ) depending on both the data and the parameter, such that the test rejects the null hypothesis θ = θ^{0} if and only if W(θ^{0}) > w°(θ^{0}), for some critical value w°(θ^{0}) that might depend on the null hypothesis.

P-value Inversion

One might construct confidence intervals through tail probability inversion. Suppose that one can find a univariate statistic T whose distribution depends on the unknown parameter θ, such that potential one-sided p-values are monotonic in θ for each potential statistic value t. Typical applications have

P_{θ}[T ≥ t] increasing in θ and P_{θ}[T ≤ t] decreasing in θ, (1.17)
with the probabilities in (1.17) continuous in θ. Let t be the observed value of T. Under (1.17), the confidence set (1.16) is an interval, of form (θ^{L}, θ^{U}), with endpoints satisfying

P_{θ^{L}}[T ≥ t] = α/2 and P_{θ^{U}}[T ≤ t] = α/2.

There may be t such that the equation P_{θ^{L}}[T ≥ t] = α/2 has no solution, because P_{θ}[T ≥ t] > α/2 for all θ. In such cases, take θ^{L} to be the lower bound on possible values for θ. For example, if π ∈ [0, 1], and T ~ Bin(n, π), then P_{π}[T ≥ 0] = 1 > α/2 for all π, P_{π}[T ≥ 0] = α/2 has no solution, and π^{L} = 0. Alternatively, if θ can take any real value, and T ~ Bin(n, exp(θ)/(1 + exp(θ))), then P_{θ}[T ≥ 0] = α/2 has no solution, and θ^{L} = −∞. Similarly, there may be t such that the equation P_{θ^{U}}[T ≤ t] = α/2 has no solution, because P_{θ}[T ≤ t] > α/2 for all θ. In such cases, take θ^{U} to be the upper bound on possible values for θ. Construction of intervals for the binomial proportion represents a simple example in which p-values may be inverted (Clopper and Pearson, 1934).

Test Inversion with Pivotal Statistics

Confidence interval construction is simplified when there exists a random quantity, generally involving an unknown parameter θ, with a distribution that does not depend on θ. Such a quantity is called a pivot. For instance, in the case of independent and identically distributed observations with average X̄ and standard deviation s from a Gaussian distribution G(θ, σ^{2}), the quantity T = (X̄ − θ)/(s/√n) has a t distribution with n − 1 degrees of freedom, regardless of θ. One may construct a confidence interval using a pivot by finding quantiles t°_{L} and t°_{U} such that

P[t°_{L} < T < t°_{U}] = 1 − α.
Then

{θ such that t°_{L} < T(θ, data) < t°_{U}} (1.20)
is a confidence interval, if it is really an interval. In the case when (1.20) is an interval, and when T(θ, data) is continuous in θ, the interval is of the form (L, U); that is, the interval does not include the endpoints.

A Problematic Example

One should use this test inversion technique with care, as the following problematic case shows. Suppose that X and Y are Gaussian random variables, with expectations μ and ν respectively, and common known variance σ^{2}, and that a confidence interval is desired for the ratio ρ = ν/μ. Since Y − ρX has expectation zero and variance σ^{2}(1 + ρ^{2}) when ρ = ν/μ, the confidence interval consists of the values of ρ satisfying

Q(ρ) = (Y − ρX)^{2} − v^{2}(1 + ρ^{2}) ≤ 0 (1.21)
for v = σz_{α/2}. If X^{2} + Y^{2} < v^{2}, then Q(ρ) in (1.21) has a negative coefficient for ρ^{2}, and its maximum value is at ρ = XY/(X^{2} − v^{2}). The maximum is v^{2}(v^{2} − X^{2} − Y^{2})/(X^{2} − v^{2}) < 0, and so the inequality in (1.21) holds for all ρ, and the confidence interval is the entire real line. If X^{2} + Y^{2} > v^{2} > X^{2}, then the quadratic form in (1.21) again has a negative coefficient for ρ^{2}, and the maximum is positive. Hence the values satisfying the inequality in (1.21) are the very large and very small values of ρ; that is, the confidence interval is the union of two infinite rays, below the smaller root and above the larger root of Q(ρ).
If X^{2} > v^{2}, then the quadratic form in (1.21) has a positive coefficient for ρ^{2}, and the minimum is negative. Then the values of ρ satisfying the inequality in (1.21) are those near the minimizer XY/(X^{2} − v^{2}), and the confidence interval is the bounded interval between the two roots of Q(ρ).

Exercises

1. Demonstrate that the moment generating function for the statistic (1.3), under (1.2), depends on S_{1}, ..., S_{n} only through Σ_{j=1}^{n} S_{j}.
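As a computational companion to the p-value inversion discussion above, here is a sketch (not part of the text) of the Clopper and Pearson (1934) interval for a binomial proportion, solving the endpoint equations P_{π^{L}}[T ≥ t] = α/2 and P_{π^{U}}[T ≤ t] = α/2 by bisection. The function names and the tolerance are illustrative choices; the handling of t = 0 and t = n follows the no-solution conventions described in the text.

```python
# Illustrative sketch (not from the text): the Clopper-Pearson interval
# for a binomial proportion, obtained by inverting one-sided p-values.
from math import comb

def binom_cdf(t, n, p):
    """P[T <= t] for T ~ Bin(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(t + 1))

def clopper_pearson(t, n, alpha=0.05, tol=1e-10):
    """Level 1 - alpha interval for pi after observing t successes in n trials."""
    # Lower endpoint: solve P_pi[T >= t] = alpha/2; P[T >= t] increases in pi.
    if t == 0:
        lower = 0.0  # P[T >= 0] = 1 > alpha/2 for all pi, so take the bound
    else:
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if 1 - binom_cdf(t - 1, n, mid) < alpha / 2:
                lo = mid  # upper-tail probability still too small: move up
            else:
                hi = mid
        lower = (lo + hi) / 2
    # Upper endpoint: solve P_pi[T <= t] = alpha/2; P[T <= t] decreases in pi.
    if t == n:
        upper = 1.0  # P[T <= n] = 1 > alpha/2 for all pi, so take the bound
    else:
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if binom_cdf(t, n, mid) > alpha / 2:
                lo = mid  # lower-tail probability still too large: move up
            else:
                hi = mid
        upper = (lo + hi) / 2
    return lower, upper
```

For t = 3 successes in n = 10 trials at level 0.95, this recovers the familiar interval of roughly (0.067, 0.652); for t = 0, the lower endpoint is the boundary value π^{L} = 0, as in the example in the text.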
