


Empirical Levels and Powers of Two-Sample Tests
As in Table 2.1, one might simulate data from a variety of distributions and compare levels of the various two-sample tests. Results are in Table 3.4. Table 3.4 shows that the extreme conservativeness of Mood's test justifies its exclusion from practical consideration. We see that the Wilcoxon test, calibrated exactly using its exact null distribution, falls short of the desired level; a less-conservative equal-tailed test would have a level exceeding the nominal target of 0.05. The conservativeness of the Savage score test is somewhat surprising. The close agreement between the level of the t-test and the nominal level with Gaussian data is as expected, as is the poor agreement between the level of the t-test and the nominal level with Cauchy data. As was done in Table 2.3, one might perform a similar simulation under the alternative hypotheses to calculate power. In this case, alternative hypotheses were generated by offsetting one group by one unit. Results are in Table 3.5. Table 3.5 excludes the exact version of the Wilcoxon test and Mood's test, since for these sample sizes (M_j = 10 for j = 1, 2), they fail to achieve the desired level for any data distribution. The approximate Wilcoxon test has power comparable to that of the t-test under the conditions optimal for the t-test, and also maintains high power throughout.

Adaptation to the Presence of Tied Observations

The Mann-Whitney-Wilcoxon statistic is designed to be used for variables arising from a continuous distribution. Processes expected to produce data with distinct values, however, sometimes produce tied values, frequently because of limits on measurement precision. Sometimes an observation from the first group is tied with one from the second group. Then the scheme for assigning scores must be modified. Tied observations are frequently assigned scores averaged over the scores that would have been assigned had the data been distinct; for example, if Z_(1), ..., Z_(N) are the ordered values from the combination of the two samples, and if Z_(j) = Z_(j+1), then both observation j and observation j + 1 are assigned score (a_j + a_{j+1})/2. The variance of the test statistic must be adjusted for this change in scores. When both tied observations come from the first group, or both from the second group, then one might assume that the tie arises because of imprecise measurement of a process that, measured more precisely, would have produced untied individuals. The test statistic is unaffected by assignment of scores to observations according to either of the potential orderings. However, the permutation distribution is affected, because many of the permutations considered will split the tied observations into different groups. Return to variance formula (3.11). The average score is unchanged by this modification of the scores, but the sum of squared scores changes by a_j^2 + a_{j+1}^2 − (a_j + a_{j+1})^2/2 = (a_j − a_{j+1})^2/2. Then, for each pair of ties in the data, the variance (3.11) is reduced by M_1 M_2 (a_j − a_{j+1})^2 / (2N(N − 1)). This process could be continued for triplets, etc., with more complicated expressions for the correction. Lehmann (2006) derives these corrections for generic numbers of replicated values, in the simpler case in which a_j = j; in this case, the correction is applied to the simpler variance expression (3.22). It is simpler, however, to bypass (3.22) and, instead of correcting (3.11), to recalculate (3.11) using the new scores. When the assumption of continuity of the distributions of underlying measurements does not hold, the distribution of rank statistics is no longer independent of the underlying data distribution, since the rank statistic distribution will then depend on the probability of ties. Hence no exact calculation of the form in §3.4.1 is possible. It was noted at the end of §3.4.1.1 that the continuity correction is of little importance in the case of rank tests.
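The midrank scheme and the resulting variance adjustment can be checked numerically. The following is an illustrative Python sketch (not the book's code; the helper perm_variance is introduced here to stand in for (3.11)): it verifies that recomputing the permutation variance from midranks agrees with subtracting the tie correction M_1 M_2 (a_j − a_{j+1})^2 / (2N(N − 1)) from the untied-rank variance.

```python
import numpy as np
from scipy.stats import rankdata

def perm_variance(scores, m1):
    """Permutation variance of the sum of m1 scores drawn without
    replacement from `scores`: m1*m2*sum((a - abar)^2)/(n*(n-1))."""
    a = np.asarray(scores, dtype=float)
    n = len(a)
    m2 = n - m1
    return m1 * m2 * np.sum((a - a.mean()) ** 2) / (n * (n - 1))

# Combined sample of N = 6 with one tied pair; ranks 3 and 4 become midrank 3.5.
z = np.array([1.2, 2.0, 3.3, 3.3, 4.1, 5.0])
midranks = rankdata(z, method="average")   # [1, 2, 3.5, 3.5, 5, 6]

# Variance recomputed directly from the midrank scores...
v_tied = perm_variance(midranks, m1=3)
# ...equals the untied-rank variance minus the correction for the tied
# pair with a_j = 3, a_{j+1} = 4:  M1*M2*(a_j - a_{j+1})^2 / (2*N*(N-1))
v_untied = perm_variance(np.arange(1, 7), m1=3)
correction = 3 * 3 * (3 - 4) ** 2 / (2 * 6 * 5)
print(round(v_tied, 10), round(v_untied - correction, 10))
```

Both quantities are 5.1, so correcting (3.11) and recomputing it with the new scores agree, as the text notes.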
When average ranks replace the original ranks, the continuity correction argument using Figure 2.2 no longer holds. Potential values of the test statistic are in some cases closer together than 1 unit apart, and, in such cases, the continuity correction might be abandoned.

Mann-Whitney-Wilcoxon Null Hypotheses

The Mann-Whitney-Wilcoxon test was constructed to test whether the distribution F of the X variables is the same as the distribution G of the Y variables. This null hypothesis implies that P[X_i < Y_j] = 1/2. Unequal pairs F and G violate the null hypothesis of this test. However, certain distribution pairs violating the null hypothesis fall in the alternative hypothesis, but the Mann-Whitney-Wilcoxon test has no power to distinguish them. This is true if F and G are unequal but symmetric about the same point. In this case, the standard error of the Mann-Whitney-Wilcoxon test statistic (3.11) is no longer correct, and the expectation under this alternative is the same as it is under the null. The same phenomenon arises if ∫ F(y)g(y) dy = 1/2. As an example, suppose that Y_j ~ …

Efficiency and Power of Two-Sample Tests

In this section, consider models of the form (3.1), with the null hypothesis θ = θ^0. Without loss of generality, one may take θ^0 = 0; otherwise, shift Y_j by θ^0. Relative efficiency has already been defined for test statistics T_i such that (T_i − μ_i(θ))/(σ_i(θ)/√N) is approximately standard Gaussian, for N the total sample size. Asymptotic relative efficiency calculations require specification of how M_1 and M_2 move together. Let M_1 = λN, M_2 = (1 − λ)N, for λ ∈ (0, 1).

Efficacy of the Gaussian-Theory Test

As in the one-sample case, the large-sample behavior of this test will be approximated by a version with known variance. Here μ(θ) = θ, and

Var[T] = ρ^2 (1/M_1 + 1/M_2) = ρ^2/(N λ(1 − λ)) = ρ^2/(N ξ^2),
for ρ^2 the variance of each observation, and ξ = √(λ(1 − λ)). For example, suppose that the X_i and Y_j are standard Gaussian; in this case ρ = 1, and the efficacy is e = ξ. Alternatively, suppose that the observations are logistically distributed. Each observation has variance π^2/3, and the efficacy is e = ξ√3/π = 0.551ξ. The analysis of §2.4.1, for tests as in (2.15) and variance scaled as in (2.19), allows for calculation of asymptotic relative efficiency, in terms of the separate efficacies, defined as the ratio of μ'(θ) to σ(θ).

Efficacy of the Mann-Whitney-Wilcoxon Test

In order to apply the results for the asymptotic relative efficiency of §2.4, the test statistic must be scaled so that the asymptotic variance is approximately equal to a constant divided by the sample size, and must be such that the derivative of the mean function is available at zero. Using the Mann-Whitney formulation, and rescaling so that T = Σ_{i=1}^{M_1} Σ_{j=1}^{M_2} 1{X_i < Y_j}/(M_1 M_2), then

Var_0[T] = (N + 1)/(12 M_1 M_2) ≈ 1/(12 N λ(1 − λ)), (3.24)

and so σ(0) = 1/√(12 λ(1 − λ)). Also,

μ(θ) = E[T] = P[X_1 < Y_1]. (3.25)
For example, suppose that Y_j is Gaussian with mean θ and unit variance, and X_i is Gaussian with mean 0 and unit variance. The differences X_i − Y_j are Gaussian with mean −θ and variance 2, and so

μ(θ) = Φ(θ/√2). (3.26)
Hence μ'(0) = 1/(2√π). Also, (3.24) still holds, and the efficacy is e = √(12 λ(1 − λ))/(2√π) = ξ√(3/π).
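These Gaussian-case values can be checked numerically. The following Python sketch (an illustration, with λ = 1/2 assumed for concreteness) recovers μ'(0) = 1/(2√π) and the familiar asymptotic relative efficiency 3/π ≈ 0.955 against the Gaussian-theory test, whose efficacy is ξ:

```python
import math
from scipy.stats import norm

lam = 0.5                                     # assumed λ; any value in (0,1) works
xi = math.sqrt(lam * (1 - lam))               # ξ = sqrt(λ(1-λ))

# μ(θ) = Φ(θ/√2), so μ'(0) = φ(0)/√2 = 1/(2√π)
mu_prime_0 = norm.pdf(0) / math.sqrt(2)
sigma0 = 1 / math.sqrt(12 * lam * (1 - lam))  # null standard deviation, from (3.24)

e_mww = mu_prime_0 / sigma0                   # efficacy ξ√(3/π)
are = (e_mww / xi) ** 2                       # vs. the Gaussian-theory test (e = ξ)
print(round(are, 3))                          # 0.955, i.e. 3/π
```

The asymptotic relative efficiency does not depend on λ, since ξ cancels in the ratio.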
Alternatively, suppose that the observations have a logistic distribution. In this case,

μ(θ) = e^θ (e^θ − θ − 1)/(e^θ − 1)^2, (3.27)

and μ'(0) = ∫ g(y)^2 dy = 1/6, so that the efficacy is e = √(12 λ(1 − λ))/6 = ξ/√3.
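The logistic-case values can likewise be checked numerically. This Python sketch (illustrative; λ = 1/2 assumed) evaluates μ(1), confirms μ'(0) = ∫ g(y)^2 dy = 1/6 by quadrature, and recovers the asymptotic relative efficiency π^2/9 ≈ 1.097 against the Gaussian-theory test:

```python
import math
from scipy.integrate import quad
from scipy.stats import logistic

def mu_logistic(theta):
    """P[X < Y] under a logistic shift alternative, as in (3.27)."""
    et = math.exp(theta)
    return et * (et - theta - 1) / (et - 1) ** 2

mu1 = mu_logistic(1.0)             # the value used in Example 3.8.1
# μ'(0) equals the integral of the squared logistic density, which is 1/6
slope0 = quad(lambda y: logistic.pdf(y) ** 2, -math.inf, math.inf)[0]

lam = 0.5
xi = math.sqrt(lam * (1 - lam))
e_mww = slope0 * math.sqrt(12 * lam * (1 - lam))  # μ'(0)/σ(0)
e_t = xi * math.sqrt(3) / math.pi                 # Gaussian-theory test, logistic data
print(round(mu1, 3), round((e_mww / e_t) ** 2, 3))  # 0.661 1.097
```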
Efficacies for more general rank statistics may be obtained using calculations involving expectations of derivatives of underlying densities with respect to the model parameter, evaluated at order statistics under the null hypothesis, without providing rank expectations away from the null (Dwass, 1956).

TABLE 3.6: Asymptotic relative efficiencies for two-sample tests (ξ = (λ(1 − λ))^{1/2}, ρ = √Var[X_1]).

Summarizing Asymptotic Relative Efficiency

Table 3.6 contains results of calculations for asymptotic relative efficiencies of the Mann-Whitney-Wilcoxon test relative to the pooled t-test. For Gaussian variables, as expected, the pooled t-test is more efficient, but only by 5%. For a distribution with moderate tails, the logistic, the Mann-Whitney-Wilcoxon test is 10% more efficient.

Power for Mann-Whitney-Wilcoxon Testing

Power may be calculated for Mann-Whitney-Wilcoxon testing, using (2.22) in conjunction with (3.24) for the null variance of the rescaled test, and (3.25), adapted to the particular distribution of interest. Applications to Gaussian and logistic observations are given by (3.26) and (3.27), respectively. Zhong and Kolassa (2017) give second moments for this statistic under the alternative hypothesis, and allow for calculation of σ(θ) for non-null θ. The second moment depends not only on the probability (3.25), but also on probabilities involving two independent copies of X and one copy of Y, and two independent copies of Y and one copy of X. This additional calculation allows the use of (2.17); calculations below involve the simpler formula.

Example 3.8.1 Consider using two independent sets of 40 observations each to test the null hypothesis of equal distributions vs. the alternative that (3.1) holds, with θ = 1, and with observations having a logistic distribution. Then, using (3.27), μ(1) = e(e − 2)/(e − 1)^2 = 0.661. The function μ(θ) has a removable singularity at zero; fortunately the null probability is easily seen to be 1/2. Then λ = 1/2, N = 80, μ(0) = 1/2, μ(1) = 0.661, σ(0) = 1/√(12 × (1/2) × (1/2)) = 1/√3. The power for the one-sided level 0.025 test, from (2.22), is Φ(√80 (0.661 − 0.5)/(1/√3) − 1.96) = Φ(0.534) = 0.703.
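The power calculation of Example 3.8.1 can be reproduced in a few lines of Python (a sketch using the example's rounded value μ(1) ≈ 0.661; small rounding differences from hand calculation are expected):

```python
import math
from scipy.stats import norm

# Example 3.8.1 quantities: λ = 1/2, N = 80, logistic shift θ = 1
mu0, mu1 = 0.5, 0.661                    # null and alternative P[X < Y]
sigma0 = 1 / math.sqrt(12 * 0.5 * 0.5)   # σ(0) = 1/√3
z = norm.ppf(0.975)                      # 1.96 for a one-sided level-0.025 test

power = norm.cdf(math.sqrt(80) * (mu1 - mu0) / sigma0 - z)
print(round(power, 3))                   # 0.703
```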
One could also determine the total sample size needed to obtain 80% power. Using (2.24), one needs (1/√3)^2 (z_{0.025} + z_{0.2})^2/(0.661 − 0.5)^2 = 151.4; choose 76 per group. In contrast with the shift alternative (3.1), one might consider the Lehmann alternative

G(y) = F(y)^k

for some k ≠ 1. Power calculations for Mann-Whitney-Wilcoxon tests for this alternative have the advantage that power does not depend on the underlying distribution (Lehmann, 1953). As noted above, while efficacy calculations are available for more general rank statistics, the non-asymptotic expectation of the test statistic under the alternative is difficult enough that it is omitted here.

Testing Equality of Dispersion

One can adapt the above rank tests to test whether two populations have equal dispersion, assuming a common center. If one population is more spread out than another, then the members of that sample would tend to lie outside the points from the other sample. This motivates the Siegel-Tukey test. Rank the points, with the minimum getting rank 1, the maximum getting rank 2, the second largest getting rank 3, the second smallest getting rank 4, the third smallest getting rank 5, and continuing to alternate. Then sum the ranks associated with one of the samples. Under the null hypothesis, this statistic has the same distribution as the Wilcoxon rank-sum statistic. Alternately, one might perform the Ansari-Bradley test, by ranking from the outside in, with observations equally extreme from either end getting equal ranks, and again summing the ranks from one sample. The Ansari-Bradley test has a disadvantage with respect to the Siegel-Tukey test, in that one can't use off-the-shelf Wilcoxon tail calculations. On the other hand, the Ansari-Bradley test is exactly invariant to reflection.

TABLE 3.7: Yarn data with rankings for testing dispersion
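The alternating ranking scheme just described can be sketched in Python (an illustrative implementation, not the book's siegel.tukey.ranks; implementations differ in how they treat the innermost point):

```python
def siegel_tukey_ranks(x):
    """Siegel-Tukey ranks: 1 to the minimum, 2 and 3 to the two largest,
    4 and 5 to the next two smallest, alternating inward."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])  # indices, smallest first
    lo, hi = 0, n - 1
    ranks = [0] * n
    r = 1
    take_low, to_take = True, 1        # take 1 from the low end, then 2, 2, 2, ...
    while lo <= hi:
        for _ in range(to_take):
            if lo > hi:
                break
            if take_low:
                ranks[order[lo]] = r
                lo += 1
            else:
                ranks[order[hi]] = r
                hi -= 1
            r += 1
        take_low = not take_low
        to_take = 2
    return ranks

x = [3.1, 0.5, 9.2, 4.4, 7.8, 1.0]
print(siegel_tukey_ranks(x))   # [5, 1, 2, 6, 3, 4]: extreme points get small ranks
```

Summing these ranks for one of the two samples gives a statistic with the Wilcoxon rank-sum null distribution, as noted above.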
Example 3.9.1 Consider again the yarn data of Example 3.3.1. Test equality of dispersion between the two types of yarn. Ranks are given in Table 3.7, and are calculated in package NonparametricHeuristic as

yarn$ab <- pmin(rank(yarn$strength), rank(-yarn$strength))
yarn$st <- round(siegel.tukey.ranks(yarn$strength), 2)
yarnranks <- yarn[order(yarn$strength), c("strength", "type", "ab", "st")]

R functions may be used to perform the tests:

library(DescTools) # for SiegelTukeyTest
SiegelTukeyTest(strength ~ type, data = yarn)
yarnsplit <- split(yarn$strength, yarn$type)
ansari.test(yarnsplit[[1]], yarnsplit[[2]])

to find the Siegel-Tukey p-value as 0.7179, and the Ansari-Bradley p-value as 0.6786. There is no evidence of inequality of dispersion.
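For readers working in Python, SciPy provides the same Ansari-Bradley test. This sketch uses simulated stand-in data, since the yarn strengths are not reproduced here:

```python
import numpy as np
from scipy.stats import ansari

# Hypothetical two-sample data playing the role of the two yarn types
rng = np.random.default_rng(1)
x = rng.normal(scale=1.0, size=10)
y = rng.normal(scale=1.0, size=10)

stat, p = ansari(x, y)   # SciPy analogue of R's ansari.test
print(stat, p)
```

With equal-dispersion samples such as these, the p-value will typically be unremarkable, as with the yarn data.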