
Bivariate Methods

Suppose that independent random vectors $(X_i, Y_i)$ all have the same joint density $f_{X,Y}(x, y)$. This chapter investigates the relationship between $X_i$ and $Y_i$ for each $i$. In particular, consider testing the null hypothesis $f_{X,Y}(x, y) = f_X(x) f_Y(y)$, without specifying an alternative hypothesis for $f_{X,Y}$, or even without knowledge of the null marginal densities.

This null hypothesis is most easily tested against the alternative hypothesis that, vaguely, large values of $X$ are associated with large values of $Y$ (or vice versa). Furthermore, if the null hypothesis is not true, the strength of the association between $X_j$ and $Y_j$ must be measured.

Parametric Approach

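Before developing nonparametric approaches to assessing relationships between variables, this section reviews a standard parametric approach, that of the Pearson correlation (Edgeworth, 1893):

$$r_P = \frac{\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum_{i=1}^n (X_i - \bar X)^2 \sum_{i=1}^n (Y_i - \bar Y)^2}}, \tag{6.1}$$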

derived as what would now be termed the maximum likelihood estimator of the correlation for the bivariate Gaussian distribution. This gives the slope of the least squares line fitting $Y$ to $X$, after scaling both variables by their standard deviations. The Cauchy-Schwarz theorem says that $r_P$ is always in $[-1, 1]$. Perfect positive or negative linear association is reflected in a value for $r_P$ of 1 or $-1$ respectively. Furthermore, there are a variety of early exact and approximate distributional results for $r_P$ assuming that the observations are multivariate Gaussian.
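As a minimal sketch of these properties, the following R code computes the Pearson correlation directly from centered sums and compares it with the built-in `cor()` and with the least squares slope after scaling both variables by their standard deviations. The vectors `x` and `y` are simulated purely for illustration and do not come from the text.

```r
# Hypothetical data, for illustration only (not from the text)
set.seed(1)
x <- rnorm(20)
y <- 0.5 * x + rnorm(20)

# Pearson correlation computed directly from centered sums
xc <- x - mean(x)
yc <- y - mean(y)
rp <- sum(xc * yc) / sqrt(sum(xc^2) * sum(yc^2))

# Slope of the least squares line after scaling by standard deviations
slope <- unname(coef(lm(I(y / sd(y)) ~ I(x / sd(x))))[2])

print(c(by_hand = rp, builtin = cor(x, y), scaled_slope = slope))
```

All three printed values should agree up to floating-point error, and the common value lies in $[-1, 1]$.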

Permutation Inference

Even when observations are not multivariate Gaussian, the measure (6.1) remains a plausible summary of association between variables. This section presents a distributional result for the Pearson correlation that does not require knowledge of the multivariate distribution of observations.

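Under the null hypothesis that $X_j$ is independent of $Y_j$ for all $j$, every permutation of the $Y$ values among experimental units, with the $X$ values held fixed, is equally likely. Under this permutation distribution, conditional on the collections $\{X_1, \ldots, X_n\}$ and $\{Y_1, \ldots, Y_n\}$, the variance of $r_P$ may be calculated directly. Note that $r_P$ depends only on the differences between the observations and their means, and so, without loss of generality, assume that $\bar X = \bar Y = 0$. Let $(Z_1, \ldots, Z_n)$ be a random permutation of $\{Y_1, \ldots, Y_n\}$. Then $\mathrm{Var}[Z_i] = \sum_j Y_j^2/n$; denote this by $\sigma_Y^2$. Also, since $\sum_i Z_i = 0$ for every permutation,

$$0 = \mathrm{Var}\Big[\sum_i Z_i\Big] = n\sigma_Y^2 + n(n-1)\,\mathrm{Cov}[Z_i, Z_j], \quad \text{and hence} \quad \mathrm{Cov}[Z_i, Z_j] = -\frac{\sigma_Y^2}{n-1} \text{ for } i \ne j.$$

So, using $\sum_{i \ne j} X_i X_j = -\sum_i X_i^2$,

$$\mathrm{Var}\Big[\sum_i X_i Z_i\Big] = \sigma_Y^2 \sum_i X_i^2 - \frac{\sigma_Y^2}{n-1} \sum_{i \ne j} X_i X_j = \frac{n^2 \sigma_X^2 \sigma_Y^2}{n-1},$$

where $\sigma_X^2 = \sum_i X_i^2/n$. So

$$\mathrm{Var}[r_P] = \frac{1}{n-1} \tag{6.2}$$

under the permutation distribution (Hotelling and Pabst, 1936). This value was determined by Student (1908), after fitting a parametric model to empirical data, rounding parameter estimates to values more consistent with intuition, and calculating the moments of this empirical distribution. Higher-order moments were determined by David et al. (1951) using a method of proof similar to that presented above for the variance.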
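The permutation variance $1/(n-1)$ of the Pearson correlation can be checked by brute force for a small sample. The sketch below enumerates all $n!$ permutations of a made-up vector `y` (the data and the helper `all_perms` are illustrative assumptions, not from the text) and computes the mean and variance of the resulting correlations.

```r
# Hypothetical small data set, for illustration only
x <- c(1.2, 3.4, 2.2, 5.9, 4.1)
y <- c(0.7, 2.9, 3.1, 4.8, 2.0)
n <- length(x)

# Illustrative helper: all permutations of a vector, one per matrix row
all_perms <- function(v) {
  if (length(v) <= 1) return(matrix(v, nrow = 1))
  do.call(rbind, lapply(seq_along(v), function(i) {
    cbind(v[i], all_perms(v[-i]))
  }))
}

perms <- all_perms(y)                          # n! = 120 rows
rp <- apply(perms, 1, function(z) cor(x, z))   # correlation for each permutation

# Moments under the permutation distribution (each permutation equally likely)
perm_mean <- mean(rp)
perm_var <- mean((rp - perm_mean)^2)
print(c(mean = perm_mean, variance = perm_var, one_over_n_minus_1 = 1 / (n - 1)))
```

Because every permutation is enumerated, the agreement with $1/(n-1)$ is exact up to rounding, not a Monte Carlo approximation.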

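This result suggests a test of the null hypothesis of independence versus the two-sided alternative at level $\alpha$ using $r_P$, that rejects the null hypothesis if

$$|r_P| > z_{\alpha/2}/\sqrt{n-1}, \tag{6.3}$$

where $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard Gaussian distribution.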

Nonparametric Correlation

The Pearson correlation $r_P$ is designed to measure linear association between two variables, and fails to adequately reflect non-linear association. Furthermore, a family of distributional results, not recounted in this volume, depends on the data being multivariate Gaussian. Various nonparametric alternatives to the Pearson correlation have been developed to avoid these drawbacks.

Rank Correlation

Instead of calculating the correlation for the original variables, calculate the correlation of the ranks of the variables (Spearman, 1904). Under the null hypothesis, each $X$ rank should be equally likely to be associated with each $Y$ rank. Under the alternative, extreme ranks should be associated with each other. Let $R_j$ be the rank of the $Y$ value associated with $X_{(j)}$, the $j$th smallest of the $X$ values. Define the Spearman rank correlation as

$$r_S = \frac{\sum_{j=1}^n (j - (n+1)/2)(R_j - (n+1)/2)}{\sqrt{\sum_{j=1}^n (j - (n+1)/2)^2 \sum_{j=1}^n (R_j - (n+1)/2)^2}};$$

that is, $r_S$ is the Pearson correlation on ranks. Exact agreement for ordering of ranks results in a rank correlation of 1, exact agreement in the opposite direction results in a rank correlation of $-1$, and the Cauchy-Schwarz theorem indicates that these two values are the extreme possible values for the rank correlation.
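The following sketch illustrates the difference between the two measures on a monotone but nonlinear relationship. The data are invented for illustration; the point is only that the Pearson correlation applied to ranks matches `cor(..., method = "spearman")` and reaches 1 when the orderings agree exactly, while the Pearson correlation on the raw values does not.

```r
# Hypothetical data: y is a monotone but nonlinear function of x
x <- c(0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0)
y <- exp(x)

rp <- cor(x, y)                        # Pearson on raw values: below 1
rs_ranks <- cor(rank(x), rank(y))      # Pearson correlation applied to ranks
rs_builtin <- cor(x, y, method = "spearman")

print(c(pearson = rp, spearman_from_ranks = rs_ranks, spearman_builtin = rs_builtin))
```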

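The sums of squares in the denominator have the same value for every data set, and the numerator can be simplified. Note that

$$\sum_{j=1}^n (j - (n+1)/2)^2 = \frac{n(n+1)(2n+1)}{6} - \frac{n(n+1)^2}{4} = \frac{n(n^2-1)}{12}.$$

Similarly, $\sum_{j=1}^n (R_j - (n+1)/2)^2 = n(n^2-1)/12$. Furthermore, $\sum_{j=1}^n (j - (n+1)/2)(n+1)/2 = 0$, and

$$r_S = \frac{12}{n(n^2-1)} \Big( \sum_{j=1}^n j R_j - n(n+1)^2/4 \Big). \tag{6.4}$$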

Hoeffding (1948) provides a central limit theorem for the permutation distribution for both $r_P$ and $r_S$, including under alternative distributions. P-values may be approximated by multiplying the observed correlations by $\sqrt{n-1}$ (equivalently, dividing by the null standard deviation $1/\sqrt{n-1}$), and comparing to a standard Gaussian distribution, but this approximation has poor relative behavior for small test levels. Best and Roberts (1975) correct the Gaussian approximation using an Edgeworth approximation (Kolassa, 2006); that is, they determine constants $\kappa_1$, $\kappa_2$, $\kappa_3$, and $\kappa_4$ such that

with the maximum error in equation (6.5) bounded by a constant divided by $n^{3/2}$. Here $h_j(r)$ are known polynomials called Hermite polynomials, and constants $\kappa_j$ are related to the moments of $r_S$. When $\kappa_3$ is calculated for a symmetric distribution, its value is zero, and approximation (6.5) effectively contains only its first and last terms. This is the case for many applications of (6.5) to rank-based statistics, including the application to $r_S$.
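For comparison, the uncorrected Gaussian approximation described above can be sketched in a few lines. The helper name `gaussian_pvalue` and the inputs are illustrative assumptions; the code applies only the first-order approximation with null variance $1/(n-1)$ and does not implement the Edgeworth correction of Best and Roberts (1975).

```r
# Hypothetical helper: one-sided Gaussian approximation to the permutation
# p-value of a correlation r (Pearson or Spearman) computed from n pairs,
# using the null variance 1 / (n - 1). No Edgeworth correction is applied.
gaussian_pvalue <- function(r, n) {
  z <- r * sqrt(n - 1)           # standardize by the null standard deviation
  pnorm(z, lower.tail = FALSE)   # upper-tail probability
}

# Illustrative inputs only (not from the text)
print(gaussian_pvalue(r = 0.6, n = 20))
```

A two-sided approximation would double the smaller tail probability. As noted above, relative accuracy degrades for very small test levels.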

Example 6.3.1 Consider again the twin brain data of Example 5.2.1. As before, the data set brainpairs has 10 records, reflecting the results from 10 pairs of twins, and is plotted in Figure 6.1 via

attach(brainpairs)
plot(v1, v2, xlab="Volume for Twin 1", ylab="Volume for Twin 2", main="Twin Brain Volumes")

Ranks for twin brains are given in Table 6.1. The sum in the second factor of (6.4) is

$$1 \times 1 + 4 \times 2 + 9 \times 9 + 5 \times 5 + 3 \times 4 + 6 \times 7 + 7 \times 3 + 10 \times 10 + 2 \times 6 + 8 \times 8 = 366,$$

and so the entire second factor is $366 - 10 \times 11^2/4 = 63.5$, and the Spearman correlation is $(12/(10 \times 99)) \times 63.5 = 0.770$. Observed correlations may be calculated in R using

cat(' Permutation test for twin brain data ')
attach(brainpairs)
obsd<-c(cor(v1,v2),cor(v1,v2,method="spearman"))

to obtain the Pearson correlation (6.1) 0.914 and the Spearman correlation (6.4) 0.770. Testing the null hypothesis of no association may be performed using a permutation test with either of these measures.

out<-array(NA,c(2,20001))
dimnames(out)[[1]]<-c("Pearson","Spearman")
for(j in seq(dim(out)[2])){
   newv1<-sample(v1)
   out[,j]<-c(cor(newv1,v2),cor(newv1,v2,method="spearman"))
}
cat(" Monte Carlo One-Sided p value ")
apply(apply(out,2,">=",obsd),1,"mean")

to obtain p-values $1.5 \times 10^{-4}$ and $6.1 \times 10^{-3}$. The asymptotic critical value, from (6.3), is given by

cat(" Asymptotic Critical Value ")

-qnorm(0.025)/sqrt(length(v1)-1)

which gives 0.6533. Permutation tests based on either correlation method reject the null hypothesis.

c(cor.test(v1,v2)$p.value,
   cor.test(v1,v2,method="spearman")$p.value)
detach(brainpairs)

giving p-value approximations $2.1 \times 10^{-4}$ and $1.37 \times 10^{-2}$ respectively.

FIGURE 6.1: Twin Brain Volumes

TABLE 6.1: Twin brain volume ranks

Pair             1   2   3   4   5   6   7   8   9  10
Rank of First    1   4   9   5   3   6   7  10   2   8
Rank of Second   1   2   9   5   4   7   3  10   6   8
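The hand calculation in Example 6.3.1 can be checked directly from the ranks in Table 6.1. This sketch uses the simplified form $r_S = 12\,(\sum_j j R_j - n(n+1)^2/4)/(n(n^2-1))$; the variable names are illustrative, and the ranks are transcribed from the table, not read from the brainpairs data set.

```r
# Ranks transcribed from Table 6.1
rank_first  <- c(1, 4, 9, 5, 3, 6, 7, 10, 2, 8)
rank_second <- c(1, 2, 9, 5, 4, 7, 3, 10, 6, 8)
n <- length(rank_first)

cross_sum <- sum(rank_first * rank_second)        # sum of products of paired ranks
second_factor <- cross_sum - n * (n + 1)^2 / 4    # centered cross-product sum
rs <- 12 / (n * (n^2 - 1)) * second_factor        # simplified Spearman formula

print(c(cross_sum = cross_sum, second_factor = second_factor, rs = rs,
        builtin = cor(rank_first, rank_second, method = "spearman")))
```

The sum of products does not depend on the order in which the pairs are listed, so the ranks need not be sorted by the first variable before applying the formula.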

Note that

Some algebra shows that

This relationship will be exploited to give probabilities for $r_S$ in §6.4.1.3.

Pearson (1907) criticizes the Spearman correlation, in part on the grounds that, as it is designed to reflect association even when the association is nonlinear, it also loses the interpretation as a regression parameter, even when the underlying association is linear.

Alternative Expectation of the Spearman Correlation

Under the null hypothesis of independence between $X_j$ and $Y_j$ for all $j$, $E[r_S] = 0$, and the variance is given by (6.2). Under the alternative hypothesis,

for

Kendall's $\tau$

Pairs of bivariate observations for which the $X$ and $Y$ values are in the same order are called concordant; (6.7) refers to the probability that a pair is concordant. Pairs that are not concordant are called discordant. Kendall (1938) constructs a new measure based on counts of concordant and discordant pairs. Consider the population quantity $\tau = 2p_1 - 1$, for $p_1$ of (6.7), called Kendall's $\tau$; to emphasize the parallelism between this measure and other measures of association between two random variables, it will also be referred to below as Kendall's correlation measure. This quantity reflects the probability of a concordant pair minus the probability of a discordant pair. Denote the number of concordant pairs by

$$U = \sum_{i<j} Z_{ij},$$

for $Z_{ij} = I((X_j - X_i)(Y_j - Y_i) > 0)$. Then $U^\dagger = n(n-1)/2 - U$ is the number of discordant pairs; this number equals the number of rearrangements necessary to make all pairs concordant. Estimate $\tau$ by the excess of concordant over discordant pairs, divided by the maximum:

$$r_\tau = \frac{U - U^\dagger}{n(n-1)/2} = \frac{4U}{n(n-1)} - 1. \tag{6.8}$$

Note that $E[U] = n(n-1)p_1/2$, for $p_1$ of (6.7), and $E[r_\tau] = 2p_1 - 1$. The null value of $p_1$ is one half, recovering the null expectation $E[r_\tau] = 0$.

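As with the Spearman correlation $r_S$, Kendall's correlation $r_\tau$ may be used to test the null hypothesis of independence, relying on its asymptotic Gaussian distribution, but the test requires the variance of $r_\tau$. Note that

$$\mathrm{Var}[U] = \sum_{i<j} \mathrm{Var}[Z_{ij}] + {\sum}^{*}\, \mathrm{Cov}[Z_{ij}, Z_{kl}].$$

Here $\sum^{*}$ is the sum over pairs of index pairs $i < j$, $k < l$ involving exactly three distinct indices; covariances for index pairs with no index in common are zero by independence. This sum consists of $n^2(n-1)^2/4 - n(n-1)/2 - n(n-1)(n-2)(n-3)/4 = n(n-1)(n-2)$ terms. Hence

$$\mathrm{Var}[U] = n(n-1)p_1(1-p_1)/2 + n(n-1)(n-2)(p_3 - p_1^2),$$

where

$$p_3 = P[(X_1 - X_2)(Y_1 - Y_2) > 0 \text{ and } (X_1 - X_3)(Y_1 - Y_3) > 0].$$

The null value of $p_3$ is $5/18$, as can be seen by examining all 36 pairs of permutations of $\{1, 2, 3\}$. Hence, under the null hypothesis,

$$\mathrm{Var}[U] = \frac{n(n-1)}{8} + \frac{n(n-1)(n-2)}{36} = \frac{n(n-1)(2n+5)}{72},$$

and

$$\mathrm{Var}[r_\tau] = \frac{2(2n+5)}{9n(n-1)}.$$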

The result of Hoeffding (1948) also proves that $r_\tau$ is approximately Gaussian, including under alternative distributions. El Maache and Lepage (2003) discuss the multivariate distribution of both $r_\tau$ and $r_S$ from collections of variables.
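A minimal sketch of the resulting test, using the ranks of Table 6.1, counts concordant pairs directly and standardizes Kendall's correlation by its null standard deviation. The null variance used here, $2(2n+5)/(9n(n-1))$, is the standard result for data without ties; the variable names are illustrative, and the Gaussian p-value is only an approximation for a sample this small.

```r
# Ranks transcribed from Table 6.1
rank_first  <- c(1, 4, 9, 5, 3, 6, 7, 10, 2, 8)
rank_second <- c(1, 2, 9, 5, 4, 7, 3, 10, 6, 8)
n <- length(rank_first)

# Indicator of concordance for every pair i < j
pairs <- combn(n, 2)
concordant <- (rank_first[pairs[2, ]] - rank_first[pairs[1, ]]) *
  (rank_second[pairs[2, ]] - rank_second[pairs[1, ]]) > 0
U <- sum(concordant)                    # number of concordant pairs
r_tau <- 4 * U / (n * (n - 1)) - 1      # excess of concordant over discordant pairs

# Null variance (no ties) and two-sided Gaussian approximation
null_var <- 2 * (2 * n + 5) / (9 * n * (n - 1))
z <- r_tau / sqrt(null_var)
p_two_sided <- 2 * pnorm(-abs(z))

print(c(U = U, r_tau = r_tau, null_var = null_var, z = z, p = p_two_sided))
```

The value of `r_tau` agrees with `cor(rank_first, rank_second, method = "kendall")` because the ranks contain no ties.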

Example 6.3.2 Consider again the twin brain data of Example 5.2.1, with ranks in Table 6.1. Discordant pairs are 2-5, 2-9, 4-7, 4-9, 5-7, 5-9, 6-7, and 7-9. Hence 8 of 45 pairs are discordant, and the remainder, 37, are concordant. Hence $r_\tau = 4 \times 37/90 - 1 = 0.644$, from (6.8). This may be calculated using