


Multivariate Analysis
Suppose that one observes $n$ subjects, indexed by $i \in \{1, \ldots, n\}$, and, for subject $i$, observes responses $X_{ij}$, indexed by $j \in \{1, \ldots, J\}$. Potentially, covariates are also observed for these subjects. This chapter explores explaining the multivariate distribution of $X_{ij}$ in terms of these covariates. Most simply, these covariates often indicate group membership.

Standard Parametric Approaches

When data vectors may be treated as approximately multivariate Gaussian, the following standard techniques may be applied.

Multivariate Estimation

Often one wants to learn about the vector of population central values for each of the $J$ responses on the various subjects. In this section, assume that the vectors are independent and identically distributed. Standard parametric analyses presuppose distributions of data well-enough behaved that location can be well estimated using a sample mean. Denote the mean by the vector $\bar X$, where component $j$ of this vector is given by $\bar X_j = \sum_{i=1}^n X_{ij}/n$. Then $\bar X$ is the method of moments estimator for $\mu = E[X_i]$, where $X_i = (X_{i1}, \ldots, X_{iJ})$. Assumptions guaranteeing that $\bar X$ has an asymptotically Gaussian distribution generally include the existence of some moment of the distribution greater than the second moment.

One-Sample Testing

In this section, consider testing the null hypothesis that the vector $\mu$ of expectations takes on some value specified in advance; without loss of generality, take this value to be 0. Still assuming that the observations have a multivariate Gaussian distribution, $\bar X$ is approximately multivariate Gaussian. First consider the case in which one knows the variance matrix $\Sigma = \mathrm{Var}[(X_{i1}, \ldots, X_{iJ})]$, and assume that $\Sigma$ is nonsingular. Then one can use as a test statistic

$$n \bar X^\top \Sigma^{-1} \bar X, \tag{7.1}$$
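and its distribution under the null hypothesis is $\chi^2_J$. If instead $\Sigma$ is unknown, and if one can estimate it as $\hat\Sigma$ using the usual sum of squares, then, still under the multivariate Gaussian assumption, the statistic

$$T^2 = n \bar X^\top \hat\Sigma^{-1} \bar X \tag{7.2}$$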
has, after multiplication by $(n-J)/\{J(n-1)\}$ when $\hat\Sigma$ is computed with divisor $n-1$, an $F$ distribution with $J$ numerator and $n-J$ denominator degrees of freedom (Hotelling, 1931). If the distribution of $(X_{i1}, \ldots, X_{iJ})$ has a density and a nonsingular variance matrix, then $P[\hat\Sigma \text{ singular}] = 0$. If $\Sigma$ is unknown, and is best estimated by a nonsingular $\hat\Sigma$ other than the sum of squares estimator, then generally (7.2) is approximately $\chi^2_J$.

These techniques require that $\bar X$ be approximately multivariate Gaussian. This assumption is stronger than the assumption that each margin is univariate Gaussian; a simulated example is given in Figure 7.1.

Two-Sample Testing

Suppose that observations may be divided into two groups of sizes $M_1$ and $M_2$, with the group for observation $i$ indicated by $g_i \in \{1, 2\}$. Test the null hypothesis that the multivariate distributions in the two groups are identical; note that this implies identical variance matrices. Let $\bar X_k$ be the vector of sample means for observations in group $k$, with components $\bar X_{kj} = \sum_{i: g_i = k} X_{ij}/M_k$. Let $\hat\Sigma_{k,j,j'}$ be the sample covariance for the group $k$ values between responses $j$ and $j'$:

$$\hat\Sigma_{k,j,j'} = \sum_{i: g_i = k} (X_{ij} - \bar X_{kj})(X_{ij'} - \bar X_{kj'})/(M_k - 1).$$

Let $\hat\Sigma_{j,j'}$ be the pooled sample covariance for all observations:

$$\hat\Sigma_{j,j'} = \{(M_1 - 1)\hat\Sigma_{1,j,j'} + (M_2 - 1)\hat\Sigma_{2,j,j'}\}/(M_1 + M_2 - 2).$$

Then the Hotelling two-sample statistic

$$T^2 = \frac{M_1 M_2}{M_1 + M_2} (\bar X_1 - \bar X_2)^\top \hat\Sigma^{-1} (\bar X_1 - \bar X_2) \tag{7.3}$$
measures the difference between sample mean vectors, in a way that accounts for sample variance, and combines the response variables. Furthermore, under the null hypothesis of equality of distribution, and assuming that this distribution is multivariate Gaussian, the rescaled statistic $(M_1 + M_2 - J - 1) T^2/\{J (M_1 + M_2 - 2)\}$, for $T^2$ the statistic (7.3), has an $F$ distribution with $J$ numerator and $M_1 + M_2 - J - 1$ denominator degrees of freedom.

Nonparametric Multivariate Estimation

In the absence of such parametric assumptions, one might instead measure location using the multivariate median.

FIGURE 7.1: Univariate Normal Data That are Not Bivariate Normal

Define the multivariate population median $\nu$ to be the vector of univariate medians, as defined in §2.3.1. An estimator $\mathrm{smed}[X_1, \ldots, X_n]$ of the population multivariate median may be constructed as the vector whose components are the separate marginal sample medians; that is, $\mathrm{smed}[X_1, \ldots, X_n] = (\mathrm{smed}[X_{11}, \ldots, X_{n1}], \ldots, \mathrm{smed}[X_{1J}, \ldots, X_{nJ}])$. Alternatively, one might define $\mathrm{smed}[X_1, \ldots, X_n]$ so as to minimize the sum of distances from the median:

$$\mathrm{smed}[X_1, \ldots, X_n] = \operatorname{argmin}_{\eta} \sum_{i=1}^n \sum_{j=1}^J |X_{ij} - \eta_j|; \tag{7.4}$$
that is, the estimate minimizes the sum of distances from data vectors to the parameter vector, with distance measured by the sum of absolute values of componentwise differences. Because one can interchange the order of summation in (7.4), the minimizer in (7.4) is the vector of componentwise minimizers. Furthermore, the minimizer for each component is the traditional univariate median as above. A summary of multivariate median concepts is given by Small (1990).

Equivariance Properties

In the univariate case (that is, $J = 1$), both the mean and the median are equivariant with respect to affine transformations of the raw data, as seen in §2.1.1 and §2.3.1. Equivariance to affine transformations in the multivariate case holds for the mean: for a vector $a$ and a matrix $B$ with $J$ columns, and for $Y_i = a + B X_i$ for all $i$, one has $\bar Y = a + B \bar X$. A similar equality fails to hold in general for $\mathrm{smed}[X_1, \ldots, X_n]$ and $\mathrm{smed}[Y_1, \ldots, Y_n]$, unless $B$ is diagonal; hence the multivariate median is not equivariant under affine transformations.

Nonparametric One-Sample Testing Approaches

Consider a null hypothesis stating that the marginal median vector $\nu$ takes on a value specified in advance; without loss of generality, take this to be zero. In the multivariate Gaussian context, the statistic (7.2) represents the combination of separate location test statistics for the various components of the random vectors, and its distribution depends on multivariate normality of the underlying observations; an analogous statistic, combining the various dimensions of the data but not depending on parametric assumptions, is constructed in this section.

A nonparametric hypothesis test can be constructed by assembling componentwise nonparametric statistics into a vector $T$, analogous to $\bar X$, and centered so that $E_0[T] = 0$. One might combine sign test statistics, or signed-rank statistics if one assumes symmetry, often in the context of paired data. That is, either

$$T_j = \sum_{i=1}^n s(X_{ij}) \tag{7.5}$$
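for $s(u) = 1$ if $u > 0$, $s(u) = 0$ if $u = 0$, and $s(u) = -1$ if $u < 0$. Or, define $R_{ij}(X)$ to be the marginal rank of $|X_{ij}|$ among $\{|X_{1j}|, \ldots, |X_{nj}|\}$, and set

$$T_j = \sum_{i=1}^n R_{ij}(X)\, s(X_{ij}). \tag{7.6}$$

A multivariate test statistic is constructed as a vector of univariate statistics,

$$T = (T_1, \ldots, T_J)^\top.$$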
Then combine components of $T$ from (7.5) to give the multivariate sign test statistic, or from (7.6) to give the multivariate signed-rank test statistic. In either case, components are combined using

$$W = T^\top \Upsilon^{-1} T \tag{7.7}$$
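for $\Upsilon = \mathrm{Var}_0[T]$. As in §2.3, in the case that the null location value is 0, the null distribution for the multivariate test statistic is generated by assigning equal probabilities to all $2^n$ modifications of the data set by multiplying the rows $(X_{i1}, \ldots, X_{iJ})$ by $+1$ or $-1$. That is, the null hypothesis distribution of $T(X)$ is generated by evaluating $T$ at each of the $2^n$ elements, each given probability $2^{-n}$, of

$$\{\tilde X : \text{row } i \text{ of } \tilde X \text{ is } \delta_i (X_{i1}, \ldots, X_{iJ}),\ \delta_i \in \{-1, 1\},\ i = 1, \ldots, n\}. \tag{7.8}$$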
Test statistics (7.1) and (7.2) arose as quadratic forms of independent and identically distributed random vectors, and the variances included in their definitions were scaled accordingly. Statistic (7.3) is built using a more complicated variance; this pattern will repeat with nonparametric analogies to parametric tests.

Combining univariate tests into a quadratic form raises two difficulties. First, in previous applications of rank statistics, that is, in the case of univariate sign and signed-rank one-sample tests, in the case of two-sample Mann-Whitney-Wilcoxon tests, and in the case of Kruskal-Wallis testing, all dependence of the permutation distribution on the original data was removed through ranking. This is not the case for $T$, since this distribution involves correlations between ranks of the various response vectors. These correlations are not specified by the null hypothesis. The separate tests are generally dependent, and the dependence structure depends on the distribution of the raw observations. The asymptotic distribution of (7.7) relies on this dependence via the correlations between components of $T$. The correlations must be estimated.

Second, the distribution of $W$ of (7.7) under the null hypothesis is dependent on the coordinate system for the variables, but, intuitively, this dependence on the coordinate system might be undesirable. For example, suppose that $(X_{i1}, X_{i2})$ has an approximate multivariate Gaussian distribution, with expectation $\mu$ and variance $\Sigma$, with $\Sigma$ known. Consider the null hypothesis $H_0: \mu = 0$. Then the canonical test is (7.1), and it is unchanged if the test is based on $(U_i, V_i)$ for $U_i = X_{i1} + X_{i2}$ and $V_i = X_{i1} - X_{i2}$, with $\Sigma$ modified accordingly. Hence the parametric analysis is independent of the coordinate system.

The first of these difficulties is readily addressed. Under $H_0$, the marginal sign test statistic (7.5) satisfies the approximation that $T_j/\sqrt{n}$ is standard Gaussian.
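The coordinate-system example above can be checked numerically. The following is a minimal sketch in base R, not the author's code: the simulated data, the identity variance matrix, and the names `quad_form`, `b`, `x`, and `y` are all assumptions made for illustration. It evaluates the known-variance quadratic form of (7.1) in the original coordinates and in the sum-and-difference coordinates, and prints the componentwise sign statistics of (7.5) in both systems for comparison.

```r
# Hypothetical simulated data; not the blood pressure data used in the text.
set.seed(1)
n <- 30
x <- cbind(x1 = rnorm(n, mean = 0.3), x2 = rnorm(n, mean = 0.1))
sigma <- diag(2)                      # variance matrix, treated as known

# Known-variance quadratic form n * xbar' V^{-1} xbar, as in (7.1).
quad_form <- function(dat, v) {
  xbar <- colMeans(dat)
  nrow(dat) * drop(t(xbar) %*% solve(v) %*% xbar)
}

# Change coordinates: U = X1 + X2, V = X1 - X2.
b <- rbind(c(1, 1), c(1, -1))
y <- x %*% t(b)                       # row i of y is B times row i of x
sigma_y <- b %*% sigma %*% t(b)       # variance modified accordingly

stat_x <- quad_form(x, sigma)
stat_y <- quad_form(y, sigma_y)
print(c(original = stat_x, transformed = stat_y))

# Componentwise sign statistics, as in (7.5), in each coordinate system.
sign_x <- colSums(sign(x))
sign_y <- colSums(sign(y))
print(rbind(original = sign_x, transformed = sign_y))
```

The two quadratic forms agree up to rounding, because the factors of `b` cancel algebraically. No such cancellation applies to the sign statistics, since the sign of a sum is not determined by the signs of its terms.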
Conditional on the relative ranks of the absolute values of the observations, the permutation distribution is entirely specified, and conditional joint moments are calculated. Under the permutation distribution, for the sign statistic (7.5),

$$E[T_j] = 0, \qquad \mathrm{Cov}[T_j, T_{j'}] = \sum_{i=1}^n s(X_{ij})\, s(X_{ij'}), \tag{7.9}$$

and so the variance estimate used in (7.7) has components $\hat\upsilon_{jj'}$ given by the covariances in (7.9).

Using the data to redefine the coordinate system may be used to address the second problem (Randles, 1989; Oja and Randles, 2004).

Combine the components of $T(X)$ to construct the statistic

$$W = T^\top \hat\Upsilon^{-1} T, \tag{7.10}$$
using an estimate $\hat\Upsilon$ of $\Upsilon = \mathrm{Var}[T]$ as in (7.9), or similarly for the signed-rank statistic. The multivariate central limit theorem of Hajek (1960), and the quality of the approximation to $\Upsilon$, justify approximating the null distribution of $W$ by the $\chi^2_J$ distribution. The test rejects the null hypothesis of zero componentwise medians when $W \geq G_J^{-1}(1 - \alpha, 0)$, for $G_J^{-1}(1 - \alpha, 0)$ the $1 - \alpha$ quantile of the $\chi^2_J$ distribution, with noncentrality parameter 0. Bickel (1965) discusses these (and other tests) in generality.

Example 7.3.1 Consider the data of Example 6.42. We test the null hypothesis that the joint distribution of systolic and diastolic blood pressure changes is symmetric about (0,0), using Hotelling's $T^2$ and the two asymptotic tests that substitute signs and signed ranks for data. These tests are performed in R using

    # For Hotelling and multivariate rank tests resp:
    library(Hotelling); library(ICSNP)
    # bp holds the blood pressure changes from Example 6.42.
    cat('\n One-sample Hotelling Test\n')
    HotellingsT2(bp[,c("spd","dpd")])
    cat('\n Multivariate Sign Test\n')
    rank.ctest(bp[,c("spd","dpd")],scores="sign")
    # The default scores for rank.ctest give the signed-rank test.
    cat('\n Multivariate Signed Rank Test\n')
    rank.ctest(bp[,c("spd","dpd")])

P-values for Hotelling's $T^2$, the marginal signed-rank test, and the marginal sign test are $9.839 \times 10^{-6}$, $2.973 \times 10^{-3}$, and $5.531 \times 10^{-4}$.

Tables 7.1 and 7.2 contain attained levels and powers for one-sample multivariate tests with two manifest variables of nominal level 0.05, from various distributions.

TABLE 7.1: Level of multivariate tests
TABLE 7.2: Power of multivariate tests
Tests compared are Hotelling's $T^2$ test, and test (7.10) applied to the sign and signed-rank statistics. Tests have close to their nominal levels, except for Hotelling's test with the Cauchy distribution; furthermore, the agreement is closer for sample size 40 than for sample size 20. The sign test power is close to that of Hotelling's test for Gaussian variables, and the signed-rank test has attenuated power. Both nonparametric tests have good power for the Cauchy distribution, although Hotelling's test performs poorly, and both perform better than Hotelling's test for Laplace variables.

A few of the data sets simulated to create Tables 7.1 and 7.2 are ones for which $\Upsilon$ is estimated as singular. Care must be taken to avoid difficulties; in such cases, p-values are set to 1.

More General Permutation Solutions

One might address this problem using permutation testing. First, select an existing parametric test statistic $U(X)$, perhaps a Hotelling statistic, or a rank-based statistic. Under the permutation null distribution, the sampling distribution puts equal weight $2^{-n}$ on all $2^n$ values of the statistic evaluated at each element of (7.8); these $2^n$ values need not all be unique. For $n$ large enough to make exhaustive evaluation prohibitive, a random subset of elements of (7.8) may be selected. The p-value is reported as the proportion of data sets with permuted signs having a test statistic value as large as, or larger than, that observed. In this way, the analysis of the previous subsection for the sign test, and by extension the signed-rank test, can be extended to general rank tests, including tests with data as scores.
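The sign-flipping procedure just described can be sketched in base R as follows. This is an illustration under stated assumptions, not the implementation behind the tables or the ICSNP functions: the data are simulated, and the names `sign_stat` and `perm_pvalue`, the number of random sign patterns `nperm`, and the choice to count the observed data set among the permuted ones are all hypothetical choices. The statistic used is the quadratic form (7.10) built from sign statistics, with the variance estimated by sums of products of signs.

```r
# Hypothetical simulated paired differences; sizes and means are assumptions.
set.seed(2)
n <- 25
x <- cbind(rnorm(n, mean = 0.5), rnorm(n, mean = 0.3))

# Quadratic form in componentwise sign statistics, as in (7.10).
sign_stat <- function(dat) {
  s <- sign(dat)
  tvec <- colSums(s)                  # componentwise sign statistics
  ups <- crossprod(s)                 # sums of products of signs
  drop(t(tvec) %*% solve(ups) %*% tvec)
}

# Permutation p-value from a random subset of the 2^n sign patterns.
perm_pvalue <- function(dat, stat, nperm = 2000) {
  observed <- stat(dat)
  permuted <- replicate(nperm, {
    flips <- sample(c(-1, 1), nrow(dat), replace = TRUE)
    stat(dat * flips)                 # multiplies row i of dat by flips[i]
  })
  # The observed data set is counted as one of the sign patterns.
  mean(c(permuted, observed) >= observed)
}

p_perm <- perm_pvalue(x, sign_stat)
p_chisq <- pchisq(sign_stat(x), df = ncol(x), lower.tail = FALSE)
print(c(permutation = p_perm, chi_square = p_chisq))
```

Any other statistic taking a data matrix, such as a Hotelling statistic, could be passed as `stat`. The sketch does not guard against a singular variance estimate, which the text handles by setting the p-value to 1.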