
# Missingness Mechanisms

As pointed out earlier, the mechanism that generates missing values may be related to the study drug or it may be a random phenomenon. In the literature, three types of missingness mechanisms are frequently referenced, depending on whether the mechanism is associated with the observed or unobserved outcomes and other background variables (see, e.g., Diggle and Kenward 1994). Some of the ideas discussed below are especially germane to the special case when the design of a study involves taking repeated measurements on subjects over time.

When the missingness is independent of the subjects' responses and other attributes, it is referred to as missing completely at random (MCAR). In this case one may assume that the subjects with missing data are a random sample of all the subjects. A case in point is when a study participant is lost to follow-up because the subject changed location for reasons unrelated to the disease or treatment. On the other hand, if the missingness depends only on observed data, then it may be classified as missing at random (MAR). This arises, for example, when collected data suggest that a patient dropped out of a study because the drug either did not improve the patient's condition or turned out to be toxic. When this mechanism applies, it may be safe to assume that the unobserved or missing data follow the same distribution as the observed values in subjects who have complete information and share the same observed measurements. The MAR assumption implies that, after conditioning on observed variables, the missingness can be assumed to be MCAR. When the underlying assumptions can be justified, MCAR and MAR scenarios, sometimes referred to as “ignorable,” permit application of certain statistical models that yield valid results.

A more difficult but plausible situation is one in which the missingness depends on the unobserved response measurements or cannot be completely characterized by the observed information. Often referred to as non-ignorable, or missing not at random (MNAR), this case requires caution, and is generally addressed by sensitivity analyses after an MAR analysis.
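The three mechanisms can be summarized in symbols. The notation below is an assumption introduced here for illustration and does not come from the text: $R$ is the vector of missingness indicators, $Y_{\text{obs}}$ and $Y_{\text{mis}}$ are the observed and unobserved parts of the response, and $X$ denotes fully observed covariates.

$$
\begin{aligned}
\text{MCAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}, X) = P(R) \\
\text{MAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}, X) = P(R \mid Y_{\text{obs}}, X) \\
\text{MNAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}, X) \ \text{depends on } Y_{\text{mis}}
\end{aligned}
$$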

There is no formal approach to establish the mechanism by which missing values are generated. In particular, MCAR and MAR are generally untestable, and MNAR is purely speculative. However, as a best practice, one should perform exploratory data analysis to understand the pattern and nature of the missing values. Some simple techniques include summarizing the frequency of and reasons for missing values by study drug and over time, and comparing outcome measures and other important factors, such as demographics, between patients with complete data and those with incomplete observations. To identify any potential association between observed variables and the missingness mechanism, one may also fit a suitable model, such as penalized logistic regression, with the missingness indicator as the dependent variable. Potential predictors may include safety variables, baseline characteristics, and earlier responses.
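The missingness-indicator model described above might be sketched as follows. This is a minimal illustration, not a recommended implementation: the ridge (L2) penalty, the gradient-descent fitting routine, the function name `fit_penalized_logistic`, and the simulated data are all assumptions made for the example.

```python
import math
import random


def fit_penalized_logistic(X, y, penalty=0.01, lr=0.5, n_iter=500):
    """Fit an L2-penalized logistic regression by gradient descent.

    X: list of predictor rows; y: list of 0/1 labels (1 = value is missing).
    Returns [intercept, coef_1, ..., coef_p]; the intercept is not penalized.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for row, label in zip(X, y):
            z = beta[0] + sum(b * v for b, v in zip(beta[1:], row))
            prob = 1.0 / (1.0 + math.exp(-z))
            err = prob - label
            grad[0] += err
            for j, v in enumerate(row):
                grad[j + 1] += err * v
        for j in range(p + 1):
            g = grad[j] / n
            if j > 0:
                g += penalty * beta[j]  # ridge penalty on slopes only
            beta[j] -= lr * g
    return beta


# Simulated (hypothetical) data: missingness depends on the earlier response
# but not on the baseline characteristic.
rng = random.Random(0)
X, missing = [], []
for _ in range(400):
    baseline = rng.gauss(0.0, 1.0)
    earlier_response = rng.gauss(0.0, 1.0)
    p_missing = 1.0 / (1.0 + math.exp(-(-1.0 + 1.5 * earlier_response)))
    X.append([baseline, earlier_response])
    missing.append(1 if rng.random() < p_missing else 0)

beta = fit_penalized_logistic(X, missing)
print("intercept, baseline, earlier response:", [round(b, 2) for b in beta])
```

A clearly nonzero coefficient on an observed predictor argues against MCAR, but it cannot distinguish MAR from MNAR, since the unobserved values never enter the model.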

# Approaches for Missing Data

There are alternative approaches for handling missing data during the analysis phase. The choice of a primary method should, however, be made a priori, considering the design of the study, outcome measures, and current regulatory requirements. In general, methods based on MCAR or MNAR assumptions may not be defensible for use in primary analysis. However, the latter may be used in sensitivity analysis concerning the robustness of MAR-based methods, which are commonly implemented for primary analyses.

For the reasons discussed earlier, complete case analysis cannot be justified for the primary analysis, especially in confirmatory trials. It may, however, be considered in early phases of drug development for exploratory purposes or as supportive analysis to confirm the sensitivity of conclusions drawn based on other approaches.

A simple method of handling missing data is the so-called hot-deck imputation, which involves replacing a missing value with a suitable observed value obtained from a matched group of study subjects. Matching may be accomplished using predefined variables and score functions, such as propensity scores (Rosenbaum and Rubin 1983) and the Mahalanobis distance. Since this approach assumes MAR, conditional on the matching variables, the impact of any unobserved variables on the robustness of the results cannot be fully assessed. A high-level overview of the approach may be found, for example, in Andridge and Little (2010).
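A nearest-neighbor variant of hot-deck imputation can be sketched as below. The squared Euclidean distance on raw matching variables is a simplification chosen for brevity; in practice a propensity score or the Mahalanobis distance would take its place. The function name, field names, and records are hypothetical.

```python
def hot_deck_impute(records, match_keys, target):
    """Fill each missing `target` with the value from the closest donor.

    A donor is any record with an observed `target`. Closeness is squared
    Euclidean distance on `match_keys` (a stand-in for a propensity score
    or Mahalanobis distance).
    """
    donors = [r for r in records if r[target] is not None]
    imputed = []
    for rec in records:
        if rec[target] is not None:
            imputed.append(rec[target])
            continue
        nearest = min(
            donors,
            key=lambda d: sum((d[k] - rec[k]) ** 2 for k in match_keys),
        )
        imputed.append(nearest[target])
    return imputed


# Hypothetical records: two donors and two subjects with a missing outcome.
records = [
    {"age": 50, "baseline": 10.0, "outcome": 7.0},
    {"age": 62, "baseline": 14.0, "outcome": 11.0},
    {"age": 51, "baseline": 10.5, "outcome": None},
    {"age": 60, "baseline": 13.0, "outcome": None},
]
filled = hot_deck_impute(records, ["age", "baseline"], "outcome")
print(filled)
```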

With longitudinal data involving dropouts, one imputation approach is to carry forward a previously observed value. This approach was commonly used and accepted by regulatory authorities in the past, but less so currently because of likely biases in the estimation. Variations include the last observation carried-forward (LOCF), the best observation or baseline observation carried-forward (BOCF), and the worst observation carried-forward (WOCF) schemes. When the primary objective of the study is to estimate a treatment effect at the end of a fixed treatment duration, the LOCF approach depends on the assumption of constant disease status after the last observed data; therefore, it can only be applied in the unrealistic case of MCAR, and it may lead to bias in MAR or MNAR scenarios. However, if one is interested in estimating a “real-world” treatment effect and the dropout pattern is assumed to represent the real-world performance of the treatment, then the last observation yields a valid estimate of real-world performance. Thus, the clinical question being addressed relates to the emerging concept of a valid “estimand” (Section 2.3.5). BOCF, which uses the baseline observation as the final response, is often based on the assumption that a patient withdrew from the trial because of lack of benefit or because of treatment-emergent adverse events. When there are other reasons why patients might withdraw from the trial, the approach would not be reliable (Liu-Seifert et al. 2010).
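The carried-forward schemes are simple to state in code. The sketch below assumes one subject's visits are stored in time order, with the baseline first and `None` marking a missing value; the function names and scores are illustrative.

```python
def locf(visits):
    """Carry the last observed value forward over later missing visits."""
    filled, last = [], None
    for value in visits:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def bocf(visits):
    """Replace every missing value with the baseline value, visits[0]."""
    baseline = visits[0]
    return [baseline if value is None else value for value in visits]


# Hypothetical symptom scores: baseline, then four scheduled visits;
# the subject drops out after the second post-baseline visit.
scores = [20.0, 16.0, 13.0, None, None]
locf_scores = locf(scores)
bocf_scores = bocf(scores)
print(locf_scores)
print(bocf_scores)
```

For this subject, LOCF treats the improvement seen at dropout as maintained to the end of the study, whereas BOCF treats the subject as having returned to baseline.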

In general, single imputation methods are likely to lead to incorrect standard errors and, hence, incorrect inferential results, since the uncertainty associated with the imputed values is not fully accounted for when the imputed data set is analyzed as if it were complete. Therefore, it is customary to use alternative approaches, such as multiple imputation and likelihood-based inference.

Multiple imputation, first introduced by Rubin (1987), involves imputing each missing value many times, with a view to generating a between-imputation variance component. These data sets, consisting of the multiple-imputed values, are subsequently analyzed using appropriate procedures for complete data. The results from the different data sets are then combined. The approach results in valid hypothesis tests and confidence intervals, which are performed incorporating the uncertainty due to the imputed values.
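The combination step is usually carried out with the formulas known as Rubin's rules: the pooled estimate is the average of the per-data-set estimates, and the total variance is the average within-imputation variance plus the between-imputation variance inflated by a factor of 1 + 1/m. A minimal sketch follows; the function name and the numerical inputs are made up for illustration.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine m completed-data results with Rubin's rules.

    estimates: point estimates from the m imputed data sets.
    variances: their squared standard errors.
    Returns (pooled estimate, total variance, pooled standard error).
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)           # average within-imputation variance
    between = variance(estimates)      # between-imputation variance (m - 1 divisor)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total, math.sqrt(total)


# Hypothetical treatment-effect estimates from m = 3 imputed data sets.
pooled, total_var, pooled_se = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled, round(total_var, 4), round(pooled_se, 4))
```

The between-imputation term is what a single imputation omits, which is why single-imputation standard errors tend to be too small.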

Several methods are available for computing the imputed values in the above framework, depending on the variable types and missing-data pattern. In general, these imputation methods depend on an MAR assumption. In the case of continuous data with a monotone missing pattern, for example, Rubin (1987) proposes the use of a parametric regression method under multivariate normality or a nonparametric approach based on propensity scores (see, e.g., Lavori, Dawson, and Shera 1995). For a categorical variable with a monotone missing pattern, one may implement a logistic-regression model or the discriminant function method. With an arbitrary missing-data pattern, imputation may be performed using Markov chain Monte Carlo (MCMC), assuming multivariate normality (Schafer 1997). Other approaches include the fully conditional specification (FCS) method (van Buuren 2007), which specifies a conditional model for each variable and assumes that a joint distribution exists for all variables.

Alternatively, likelihood-based methods can be applied under MCAR or MAR assumptions, conditional on observed outcome measurements and baseline covariates. These approaches do not involve explicit creation of imputed values but instead involve implicit imputation of the missing values. One such approach is the expectation-maximization (EM) algorithm (Mallinckrodt 2003), which is an iterative process involving expectation and maximization steps. Informally, the algorithm consists of first estimating the parameters of the model on the basis of complete data; these estimates are in turn used to estimate the missing values. The process is repeated iteratively until convergence.
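As an illustration of the iteration, the sketch below applies EM to a bivariate normal pair in which the first variable is always observed and the second is sometimes missing. The model, the function name `em_bivariate_normal`, the fixed iteration count, and the simulated data are assumptions chosen to keep the example small; they are not taken from the cited reference.

```python
import random


def em_bivariate_normal(y1, y2, n_iter=200):
    """EM estimates for a bivariate normal when some y2 values are None.

    Returns (mu1, mu2, s11, s12, s22). y1 must be fully observed.
    """
    n = len(y1)
    observed = [b for b in y2 if b is not None]
    mu1 = sum(y1) / n
    s11 = sum((a - mu1) ** 2 for a in y1) / n
    # Starting values for y2 come from the complete cases.
    mu2 = sum(observed) / len(observed)
    s22 = sum((b - mu2) ** 2 for b in observed) / len(observed)
    s12 = 0.0
    for _ in range(n_iter):
        # E-step: expected sufficient statistics given current parameters.
        slope = s12 / s11
        resid_var = s22 - slope * s12
        sum2 = sum22 = sum12 = 0.0
        for a, b in zip(y1, y2):
            if b is None:
                e2 = mu2 + slope * (a - mu1)   # E[y2 | y1]
                e22 = e2 * e2 + resid_var      # E[y2^2 | y1]
            else:
                e2, e22 = b, b * b
            sum2 += e2
            sum22 += e22
            sum12 += a * e2
        # M-step: update the parameters from the expected statistics.
        mu2 = sum2 / n
        s22 = sum22 / n - mu2 * mu2
        s12 = sum12 / n - mu1 * mu2
    return mu1, mu2, s11, s12, s22


# Simulated MAR data: y2 is missing whenever the observed y1 exceeds 0.5.
rng = random.Random(1)
y1 = [rng.gauss(0.0, 1.0) for _ in range(2000)]
y2_full = [0.8 * a + rng.gauss(0.0, 0.6) for a in y1]   # true mean of y2 is 0
y2 = [None if a > 0.5 else b for a, b in zip(y1, y2_full)]

complete_cases = [b for b in y2 if b is not None]
mu2_cc = sum(complete_cases) / len(complete_cases)
mu2_em = em_bivariate_normal(y1, y2)[1]
print("complete-case mean:", round(mu2_cc, 3), "EM mean:", round(mu2_em, 3))
```

Because dropout here depends only on the observed `y1`, the EM estimate of the `y2` mean should land near the true value of 0, while the complete-case mean is pulled downward.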

For longitudinal data, in which observations are taken repeatedly over time and MAR assumptions are justifiable, several models are available that perform reasonably well relative to simple imputation methods. When the outcome variable is continuous, mixed-effect models for repeated measures (MMRM) can be used, with careful specification of the covariance matrix structure for the error term. For categorical responses and count data, generalized linear mixed models (GLMM) have been proposed. The generalized estimating equations (GEE) approach is often used for longitudinal response data, but the method gives unbiased estimators only if the missing-data mechanism depends solely on the covariates included in the model (Fitzmaurice et al. 2000). Extensions of the approach are available, including the work by Robins et al. (1995) and Preisser et al. (2002), who proposed weighting schemes for GEE models that exhibit desirable performance under MAR assumptions.
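The idea behind such weighting schemes can be illustrated with a much simpler estimator: each observed response is weighted by the inverse of its estimated probability of being observed. The sketch below is not the weighted GEE of Robins et al. or Preisser et al.; it estimates a single mean, takes the observation probability as the observed fraction within strata of an earlier, fully observed variable, and uses hypothetical names and data throughout.

```python
from collections import defaultdict


def ipw_mean(strata, outcomes):
    """Inverse-probability-weighted mean of `outcomes` (None = missing).

    The probability of being observed is estimated by the observed
    fraction within each stratum.
    """
    n_total = defaultdict(int)
    n_observed = defaultdict(int)
    for s, y in zip(strata, outcomes):
        n_total[s] += 1
        if y is not None:
            n_observed[s] += 1
    weighted_sum = weight_sum = 0.0
    for s, y in zip(strata, outcomes):
        if y is not None:
            weight = n_total[s] / n_observed[s]   # 1 / P(observed | stratum)
            weighted_sum += weight * y
            weight_sum += weight
    return weighted_sum / weight_sum


# Hypothetical final-visit outcomes; early non-responders drop out more often.
strata = ["responder"] * 4 + ["nonresponder"] * 4
outcomes = [10.0, 12.0, 11.0, 13.0, 4.0, None, None, 6.0]

observed = [y for y in outcomes if y is not None]
cc_mean = sum(observed) / len(observed)
weighted_mean = ipw_mean(strata, outcomes)
print(round(cc_mean, 3), weighted_mean)
```

The weights restore the under-represented non-responders, so the weighted mean falls below the complete-case mean.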

There are many software programs designed to implement longitudinal data models under the ignorable situation. Commonly used examples include the R functions lme and nlme and the SAS procedures MIXED, GLIMMIX, and NLMIXED. However, caution should be exercised in the use of these models, since in the non-ignorable case the results will be subject to bias. In the following, we review some steps that should be taken in order to complement and strengthen the analyses based on these models.
