Selected Statistical Topics of Regulatory Importance
This chapter provides a detailed discussion of major statistical issues that commonly arise in the course of drug development and regulatory interactions. The section on multiplicity outlines measures that should be taken to ensure the validity of inferential results that are intended to be the basis for regulatory decision-making. In a separate section, a thorough review of best practices is provided for handling missing values, which are ubiquitous in clinical trials, with special reference to pertinent guidelines and the emergent topic of estimands. While superiority trials are common to support drug approval, there are situations where it is necessary to conduct non-inferiority studies. The regulatory requirements and underlying principles of such trials are summarized, and suggestions are provided relating to the salient points to be considered for both efficacy and safety assessment. In light of the increasing focus on accelerating drug development through improved efficiency, a summary of a few of the commonly used novel approaches is provided, including adaptive and flexible designs, enrichment studies, and studies conducted under the so-called master protocols. Other topics of regulatory and statistical import covered in this chapter include Bayesian approaches, issues with subgroup analysis, biomarkers, and the assessment of benefits and risks of pharmaceutical products.
The issue of multiplicity refers in general to the inflation of the Type I error rate in the interpretation of clinical trial results. Controlling the probability of falsely concluding a treatment effect is of special concern to regulators, and hence multiplicity is often an important statistical issue in the review of confirmatory clinical trials. The question of multiplicity can arise in many ways. We will address three main areas of multiplicity in the regulatory setting: multiple primary endpoints and secondary endpoints with the potential to be included in the product label; multiple testing in the course of the study with the purpose of stopping early for positive results (interim analyses); and subgroup analyses. Within the multiple endpoint section, we will briefly discuss other aspects of a study design that can inflate the Type I error rate. Further detailed discussions of multiplicity may be found, e.g., in Alosh et al. (2014), Dmitrienko et al. (2013), and Huque et al. (2013), among others.
Statistical hypothesis testing in a regulatory setting involves calculating, under the null hypothesis, the probability that the observed treatment effect on a specific variable is due to chance alone. In a randomized study, if this probability (p-value) is low, the null hypothesis (usually that the treatment effect is 0) is rejected and a treatment effect is established. If only one primary variable is used to establish a treatment effect, then the requirement that p < α controls the probability of incorrectly concluding a treatment effect at α. The issue of multiplicity of endpoints refers to the clinical trial setting where more than one variable is used to establish a treatment effect. The chances of obtaining at least one p-value below α increase with the number of endpoints. For example, the probability under the null hypothesis that at least one p-value is less than 0.05 for three independent hypotheses is 1 − (0.95)³ ≈ 0.14. Thus, regulators cannot accept a level α test for each variable if the goal is to rule out incorrectly concluding a treatment effect (Type I error) at an overall probability of α.
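The inflation described above follows directly from the independence calculation; a minimal sketch (the function name is ours, not from the text) reproduces the worked example:

```python
def familywise_error(alpha: float, k: int) -> float:
    """Probability of at least one false positive among k independent
    tests, each conducted at level alpha, when all nulls are true:
    1 - (1 - alpha)^k."""
    return 1 - (1 - alpha) ** k

# The example from the text: three independent hypotheses at alpha = 0.05
print(round(familywise_error(0.05, 3), 3))  # 0.143
```

Note how quickly the familywise rate grows: with ten independent endpoints at α = 0.05 it already exceeds 0.40, which is why regulators require an explicit adjustment.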
In a good clinical trial design, there should be a set of primary endpoints and a level of significance specified in the protocol that will determine whether the study has met its objective or not. The set of primary endpoints consists of the measures that establish the effectiveness of the drug in order to support regulatory action. When there is more than one primary endpoint and an effect on any of the endpoints is sufficient to establish the drug's effectiveness, the rate of falsely concluding that the drug is effective exceeds the Type I error used for each individual hypothesis. Consequently, if the goal is to control at level α the probability that a chance finding is misinterpreted as a treatment effect, an adjustment must be made to the significance tests of the individual variables.
There are many statistical methods to control the overall Type I error in the setting of multiple primary endpoints. Commonly used procedures are the Bonferroni, Hochberg, Holm, and general sequential testing procedures. In the Bonferroni procedure, α is typically divided evenly among the total number of variables T, and each individual p-value is compared to α/T. Holm and Hochberg are both multistep procedures in which the ordered p-values are compared to α/T, α/(T − 1), ..., α. The Holm procedure begins with the smallest observed p-value compared to α/T and proceeds to larger p-values until an observed p-value is not significant. The Hochberg procedure begins with the largest p-value compared to α and proceeds to smaller p-values until significance is reached, at which point all hypotheses with smaller p-values are also rejected. General sequential testing procedures allow for a prespecified ordering of the variables and carry forward any unused α.
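The step-down, step-up, and fixed-sequence logic described above can be sketched in a few lines. This is an illustrative implementation, not a substitute for a validated package; each function returns a per-hypothesis rejection decision:

```python
def holm_reject(pvals, alpha=0.05):
    """Holm step-down: test sorted p-values against alpha/T, alpha/(T-1), ...;
    stop at the first non-significant p-value, rejecting all earlier ones."""
    T = len(pvals)
    order = sorted(range(T), key=lambda i: pvals[i])
    reject = [False] * T
    for step, i in enumerate(order):
        if pvals[i] <= alpha / (T - step):
            reject[i] = True
        else:
            break
    return reject

def hochberg_reject(pvals, alpha=0.05):
    """Hochberg step-up: scan from the largest p-value (threshold alpha)
    downward; once one is significant, reject it and every smaller p-value."""
    T = len(pvals)
    order = sorted(range(T), key=lambda i: pvals[i], reverse=True)
    reject = [False] * T
    for step, i in enumerate(order):
        if pvals[i] <= alpha / (step + 1):
            for j in order[step:]:
                reject[j] = True
            break
    return reject

def fixed_sequence_reject(pvals, alpha=0.05):
    """Fixed-sequence test: hypotheses are tested at full alpha in a
    prespecified order; testing stops at the first non-significant result."""
    reject = []
    for p in pvals:
        if p > alpha:
            break
        reject.append(True)
    return reject + [False] * (len(pvals) - len(reject))

pvals = [0.010, 0.030, 0.040]
print(holm_reject(pvals))      # [True, False, False]
print(hochberg_reject(pvals))  # [True, True, True]
```

The toy p-values illustrate why the choice matters: Hochberg rejects all three hypotheses here while Holm rejects only one, though Hochberg's validity rests on additional assumptions about the dependence among the tests.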
Multiple-testing procedures have been well described in the literature (e.g., Dmitrienko et al. 2013; Proschan and Waclawiw 2000), and it is not our purpose to review them here, or to recommend one procedure over another. The multiple comparison procedure should be prespecified and selected in the context of the specific protocol objectives and the expected treatment effects on the multiple primary endpoints. In some cases there may be a regulatory requirement to show that the treatment is effective on more than one endpoint. For example, a treatment for Alzheimer’s disease might have to show effectiveness on both a measure of cognitive function and on a measure of quality of life. In this case no multiple comparison procedure is required.
In clinical trials it would be remiss not to assess many measures of change in the patient’s disease state beyond the primary endpoint(s). While these measures are not sufficient in and of themselves to establish effectiveness in the disease under study, it may be important to include them in the package insert given that the primary endpoint(s) have established a sufficient basis for approval. The analysis and interpretation of a drug’s effectiveness on these secondary endpoints may also require a multiple comparison procedure to control the overall Type I error at a prespecified level. Positive results (nominal statistical significance) from a list of secondary endpoints without Type I error control would not be sufficient evidence to conclude a treatment effect and would not likely lead to inclusion in the package insert or in promotional material. There are small differences between the FDA guidance and the EMA guidance on Type I error control of secondary endpoints. FDA guidance (FDA 2017b) states: “This includes controlling the Type I error rate within and between the primary and secondary endpoint families”; whereas the EMA guidance (EMA 2017b) states:
Including secondary endpoints in a multiple testing procedure (e.g., a “hierarchy”) is therefore not mandated, but permits a quantification of the risk of a type I error regarding these endpoints, which may lend support that an individual result is sufficiently reliable when included in the Summary of Product Characteristics.
Thus, Type I error control by a multiple comparison procedure on secondary endpoints is extremely useful from the sponsor’s perspective in order to potentially get the information into the package insert, and extremely useful from the regulator’s perspective to guard against misinterpreting a chance finding as a treatment effect. It is recommended that important secondary endpoints with potential label implications be specified in the protocol along with an appropriate multiple comparison procedure to control the Type I error at α.
Some additional important points to consider from a statistical-regulatory perspective regarding Type I error control are discussed below.
An important type of multiplicity involves the analysis of the individual components of composite or multicriteria endpoints. When there are competing outcomes of interest for use as a primary endpoint, it may be advisable to combine them into a single variable or score. In addition to avoiding multiplicity issues, the so-called composite endpoints may be defined to gain power when the incidence rate on the components is anticipated to be low. In some cases, e.g., patient-reported outcomes (PROs), a multicomponent endpoint may be collapsed into a single overall score using suitable summary statistics, such as the sum or average across the individual domain scores. Recently, alternative approaches have been proposed for defining and analyzing composite endpoints. Examples include the win ratio, proposed by Pocock et al. (2012), and the joint rank test of Finkelstein and Schoenfeld (1999). In general, when the composite endpoint is significant, it may be worthwhile to assess the effect of treatment on the components separately. Here a multiplicity procedure should be specified to control falsely concluding a treatment effect on any given component of the composite endpoint. Regulators also have an interest in ensuring that the observed benefit of the study drug is not unduly driven by one or more components of lower clinical relevance within the composite endpoint.
In the above discussion, the focus has been on hypothesis testing. Although confidence intervals are generally used to specify the magnitude of the treatment effect and the associated degree of precision, in some cases they may be used to test hypotheses. When that is the case, it would be appropriate to ensure that multiplicity issues are addressed accordingly.
Multiple endpoints are not the only source of multiplicity in clinical trials that can lead to an inflated Type I error rate. For example, a confirmatory clinical trial may have more than one dose group, where at least one dose must be significantly better than placebo. An oncology study could enroll more than one type of cancer, or the same cancer with different predetermined cell markers, with the objective of approval in any of the cancer types or marker-defined subtypes. These multiple objectives can inflate the Type I error. When multiple primary endpoints are included in these more complex designs, the control of Type I error can be more difficult. However, from the regulators’ perspective, any confirmatory conclusion regarding an indication (or claim within the package insert) must have Type I error control to protect against a false positive statement.
The above discussion of multiplicity in the context of regulatory decision-making reflects a hypothesis testing, frequentist approach to statistics, which is the viewpoint embedded in ICH E9 Statistical Principles for Clinical Trials (1998) and in both the FDA and EMA guidance on multiplicity issues. While ICH E9 is dominated by the frequentist approach, it does not rule out Bayesian approaches. With the increasing use of more complex, multi-objective clinical trials, as well as adaptive and even more innovative clinical trial designs, the use of Bayesian methods to inform regulatory decision-making may increase in the future. However, for now, the dominant regulatory perspective is for the sponsor to control the probability of a study falsely “winning” (concluding a treatment effect) given all the ways of winning as defined within the protocol.