STATISTICAL SIGNIFICANCE VERSUS PRACTICAL IMPORTANCE

The role of significance testing is to determine whether a particular result could plausibly have arisen by chance or is so unusual that it would occur by chance only on rare occasions. Statistics were first employed in agricultural research, where a result so uncommon that it would occur by chance less than 1 time out of 20 (or 5 times in 100) was treated as noteworthy, and this led to the widespread adoption of a criterion of significance set at p < .05. There is nothing that precludes the investigator from being more conservative. For example, a criterion of p < .01 means that a chance result, or Type I error (falsely rejecting the null hypothesis, for example, accepting that an intervention has a positive effect when it does not), would occur less than 1% of the time when the null hypothesis is true.

In a classic between-groups t test, a mean is calculated for each group, and the group mean is then subtracted from each individual’s score in that group. Unfortunately, unless one squares these deviation scores, they sum to zero. Thus, the sum of squared deviations is calculated for each group and then divided by the number of scores in the group, which gives the variance. Taking the square root of the sum of squares divided by n (although n - 1 is used in most inferential statistical equations) yields the standard deviation, which can be calculated for each group and provides a measure of how well the group average, or mean, represents the central tendency of the data. In the case of a t test, the difference between the means is divided by the standard error of that difference, which is based on the pooled standard deviation and the group sizes. The resulting ratio, if sufficiently large given the number of study participants (and resultant degrees of freedom), will reach statistical significance. A one-way ANOVA, or F test, is simply the variance of the group means around a grand mean (the mean of the means) divided by the pooled within-group variance.
Thus, a large ratio in a t test or an F test can result from a large difference between means (the between-group effect), from a small standard deviation within the groups (the within-group effect), or from both.
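
The arithmetic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores are invented for the example, the function names (sum_sq_dev, pooled_t, one_way_f) are hypothetical, the n - 1 form of the standard deviation is used, and the grand mean is taken over all scores, which equals the mean of the group means only when the groups are the same size.

```python
import math


def mean(scores):
    return sum(scores) / len(scores)


def sum_sq_dev(scores):
    """Sum of squared deviations of each score from its group mean."""
    m = mean(scores)
    return sum((x - m) ** 2 for x in scores)


def std_dev(scores):
    """Standard deviation using n - 1, as in most inferential formulas."""
    return math.sqrt(sum_sq_dev(scores) / (len(scores) - 1))


def pooled_t(group_a, group_b):
    """Between-groups t: mean difference over its pooled standard error."""
    df = len(group_a) + len(group_b) - 2
    pooled_var = (sum_sq_dev(group_a) + sum_sq_dev(group_b)) / df
    std_err = math.sqrt(pooled_var * (1 / len(group_a) + 1 / len(group_b)))
    return (mean(group_a) - mean(group_b)) / std_err, df


def one_way_f(groups):
    """One-way ANOVA F: between-groups variance over within-groups variance."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(sum_sq_dev(g) for g in groups)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


# Invented scores, for illustration only.
treatment = [12, 15, 11, 14, 13]
control = [10, 9, 11, 8, 12]

t_stat, t_df = pooled_t(treatment, control)
f_ratio, df_between, df_within = one_way_f([treatment, control])

print("SD treatment:", std_dev(treatment), "SD control:", std_dev(control))
print("t =", t_stat, "with", t_df, "degrees of freedom")
print("F =", f_ratio, "with", df_between, "and", df_within, "degrees of freedom")
```

With only two groups, the F ratio is the square of the t ratio, which is one way to see that the two tests compare the same between-group difference against the same within-group variability.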

Because the number of study participants enters the test statistic through the standard error, and because a larger number of participants lowers the t ratio or F ratio required for statistical significance, groups with very large numbers of subjects can achieve statistical significance without an actually large effect. A statistically significant effect means that a researcher can regard his or her results as reliable at a specific p value, in the sense that they are unlikely to be due to chance alone. However, with a large n, a statistically significant result is not necessarily practically important. For example, consider an experiment in which there are 1,000 persons in each of three different groups. One could calculate the effect size for a t test or an ANOVA F by dividing the explained (between-groups) sum of squares by the total sum of squares in the model, obtaining a value equivalent to R², the proportion of variance explained by the model. In this case, an effect that explains less than 6% of the variability might produce a statistically significant result but a trivial finding. This is why it is essential that investigators specify a priori a clinically meaningful effect size that is suggested by existing literature or that can be argued to have clinical significance (see Chapter 17 for a discussion on clinical significance). On the other hand, one may have very large group differences but fail to achieve statistical significance because of a small number of subjects. One cannot tout such a finding because, if it does not reach statistical significance, it is not considered reliable. However, such a large effect size would likely prompt the experimenter to conduct a larger experiment with a greater number of subjects.
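
The contrast between the two situations can be sketched numerically. The following is a minimal illustration, not a general-purpose routine: the group means, standard deviation, and sample sizes are invented, the function name anova_from_summary is hypothetical, and the groups are assumed to have equal n and a common standard deviation. The p value uses a closed form for the F distribution that holds only when the numerator has 2 degrees of freedom, which is the case for exactly three groups.

```python
def anova_from_summary(means, sd, n):
    """F ratio, p value, and proportion of variance explained for three
    equal-sized groups, computed from summary statistics.

    Assumes each group has n scores and the same standard deviation sd.
    """
    k = len(means)
    if k != 3:
        raise ValueError("the closed-form p value below needs exactly 3 groups")

    grand_mean = sum(means) / k
    ss_between = n * sum((m - grand_mean) ** 2 for m in means)
    ss_within = k * (n - 1) * sd ** 2
    df_between = k - 1
    df_within = k * (n - 1)

    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    # Upper-tail probability of F when the numerator df is 2:
    # P(F > f) = (1 + 2f / df_within) ** (-df_within / 2)
    p_value = (1 + 2 * f_ratio / df_within) ** (-df_within / 2)
    # Explained sum of squares over total sum of squares (equivalent to R squared).
    var_explained = ss_between / (ss_between + ss_within)
    return f_ratio, p_value, var_explained


# Invented values: tiny mean differences, 1,000 persons per group.
large_f, large_p, large_r2 = anova_from_summary([100, 101, 102], sd=15, n=1000)

# Invented values: large mean differences, only 5 persons per group.
small_f, small_p, small_r2 = anova_from_summary([100, 110, 120], sd=15, n=5)

print("n = 1000 per group:", large_f, large_p, large_r2)
print("n = 5 per group:   ", small_f, small_p, small_r2)
```

Under these invented values, the large study reaches p < .05 while explaining well under 1% of the variance, and the small study explains roughly a quarter of the variance yet does not reach p < .05.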