be one. Measures of effect size, interpreted in conjunction with significance tests, help to communicate whether something has “happened” or “not happened” in the given study or experiment. The reader interested in effect sizes can turn to a multitude of sources (Cortina and Nouri, 1999; Rosenthal, Rosnow, and Rubin, 2000). For our purposes, it suffices to review the principle of an effect size measure rather than catalog the wealth of measures available. Perhaps the most straightforward way of conceptualizing an effect size is as a measure of standardized statistical distance, Cohen's d, already featured in our computations of power.
2.28.6 Statistical Distance: Cohen's d
For a one‐sample z‐test, Cohen's d (Cohen, 1988) is defined as the absolute distance between the observed sample mean and the population mean under the null hypothesis, divided by the population standard deviation:

\[
d = \frac{|\bar{y} - \mu_0|}{\sigma}
\]

In the above, since the denominator is the population standard deviation σ rather than the standard error σ/√n, d is expressed in standard‐deviation units and, unlike the z statistic, does not grow simply because the sample size increases.
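As a simple illustration (with purely hypothetical numbers), suppose a sample of students obtains a mean of ȳ = 102 on a test for which the null hypothesis posits μ₀ = 100, and the population standard deviation is σ = 10. Then

\[
d = \frac{|102 - 100|}{10} = 0.20
\]

which, as discussed next, corresponds to a “small” effect by Cohen's guidelines.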
Cohen offered the guidelines of 0.20, 0.50, and 0.80 as representing small, medium, and large effects, respectively (Cohen, 1988). However, relying on such guidelines to judge the absolute size of an experimental or nonexperimental effect should be done only in the complete absence of any other information about the research area. In the end, it is the researcher, armed with knowledge of the history of the phenomenon under study, who must evaluate whether an effect is small or large. For instance, referring to the achievement example discussed earlier, Cohen's d would be equal to:

\[
d = \frac{|\bar{y} - \mu_0|}{\sigma} = \frac{1}{10} = 0.10
\]

where the mean difference of 1 point is evaluated against a population standard deviation of 10.
The effect size of 0.1 is small according to Cohen's guidelines, but more importantly, it is also small substantively, since a difference in means of 1 point is, by all accounts, likely trivial. In this case, Cohen's guidelines and the substantive evaluation of the size of effect coincide. However, this is not always the case. In physical or biological experiments, for instance, one can easily imagine examples for which an effect size of even 0.8 might be considered “small” relative to the research area under investigation, since the degree of control the investigator can impose over his or her subjects is much greater. In such cases, it may very well be that Cohen's d values in the neighborhood of two or three would be required for an effect to be considered “large.” The point is that only in the complete absence of information regarding an area of investigation is it appropriate to use “rules of thumb” to evaluate the size of an effect. Cohen's d, and effect size measures in general, should always be used in conjunction with statements of statistical significance, since they convey what the researcher actually wants to know: the estimated separation between the sample data (often in the form of a sample mean) and the null hypothesis under investigation. Oftentimes meta‐analysis, the study of the overall measure of effect for a given phenomenon across studies, can be helpful in comparing new research findings to the “status quo” in a given field. For a thorough, user‐friendly overview of the methodology, consult Shelby and Vaske (2008).
2.28.7 What Does Cohen's d Actually Tell Us?
Writing out a formula and plugging in numbers, unfortunately, does not necessarily give us a feeling for what the formula actually means. This is especially true with regard to Cohen's d. We now discuss the statistic in a bit more detail, pointing out why it is usually interpreted as the standardized difference between means.
Imagine you have two independent samples of laboratory rats. To one sample, you provide normal feeding and observe their weight over the next 30 days. To the other sample, you provide the same feeding but also administer regular doses of a weight‐loss drug. You are interested in learning whether your weight‐loss drug works. Suppose that after 30 days, a mean difference of 0.2 pounds is observed between groups. How big is a difference of 0.2 pounds for these groups? If the typical difference in weight among rats in the population were very large, say, 0.8 pounds, then a mean difference of 0.2 pounds would not be that impressive. After all, if rats vary considerably in weight from one to the next, then finding a mean difference of 0.2 pounds between groups cannot be that exciting. However, if the typical weight difference between rats were equal to 0.1 pounds, then all of a sudden a mean difference of 0.2 pounds seems much more impressive, because a difference of that size is atypical relative to the population. What is “typical”? That is exactly what the standard deviation reveals. Hence, when we compute Cohen's d, we are in actuality producing a ratio of one deviation relative to another, just as, when we compute a z‐score, we compare the deviation y − μ with the standard deviation σ. The extent to which observed differences are large relative to “average” differences is the extent to which d will be large in magnitude.
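A quick computation makes the contrast explicit. Treating the “typical” weight difference in each scenario as the standard deviation in the denominator (the numbers are those from the example above):

\[
d = \frac{0.2}{0.8} = 0.25 \qquad \text{versus} \qquad d = \frac{0.2}{0.1} = 2.0
\]

The same observed mean difference of 0.2 pounds registers as a small effect in the first scenario and a very large one in the second.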
2.28.8 Why and Where the Significance Test Still Makes Sense
At this point, the conscientious reader may very well be asking the following question: if the significance test is so misleading and so subject to misunderstanding and misinterpretation, how does it make sense as a test of anything? It would appear to be a nonsensical test, one that should be forever forgotten. The fact is that the significance test does make sense; it is just that the sense it makes is not necessarily scientific. Rather, it is statistical. To a theoretical statistician or mathematician, a decreasing p‐value as a function of an increasing sample size makes perfect sense: as we snoop a larger part of the population, the random error we expect typically decreases, because with each increase in sample size we obtain a better estimate of the true population parameter. Hence, that we achieve statistical significance with a sample size of 500 but not 100, for instance, is entirely consistent with statistical “good sense.” That is, the p‐value is functioning as it should and is yielding the correct statistical information.
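To see this in symbols, consider the one‐sample z statistic with the achievement example's numbers (a mean difference of 1 point and σ = 10), used here purely for illustration:

\[
z_{n=100} = \frac{1}{10/\sqrt{100}} = 1.00 \;\; (p \approx .32), \qquad
z_{n=500} = \frac{1}{10/\sqrt{500}} \approx 2.24 \;\; (p \approx .03)
\]

The effect size is identical in both cases (d = 0.10); only the precision of the estimate, and hence the p‐value, has changed.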
However, statistical truth does not equate to scientific truth (Bolles, 1962). Statistical conclusions should never be automatically equated with scientific ones; they are different and distinct things. When we arrive at a statistical conclusion (e.g., a decision to reject the null hypothesis), we can never assume that it represents something scientifically meaningful in itself. Rather, the statistical conclusion should be treated as a potential indicator that something scientifically interesting may have occurred, the evidence for which must be established by other means, including effect sizes, researcher judgment, and placing the obtained result in its proper interpretive context.
2.29 CHAPTER SUMMARY AND HIGHLIGHTS
To understand advanced statistical procedures, it is necessary to have a firm grasp on the foundations of introductory statistics. Advanced procedures are typically extensions of first principles.
Densities are theoretical probability distributions. The normal univariate density is an example.