deleted, see Section for a discussion). If you are completely unfamiliar with boxplots, see Denis (2020) for an overview.
Stem‐and‐leaf plots are also easily produced. These visual displays are kind of “naked histograms,” because they reveal the actual observations in the data while also providing information about their frequency of occurrence. In 1710, John Arbuthnot analyzed data on the ratios of males to female births in London from 1629 to 1710 and in so doing made an argument for these births being a function of a “divine being” (Arbuthnot, 1710; Shoesmith, 1987). One of his variables was the number of male christenings (i.e., baptisms) over the period 1629–1710. We generate a stem‐and‐leaf plot in R of these male christenings using package aplpack
(Wolf and Bielefeld, 2014), for which the “leaves” are corresponding hundreds. For example, in the following plot, the first value of 2|8 would appear to represent a value of 2800 but is rounded down from the actual value in the data (which is also the minimum) of 2890. The maximum in the data is actually equal to 8426, but is represented by 8400 (i.e., 8|0012334):
> install.packages(“aplpack”) > library(aplpack) > library(HistData) > attach(Arbuthnot) > stem.leaf(Males) 1 | 2: represents 1200 leaf unit: 100 n: 82 1 2. | 8 10 3* | 011222334 15 3. | 66777 18 4* | 014 25 4. | 6777899 36 5* | 01112233444 38 5. | 56 (11) 6* | 00001122444 33 6. | 5555899 26 7* | 244 23 7. | 5555666666778999 7 8* | 0012334
2.28 WHAT MAKES A p‐VALUE SMALL? A CRITICAL OVERVIEW AND PRACTICAL DEMONSTRATION OF NULL HYPOTHESIS SIGNIFICANCE TESTING
The workhorse for establishing statistical evidence in the social and natural sciences is the method of null hypothesis significance testing (or, “NHST” for short). However, since its inception with R.A. Fisher in the early 1900s, the significance test has been the topic of much debate, both statistical and philosophical. Throughout much of this book, NHST is regularly used to evaluate null hypotheses in methods such as the analysis of variance, regression, and various multivariate procedures. Indeed, the procedure is universally used in most statistical methods.
It behooves us then, before embarking on all of these methodologies, to discuss the nature of the null hypothesis significance test, and clearly demonstrate what it actually means, not only in a statistical context but also in how it should be interpreted in a research or substantive context.
The purpose of this final section of the present chapter is to provide a clear and concise demonstration and summary of the factors that influence the size of a computed p‐value in virtually every statistical significance test. Understanding why statements such as “p < 0.05” can be reflective of even the smallest and trivial of effects is critical for the practitioner or researcher to appreciate if he or she is to assess and appraise statistical evidence in an intelligent and thoughtful manner. It is not an exaggeration to say that if one does not understand the make‐up of a p‐value and the factors that directly influence its size, one cannot properly evaluate statistical evidence, nor should one even make the attempt to do so. Though these arguments are not new and have been put forth by even the very best of methodologists (e.g., see Cohen, 1990; Meehl, 1978) there is evidence to suggest that many practitioners and researchers do not understand the factors that determine the size of a p‐value (Gigerenzer, 2004). To emphasize once again—understanding the determinants of a p‐value and what makes p‐values distinct from effect sizes is not simply “fashionable.” Rather, it is absolutely mandatory for any attempt to properly evaluate statistical evidence in a research report. Does the paper you're reading provide evidence of a successful treatment for cancer? If you do not understand the distinctions between p‐values and effect sizes, you will be unable to properly assess the evidence. It is that important. As we will see, stating a result as “statistically significant” does not in itself tell you whether the treatment works or does not work, and in some cases, tells you very little at all from a scientific vantage point.
2.28.1 Null Hypothesis Significance Testing (NHST): A Legacy of Criticism
Criticisms targeted against null hypothesis significance testing have inundated the literature since at least the time Berkson in 1938 brought to light how statistical significance can be easily achieved by simple manipulations of sample size:
I believe that an observant statistician who has had any considerable experience with applying the chi‐square test repeatedly will agree with my statement that, as a matter of observation, when the numbers in the data are quite large, the P's tend to come out small. (p. 526)
Since Berkson, the very best and renown of methodologists have remarked that the significance test is subject to gross misunderstanding and misinterpretation (e.g., see Bakan, 1966; Carver, 1993; Cohen, 1990; Estes, 1997; Loftus, 1991; Meehl, 1978; Oakes, 1986; Shrout, 1997; Wilson, Miller, and Lower, 1967). And though it can be difficult to assess or evaluate whether the situation has improved, there is evidence to suggest that it has not. Few describe the problem better than Gigerenzer in his article Mindless statistics (Gigerenzer, 2004), in which he discusses both the roots and truths of hypothesis testing, as well as how its “statistical rituals” and practices have become far more of a sociological phenomenon rather than anything related to good science and statistics.
Other researchers have found that misinterpretations and misunderstandings about the significance test are widespread not only among students but also among their instructors (Haller and Krauss, 2002). What determines statistical significance and what is it a function of? This is an extremely important question. An unawareness of the determinants of statistical significance leaves the door open to misunderstanding and misinterpretation of the test, and the danger to potentially draw false conclusions based on its results. Too often and for too many, the finding “p < 0.05” simply denotes a “good thing” of sorts, without ever being able to pinpoint what is so “good” about it.
Recall the familiar one‐sample z‐test for a mean discussed earlier:
where the purpose of the test was to compare an obtained sample mean
As a first case, consider the distance