as small or as large as one would like by choosing to do a study or experiment such that the combination of these inputs (the distance between means, the population variability, and the sample size) yields the desired result.
The important point here is that a large value of zM does not necessarily mean something of any practical or scientific significance occurred in the given study or experiment. This fact has been reiterated countless times by the best of methodologists, yet too often researchers fail to emphasize this extremely important truth when discussing findings:
A p‐value, no matter how small or large, does not necessarily equate to the success or failure of a given experiment or study.
Too often a statement of “p < 0.05” is recited to an audience with the implication that somehow this necessarily constitutes a “scientific finding” of sorts. This is entirely misleading, and the practice needs to be avoided. The solution, as we will soon discuss, is to pair the p‐value with a report of the effect size.
2.28.3 The Issue of Standardized Testing: Are Students in Your School Achieving More Than the National Average?
To demonstrate how adjusting the inputs to zM can have a direct impact on the obtained p‐value, consider a school psychologist who hypothesizes that, as a result of an intensified program implementation in her school, her students will, on average, have a higher achievement mean than the national average of students in the same grade. Suppose that the national average on a given standardized performance test is equal to 100. If the school psychologist is correct that her students are, on average, more advanced performance‐wise than the national average, then her students should, on average, score higher than the national mark of 100. She decides to sample 100 students from her school and obtains a sample achievement mean of ȳ = 101, with an estimated population standard deviation s equal to 10. The one‐sample t statistic is then

t = (ȳ − μ0)/(s/√n) = (101 − 100)/(10/√100) = 1/1 = 1.0

On degrees of freedom equal to n − 1 = 100 − 1 = 99, for a two‐tailed test, we require a t statistic of ±1.984 for the result to be statistically significant at a significance level of 0.05. Hence, the obtained value of t = 1 is not statistically significant. That the result is not statistically significant is hardly surprising, since the sample mean of the psychologist's school is only 101, just one point higher than the national average of 100. The computation of t is thus telling a story consistent with our intuition: there is no reason to believe that the school's performance is higher than the national average in the population from which these sample data were drawn.
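The computation above is easily verified by hand or with a short script. A minimal sketch in Python, using only the standard library (the helper name one_sample_t is ours, introduced for illustration):

```python
import math

def one_sample_t(ybar, mu0, s, n):
    """One-sample t statistic: t = (ybar - mu0) / (s / sqrt(n))."""
    return (ybar - mu0) / (s / math.sqrt(n))

# Values from the example: sample mean 101, national mean 100, s = 10, n = 100
t = one_sample_t(ybar=101, mu0=100, s=10, n=100)
print(t)  # 1.0, well below the two-tailed critical value of 1.984 on df = 99
```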
Now, consider what would have happened had the psychologist collected a larger sample, suppose n = 500. Using our new sample size, and still assuming an estimated population standard deviation s equal to 10 and a distance between means equal to 1, we repeat the computation for t:

t = (ȳ − μ0)/(s/√n) = (101 − 100)/(10/√500) = 1/0.447 ≈ 2.24

What happened? The obtained value of t increased from 1 to 2.24 simply as a result of collecting a larger sample, nothing more. The actual distance between means remained the same (101 − 100 = 1). The degrees of freedom for the test have changed and are now equal to 499 (i.e., n − 1 = 500 − 1 = 499). Since our obtained t of 2.24 exceeds critical t, our statistic is deemed statistically significant at p < 0.05. What is important to realize is that we did not change the difference between the sample mean and the population mean; all we changed was the sample size.
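The dependence on n alone can be made explicit by holding the mean difference and s fixed while varying only the sample size; a hypothetical sketch:

```python
import math

# Fixed inputs from the example: mean difference = 1, s = 10
diff, s = 1.0, 10.0

for n in (100, 500, 1000, 5000):
    t = diff / (s / math.sqrt(n))
    print(n, round(t, 2))
# t grows as sqrt(n): roughly 1.0, 2.24, 3.16, 7.07,
# even though the distance between means never changes
```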
The problem is not that the significance test is not useful and therefore should be banned. The problem is that too few are aware that the statement “p < 0.05,” in itself, scientifically (as opposed to statistically) may have little meaning in a given research context, and at worst, may be entirely misleading if automatically assigned any degree of scientific importance by the interpreter.
2.28.4 Other Test Statistics
The factors that influence the size of a p‐value are, of course, not only relevant to z‐ and t‐tests; they are at work in essentially every test of statistical significance we might conduct. For instance, as we will see in the following chapter, the size of the F‐ratio in traditional one‐way ANOVA is subject to the same influences. Since F is taken as the ratio of MS between to MS error, the three determining influences on the size of p are (1) the size of MS between, which reflects the extent to which means differ from group to group, (2) the size of MS error, which in part reflects the within‐group variability, and (3) sample size (when computing MS error, we divide the sum of squares for error by degrees of freedom, which are determined in large part by sample size). Hence, a large F statistic does not necessarily imply that MS between is absolutely large, any more than a large t necessarily implies that the distance between means is large.
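The same behavior can be demonstrated numerically for the F‐ratio. In the hypothetical sketch below, two groups keep identical means and identical within‐group spread, and only the number of replicates changes (the helper one_way_F is ours, written from scratch for illustration):

```python
def one_way_F(groups):
    """F = MS_between / MS_error for a one-way layout, computed from scratch."""
    N = sum(len(g) for g in groups)
    k = len(groups)
    grand = sum(sum(g) for g in groups) / N
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_error = sum((y - sum(g) / len(g)) ** 2 for g in groups for y in g)
    return (ss_between / (k - 1)) / (ss_error / (N - k))

a, b = [48, 50, 52], [49, 51, 53]  # group means 50 and 51, identical spread
for r in (1, 10, 100):             # replicate each group's data r times
    print(r, round(one_way_F([a * r, b * r]), 2))
# F climbs from 0.38 to 5.44 to 56.06, although the group means
# and the within-group variability never change
```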
These ideas for significance tests apply in even the most advanced of modeling techniques, such as structural equation modeling (see Chapter 15). The typical measure of model fit is the chi‐square statistic, χ2, which as reported by many (e.g., see Bollen, 1989; Hoelter, 1983) suffers the same interpretational problems as t and F regarding how its magnitude can be largely a function of sample size. That is, one can achieve a small or large χ2 simply because one has used a small or large sample. If a researcher is not aware of this fact, he or she may decide that a model is well‐fitting or poor‐fitting based on a small or large chi‐square value, without awareness of its connection with n. This is in part why other measures, as we will see, have been proposed for interpreting the fit of SEM models (e.g., see Browne and Cudeck, 1993).
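The mechanics behind this are transparent: under maximum likelihood estimation, the model chi‐square is computed as χ2 = (N − 1)F_ML, where F_ML is the minimized value of the discrepancy function. A toy sketch (the discrepancy value 0.02 below is an assumed, hypothetical number):

```python
# Model chi-square in ML-based SEM: chi_sq = (N - 1) * F_ML
F_ml = 0.02  # hypothetical minimized discrepancy, held fixed across samples

for N in (100, 500, 2000):
    chi_sq = (N - 1) * F_ml
    print(N, round(chi_sq, 2))
# The identical model-data discrepancy yields chi-square values of
# 1.98, 9.98, and 39.98 as N grows, so the "same" model can look
# well-fitting at N = 100 and poor-fitting at N = 2000
```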
2.28.5 The Solution
The solution to episodes of misunderstanding the significance test is not to drop or ban it, contrary to what some have recommended (e.g., Hunter, 1997). Rather, the solution is to supplement it with a measure that accounts for the actual distance between means and serves to convey the magnitude of the actual scientific finding: the effect size.
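For the achievement example, one such supplement is a standardized effect size such as Cohen's d; a minimal sketch:

```python
import math

# Cohen's d = (ybar - mu0) / s; unlike t, it does not involve n
d = (101 - 100) / 10
print(d)  # 0.1, a small effect whether n = 100 or n = 500

# By contrast, t at those two sample sizes:
for n in (100, 500):
    print(n, round((101 - 100) / (10 / math.sqrt(n)), 2))
```

Reporting d alongside p makes plain that the one-point advantage is modest no matter how large a sample was used to detect it.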