receive completely different scores. A reliability of 1.00 indicates that scores on the test are perfectly accurate for each student. This means that the scores contain no errors. If students took the same test right after they had completed it the first time, they would receive precisely the same scores. Fortunately, no externally designed assessments have a reliability of 0.00. Unfortunately, no externally designed assessments have a reliability of 1.00—simply because it is impossible to construct such a test.
Most externally designed assessments have reliabilities of about 0.85 or higher. Unfortunately, even with a relatively high reliability, the information a test provides about individuals has a great deal of error in it, as figure I.2 shows.
Note: The standard deviation of this test was 15, and the upper and lower limits have been rounded.
Figure I.2: Reliabilities and 95 percent confidence intervals.
Figure I.2 depicts the degree of precision of individual students’ scores across five levels of reliability: 0.45, 0.55, 0.65, 0.75, and 0.85. These levels represent the range of reliabilities one can expect for assessments students will see in K–12 classrooms. At the low end are assessments with reliabilities of 0.45. These might be hastily designed assessments that teachers create. At the high end are externally designed assessments with reliabilities of 0.85 or even higher. The second column represents the observed score, which is 70 in all situations. The third and fourth columns represent the lower limit and upper limit of a band of scores into which we can be 95 percent sure that the true score falls. The range represents the size of the 95 percent confidence interval.
The pattern of scores in figure I.2 indicates that as reliability goes down, one has less and less confidence in the accuracy of the observed score for an individual student. For example, if the reliability of an assessment is 0.85, we can be 95 percent sure that the student’s true score is somewhere between eleven points lower than the observed score and eleven points higher than the observed score, for a range of twenty-two points. However, if the reliability of an assessment is 0.55, we can be 95 percent sure that the true score is anywhere between twenty points lower than the observed score and twenty points higher than the observed score.
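The limits discussed above can be reproduced with a short calculation. The sketch below assumes the standard classical test theory formula, in which the standard error of measurement equals the standard deviation times the square root of one minus the reliability, and the 95 percent band is the observed score plus or minus 1.96 standard errors. The text does not state this formula, and the function names are illustrative, so treat this as one plausible reconstruction rather than the method used to build figure I.2.

```python
import math

Z_95 = 1.96  # two-sided 95 percent critical value of the normal distribution


def standard_error_of_measurement(sd, reliability):
    """Assumed CTT formula: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def confidence_interval_95(observed, sd, reliability):
    """Return the (lower, upper) limits of the 95 percent band around an observed score."""
    half_width = Z_95 * standard_error_of_measurement(sd, reliability)
    return observed - half_width, observed + half_width


if __name__ == "__main__":
    # Observed score of 70 and standard deviation of 15, as in the figure note.
    for reliability in (0.85, 0.75, 0.65, 0.55, 0.45):
        lower, upper = confidence_interval_95(70, 15, reliability)
        lower_r, upper_r = round(lower), round(upper)  # limits are rounded, as in the note
        print(f"{reliability:.2f}  {lower_r}  {upper_r}  {upper_r - lower_r}")
```

Under these assumptions, the rounded half-width is about eleven points at a reliability of 0.85 and about twenty points at 0.55, which agrees with the values quoted in the text.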
These facts have massive implications for how we design and interpret assessments. Consider the practice of using one test to determine if a student is competent in a specific topic. If the test has a reliability of 0.85, an individual student’s true score could be eleven points higher or lower than the observed score. If the test has a reliability of 0.55, an individual student’s true score could be twenty points higher or lower than the observed score. Making the situation worse, in both cases we are only 95 percent sure the true score is within the identified lower and upper limits. We cannot overstate the importance of this point. All too often and in the name of summative assessment, teachers use a single test to determine if a student is proficient in a specific topic. If a student’s observed score is equal to or greater than a set cut score, teachers consider the student to be proficient. If a student’s score is below the set cut score, even by a single point, teachers consider the student not to be proficient.
Examining figure I.2 commonly prompts the question, Why are assessments so imprecise regarding the scores for individual students even if they have relatively high reliabilities? The answer to this question is simple. Test makers designed and developed CTT with the purpose of scoring groups of students as opposed to scoring individual students. Reliability coefficients, then, tell us how similar or different groups of scores would be if students retook a test. They cannot tell us about the variation in scores for individuals. Lee J. Cronbach (the creator of coefficient alpha, one of the most popular reliability indices) and his colleague Richard J. Shavelson (2004) strongly emphasize this point when they refer to reliability coefficients as “crude devices” (p. 394) that really don’t tell us much about individual test takers.
To illustrate what reliability coefficients tell us, consider figure I.3.
Source: Marzano, 2018, p. 62.
Figure I.3: Three administrations of the same test.
Figure I.3 illustrates precisely what a traditional reliability coefficient means. The first column, Initial Administration, reports the scores of ten students on a specific test. The second column, Second Administration (A), represents the scores from the same students after they have taken the test again. But before students took the test the second time, they forgot that they had taken it the first time, so the items appear new to them. While this cannot occur in real life and seems like a preposterous notion, it is, in fact, a basic assumption underlying the reliability coefficient. As Cronbach and Shavelson (2004) note:
If, hypothetically, we could apply the instrument twice and on the second occasion have the person unchanged and without memory of his first experience, then the consistency of the two identical measurements would indicate the uncertainty due to measurement error. (p. 394)
The traditional reliability coefficient simply tells us how similar the sets of scores are between the first and second test administrations. In figure I.3, the scores on the first administration and the second administration (A) are quite similar. Student 1 receives a 97 on the first administration and a 98 on the second administration; student 2 receives a 92 and a 90, respectively; and so on. There are some differences in scores, but they are small. The last row of the table shows the correlation between the initial administration and the second administration (A). That correlation (0.96) is, in fact, the reliability coefficient, and it is quite high.
But let’s now consider another scenario, as we depict in the last column of figure I.3, Second Administration (B). In this scenario, students receive very different scores on the second administration. Student 1 receives a score of 97 on the first administration and a score of 82 on the second; student 2 receives a 92 and an 84, respectively. If the second administration of the test produces a vastly different pattern of scores, we would expect the correlation between the two administrations (that is, the reliability coefficient) to be quite low, which it is. The last row of the table indicates that the reliability coefficient is 0.32.
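To make the idea of reliability as a correlation concrete, the sketch below computes a Pearson correlation between an initial administration and two hypothetical retests. Only the scores for students 1 and 2 come from the text. The other eight students' scores in each list are invented for illustration, so the resulting coefficients will not match the 0.96 and 0.32 reported in figure I.3. They only show the same contrast between a consistent retest and an inconsistent one.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when all scores in a list are equal")
    return cross / math.sqrt(ss_x * ss_y)


# First two entries in each list are from the text; the remaining eight are hypothetical.
initial = [97, 92, 86, 83, 81, 80, 78, 77, 70, 65]
second_a = [98, 90, 80, 83, 79, 83, 77, 75, 68, 66]  # similar pattern of scores
second_b = [82, 84, 92, 70, 65, 83, 91, 68, 66, 72]  # very different pattern

r_a = pearson_correlation(initial, second_a)
r_b = pearson_correlation(initial, second_b)
print(f"initial vs. second (A): {r_a:.2f}")
print(f"initial vs. second (B): {r_b:.2f}")
```

The coefficient describes how well the whole set of scores holds its pattern across administrations. It is a property of the group and says nothing directly about how far any one student's score moved.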
So how can educators obtain precise scores for individual students using classroom assessments? They can design multiple assessments and administer them over time.
Multiple Assessments
The preceding discussion indicates that as long as we think of tests as independent events, the scores from which educators must interpret in isolation, there is little hope for precision at the individual student level. However, if one changes the perspective from a single assessment to multiple assessments administered and interpreted over time, then it becomes not only possible but fairly straightforward to generate a relatively precise summary score for individuals.
To illustrate, consider the following five scores for an individual student on a specific topic gathered over the course of a grading period:
70, 72, 75, 77, 81
We have already discussed that any one of these scores in isolation probably does not provide a great deal of accuracy. Recall from figure I.2 (page 3) that even if all test reliabilities were 0.85, we would have to add and subtract about