Robert J. Marzano

Making Classroom Assessments Reliable and Valid



to measure. For large-scale assessments, this tends to create a problem from the outset since most large-scale assessments are designed to measure entire subject areas for a particular grade level. For example, a state test in English language arts (ELA) at the eighth-grade level is designed to measure all the content taught at that level. A quick analysis of the content in eighth-grade ELA demonstrates the problem.

      According to Robert J. Marzano, David C. Yanoski, Jan K. Hoegh, and Julia A. Simms (2013), there are seventy-three eighth-grade topics for ELA in the CCSS. Researchers and educators refer to these as elements. Each of these elements contains multiple embedded topics, which means that a large-scale assessment must have multiple sections to be considered a valid measure of those topics.

      Of course, sampling techniques would allow large-scale test designers to address a smaller subset of the seventy-three elements. However, validity is still a concern. To cover even a representative sample of the important content would require a test that is too long to be of practical use. As an example, assume that a test was designed to measure thirty-five (about half) of the seventy-three ELA elements for grade 8. Even if each element had only five items, the test would still contain 175 items, rendering it impractical for classroom use.
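The arithmetic behind this example can be sketched in a few lines. The function name `total_items` is a hypothetical helper introduced here for illustration; the counts (seventy-three elements, thirty-five sampled, five items each) come from the paragraph above.

```python
def total_items(num_elements: int, items_per_element: int) -> int:
    """Return the number of items on a test that covers the given elements."""
    return num_elements * items_per_element


ALL_ELEMENTS = 73      # eighth-grade ELA elements cited in the text
SAMPLED_ELEMENTS = 35  # roughly half, as in the example
ITEMS_PER_ELEMENT = 5

print(total_items(SAMPLED_ELEMENTS, ITEMS_PER_ELEMENT))  # 175
print(total_items(ALL_ELEMENTS, ITEMS_PER_ELEMENT))      # 365
```

Covering all seventy-three elements at the same rate would take 365 items, which shows why sampling alone does not resolve the length problem.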

      Relative to validity, CAs have an advantage over large-scale assessments in that they can and should be focused on a single topic (technically referred to as a single dimension). In fact, making assessments highly focused in terms of the content they address is a long-standing recommendation from the assessment community to increase validity (see Kane, 2011; Reckase, 1995). This makes intuitive sense. Since CAs will generally focus on one topic or dimension over a relatively short period, teachers can more easily ensure that they have acceptable levels of validity. Indeed, recall from the previous discussion that some measurement experts contend that CAs have such high levels of validity that we should not be concerned about their seemingly poor reliability.

      The aspect of CA validity that is more difficult to address is that all tests within a set must measure precisely the same topic and contain items at the same levels of difficulty. This requirement is obvious if one examines the scores depicted in figure I.2. If these scores are to truly depict a given student’s increase in his or her true score for the topic being measured, then educators must design the tests to be as identical as possible. If, for example, the fourth test in figure I.2 is much more difficult than the third test, a given student’s observed score on that fourth test will be lower than the score on the third test even though the student’s true score has increased (the student has learned relative to the topic of the tests).
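A toy calculation can make this concrete. The sketch below is not the author's model: it assumes, purely for illustration, that a harder test lowers the observed score by a fixed number of points. The function `observed_score` and all of the numbers are hypothetical.

```python
def observed_score(true_score: int, difficulty_penalty: int) -> int:
    """Toy model (an assumption): a harder test lowers the score by a fixed penalty."""
    return true_score - difficulty_penalty


# Hypothetical values chosen only for illustration.
third_test = observed_score(true_score=70, difficulty_penalty=0)
fourth_test = observed_score(true_score=75, difficulty_penalty=10)

print(third_test, fourth_test)  # 70 65
```

In this sketch the student's true score rises from 70 to 75, yet the observed score falls from 70 to 65 because the fourth test is harder. Holding difficulty constant across tests removes that distortion.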

      Sets of tests designed to be close to one another in the topic measured and the levels of difficulty of the items are referred to as parallel tests. In more technical terms, parallel tests measure the same topic and have the same types of items both in format and difficulty levels. I address how to design parallel tests in depth in chapters 2 and 3 (pages 39 and 59, respectively). Briefly, though, the more specific teachers are regarding the content students are to master and the various levels of difficulty, the easier it is for them to design parallel tests. To do this, a teacher designing a test must describe in adequate detail not only the content that demonstrates proficiency for a specific standard but also simpler content that will be directly taught and is foundational to demonstrating proficiency. Additionally, it is important to articulate what a student needs to know and do to demonstrate competence beyond the target level of proficiency. To illustrate, consider the following topic that might be the target for third-grade science.

      Students will understand how magnetic forces can affect two objects not in contact with one another.

      To make this topic clear enough that teachers can design multiple assessments that are basically the same in terms of the content and its levels of difficulty, it is necessary to expand this to a level of detail depicted in table I.3, which provides three levels of content for the topic. The target level clearly describes what students must do to demonstrate proficiency. The basic level identifies important, directly taught vocabulary and basic processes. Finally, the advanced level describes a task that demonstrates students’ ability to apply the target content.

| Level of Content | Content |
| --- | --- |
| Advanced | Students will design a device that uses magnets to solve a problem. For example, students will be asked to identify a problem that could be solved using the attracting and repelling qualities of magnets, and create a prototype of the design. |
| Target | Students will learn how magnetic forces can affect two objects not in contact with one another. For example, students will determine how magnets interact with other objects (including different and similar poles of other magnets), and experiment with variables that affect these interactions (such as orientation of magnets and distance between material or objects). |
| Basic | Students will recognize or recall specific vocabulary, such as attraction, bar magnet, horseshoe magnet, magnetic field, magnetic, nonmagnetic, north pole, or south pole. Students will perform basic processes, such as: • Explain that magnets create areas of magnetic force around them • Explain that magnets always have north and south poles • Provide examples of magnetic and nonmagnetic materials • Explain how two opposite poles interact (attracting) and two like poles interact (repelling) • Identify variables that affect strength of magnetic force (for example, distance between objects, or size) |

      Source: Adapted from Simms, 2016.

      The teacher now has three levels of content, all on the same topic, that provide specific directions on how to create classroom assessments on the same topic and the same levels of difficulty. I discuss how classroom teachers can do this in chapter 2 (page 39).
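One way to picture how the three levels guide test construction is a simple blueprint check. This is a simplified sketch under stated assumptions, not the procedure described in chapter 2: it treats two tests as comparable only in the narrow sense that they carry the same number of items at each level. The names `blueprint` and `same_blueprint` are hypothetical, and the item prompts are paraphrased from the table above.

```python
from collections import Counter

LEVELS = ("basic", "target", "advanced")


def blueprint(test_items):
    """Count items per content level; each item is a (level, prompt) pair."""
    counts = Counter(level for level, _ in test_items)
    unknown = set(counts) - set(LEVELS)
    if unknown:
        raise ValueError(f"unknown level(s): {sorted(unknown)}")
    return {level: counts.get(level, 0) for level in LEVELS}


def same_blueprint(test_a, test_b):
    """Return True when both tests have equal item counts at every level."""
    return blueprint(test_a) == blueprint(test_b)


# Hypothetical tests built from the magnetism topic.
test_1 = [
    ("basic", "Explain that magnets always have north and south poles."),
    ("basic", "Give an example of a nonmagnetic material."),
    ("target", "Determine how a magnet interacts with another magnet's like pole."),
    ("advanced", "Design a device that uses magnets to solve a problem."),
]
test_2 = [
    ("basic", "Explain that magnets create areas of magnetic force around them."),
    ("basic", "Give an example of a magnetic material."),
    ("target", "Test how distance between objects affects the interaction."),
    ("advanced", "Identify a problem magnets could solve and build a prototype."),
]

print(blueprint(test_1))               # {'basic': 2, 'target': 1, 'advanced': 1}
print(same_blueprint(test_1, test_2))  # True
```

Matching counts per level is only a necessary condition. Parallel tests as defined earlier must also match in topic, item format, and item difficulty, which a count alone cannot verify.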

      Teachers and administrators for grades K–12 will learn how to revamp the concepts of validity and reliability so they match the technical advances made in CA, instead of matching large-scale assessment’s traditional paradigms for validity and reliability. This introduction lays the foundation. It introduces the new validity and reliability paradigms constructed for CAs. Chapters 1–5 describe these paradigms in detail. Chapter 1 covers the new CA paradigm for validity, noting the qualities of three major types of validity and two perspectives teachers can take regarding classroom assessments. Chapter 2 then conveys the variety of CAs that teachers can use to construct parallel assessments, which measure students’ individual growth. Chapter 3 addresses the new CA paradigm for reliability and how it shifts from the traditional conception of reliability; it presents three mathematical models of reliability. Then, chapter 4 expresses how to measure groups of students’ comparative growth and what purposes this serves. Finally, chapter 5 considers helpful changes to report cards and teacher evaluations based on the new paradigms for CAs. The appendix features formulas that teachers, schools, and districts can use to compute the reliability of CAs in a manner that is comparable to the level of precision