Robert Carver

Practical Data Analysis with JMP, Third Edition



      Chapter 2: Data Sources and Structures

       Overview

       Populations, Processes, and Samples

       Representativeness and Sampling

       Cross-Sectional and Time Series Sampling

       Study Design: Experimentation, Observation, and Surveying

       Creating a Data Table

       Raw Case Data and Summary Data

       Application

      This chapter is about data. More specifically, it is about how we make choices when we gather or generate data for analysis in a statistical investigation, and how we store that data in JMP files for further analysis. This chapter introduces some data storage concepts that are further developed in Appendix B. How and where we gather data form the foundation for the types of conclusions we can draw and the decisions we can make. After reading this chapter, you should have a solid foundation to build upon.

      We analyze data because we want to understand something about variation within a population—a collection of people or objects or phenomena that we are interested in. Sometimes we are more interested in the variability of a process—an ongoing natural or artificial activity (like the occurrences of earthquakes or fluctuating stock prices). We identify one or more variables of interest and either count or measure them.

      In most cases, it is impossible or impractical to gather data from every single individual within a population or every instance from an ongoing process. Marine biologists who study communication among dolphins cannot possibly measure every living dolphin. Manufacturers wanting to know how quickly a building material degrades in the outdoors cannot destroy 100% of their products through testing or they will have nothing left to sell. Thanks to huge databases and powerful software, financial analysts interested in future performance of a stock can analyze every single past trade for that stock, but they cannot analyze trades that have not yet occurred.

      Instead, we typically analyze a sample of individuals selected from a population or process. Ideally, we choose the individuals in such a way that we can have some confidence that they are a “second-best” substitute or stand-in for the entire population or ongoing process. In later chapters, we will devote considerable attention to the adjustments that we can make before generalizing sample-based conclusions to a larger population.

      As we begin to learn about data analysis, it is important to be clear about the roles and relationship of a population and a sample. In many statistical studies, the situation is as follows (this example refers to a population, but the same applies to a process):

      ● We are interested in the variation of one or more attributes of a population. Depending on the scenario, we may wish to anticipate the variation, influence it, or just understand it better.

      ● Within the context of the study, the population consists of individual observational units, or cases. We cannot gather data from every single individual within the population.

      ● Hence, we choose some individuals from the population and observe them to generalize about the entire population.

      We gather and analyze data from the sample of individuals instead of doing so for the whole population. The individuals within the sample are not the group that we are ultimately interested in knowing about. We really want to learn about the variability within the population. When we use a sample instead of a population, we run some risk that the sample will misrepresent the population—and that risk is at the center of statistical reasoning.

      Depending on just what we want to learn about a process or population, we also concern ourselves with the method by which we generate and gather data. Do we want to characterize or describe the extent of temperature variation that occurs in one part of the world? Or do we want to understand how patients with a disease respond to a specific dosage of a medication? Or do we want to predict which incentives are most likely to induce consumers to buy a product?

      There are two types of generalizations that we might eventually want to be able to make, and in practice, statisticians rely on randomization in data generation as the logical path to generalization. If we want to use sample data to characterize patterns of variation in an entire population or process, it is valuable to select sample cases using a probability-based or random process. If we ultimately want to draw definitive conclusions about how variation in one variable is caused or influenced by another, then it is essential to randomly control or assign values of the suspected causal variable to individuals within the sample.

      Therefore, the design strategy in sampling is terrifically important in the practical analysis of data. We also want to think about the time frame for sampling. If we are interested in the current state of a population, we should select a cross-section of individuals at one time. On the other hand, if we want to see how a process unfolds over time, we should select a time series sample by observing the same individual repeatedly at specific intervals. Some samples (often referred to as panel data) consist of a list of individuals observed repeatedly at a regular interval.
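The distinction between a cross-section and a panel can be seen in the shape of the resulting data table. The following is a minimal Python sketch (the book itself works in JMP; this example, including the unit labels and income figures, is purely illustrative and not from the book): a cross-sectional sample holds one row per individual at a single time, while a panel holds one row per individual per period.

```python
# Hypothetical example: the same income measurement stored two ways.

# Cross-sectional sample: one observation per individual, all at one time.
cross_section = [
    {"unit": "A", "year": 2023, "income": 51_000},
    {"unit": "B", "year": 2023, "income": 47_500},
    {"unit": "C", "year": 2023, "income": 63_200},
]

# Panel sample: the SAME individuals observed repeatedly at a regular interval.
panel = [
    {"unit": "A", "year": 2022, "income": 48_000},
    {"unit": "A", "year": 2023, "income": 51_000},
    {"unit": "B", "year": 2022, "income": 46_000},
    {"unit": "B", "year": 2023, "income": 47_500},
]

# Each unit appears once in the cross-section, but once per period in the panel.
rows_per_unit_panel = {u: sum(1 for r in panel if r["unit"] == u)
                       for u in {r["unit"] for r in panel}}
print(len(cross_section), len(panel), rows_per_unit_panel)
```

A time series sample has the same repeated-measurement structure as the panel, but for a single individual (or process) observed across many periods.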

      If we plan to draw general conclusions about a population or process from one sample, it’s important that we can reasonably expect the sample to represent the population. Whenever we rely on sample information, we run the risk that the sample could misrepresent the population. (In general, we call this sampling error.) Statisticians have several standard methods for choosing a sample. No one method can guarantee that a single sample accurately represents the population, but some methods carry smaller risks of sampling error than others. What’s more, some methods have predictable risks of sampling error, while others do not. As you will see later in the book, if we can predict the extent and nature of the risk, then we can generalize from a sample; if we cannot, we sacrifice our ability to generalize. JMP can accommodate different methods of representative sampling, both by helping us to select such samples and by taking the sampling method into account when analyzing data. At this point, we focus on understanding different approaches to sampling by examining data tables that originated from different designs. We will also take a first look at using JMP to select representative samples. In Chapters 8 and 21, we will revisit the subject more deeply.
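Sampling error can be made concrete with a short simulation. The sketch below is not from the book (the book uses JMP for this kind of work); it is a stdlib-Python illustration, assuming a synthetic population with a known mean of 100. Repeated simple random samples produce sample means that scatter around the true population mean, and the scatter is predictably smaller for larger samples, which is exactly the predictable risk that lets us generalize.

```python
import random
import statistics

random.seed(42)  # fixed seed so the simulation is reproducible

# A synthetic population of 100,000 values with true mean near 100.
population = [random.gauss(100, 15) for _ in range(100_000)]
true_mean = statistics.mean(population)

def sample_means(n, trials=500):
    """Means of `trials` simple random samples, each of size n."""
    return [statistics.mean(random.sample(population, n)) for _ in range(trials)]

# The spread of sample means measures sampling error for that sample size.
spread_small = statistics.stdev(sample_means(n=25))
spread_large = statistics.stdev(sample_means(n=400))

print(round(true_mean, 1))
print(spread_small > spread_large)  # larger samples -> smaller sampling error
```

No single sample is guaranteed to match the population, but the simulation shows why the risk of a badly unrepresentative sample shrinks, in a quantifiable way, as the sample grows.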

      Simple Random Sampling

      The logical starting point for a discussion of representative sampling is the simple random sample (SRS). Imagine a population consisting of N elements (for example, a lake with N = 1,437,652 fish), from which we want to take an SRS of n = 200 fish. With a little thought, we recognize that there are many different 200-fish samples that we might draw from the lake. If we use a sampling method that ensures that all 200-fish samples have the same chance of being chosen, then any sample we take with that method is an SRS. Essentially, we depend on the probabilities involved in random sampling to produce a representative sample.

      Simple random sampling requires that we have a sampling frame, or a list of all members of a population. The sampling frame could be a list of students in a university, firms in an industry, or members of an organization. To illustrate, we will start with a list of the countries in the world and see one way to select an SRS. For the sake of this example, suppose we want to draw a simple random sample of 20 countries for in-depth research.

      There are several ways to select and isolate a simple random sample drawn from a JMP data table. In this illustration, we will first randomly