Practical Data Analysis with JMP, Third Edition. Robert Carver. Читать онлайн. Hotlib. HOTLIB.NET

Practical Data Analysis with JMP, Third Edition

of a data table containing experimental data, open the data table called Concrete. Professor I-Cheng Yeh of Chung-Hua University in Taiwan measured the compressive strength of concrete prepared with varying formulations of seven different component materials. Compressive strength is the amount of force per unit of area, measured here in megapascals that the concrete can withstand before failing. Think of the concrete foundation walls of a building; they need to be able to support the mass of the building without collapsing. The purpose of the experiment was to develop an optimal mixture to maximize compressive strength.

1. Select File ► Open. Choose Concrete and click OK.

Figure 2.4: The Concrete Data Table (Within Project)

The first seven columns in the data table in Figure 2.4 represent factor variables. Professor Yeh selected specific quantities of the seven component materials and then tested the compressive strength as the concrete aged. The eighth column, Age, shows the number of days elapsed since the concrete was formulated, and the ninth column is the response variable, Compressive Strength. In the course of his experiments, Professor Yeh repeatedly tested different formulations, measuring compressive strength after varying numbers of days. To see the structure of the data, let’s look more closely at the two columns.

8. Choose Analyze ► Distribution.

9. As shown in Figure 2.5, select the Cement and Age columns and click OK.

Figure 2.5: The Distribution Dialog Box

The Distribution platform generates two graphs and several statistics. We will study these in detail in Chapter 3. For now, you just need to know that the graphs, called histograms, display the variation within the two data columns. The lengths of the bars indicate the number of rows corresponding to each value. For example, there are many rows with concrete mixtures containing about 150 kg of cement, and very few with 100 kg.

10. Move your cursor over the Cement histogram and click the bar corresponding to values just below 400 kg of cement.

When you click that bar, the entire bar darkens, indicating that the rows corresponding to mixtures with that quantity (380 kg of cement, it turns out) are selected. Additionally, small portions of several bars in the Age histogram are also darkened, representing the same rows.

11. To see the relationships between a histogram bar, the data table, and the other histogram, we want to make the Distribution Report and Data Table visible at the same time. Move your cursor over the tab labeled Concrete – Distribution of Cement, Age. Click and drag downward; release the button over the Dock Above zone. The windows should look like Figure 2.6. Save your project once more.

Figure 2.6: Select Rows by Selecting One Bar in a Graph

Finally, look at the data table. Within the data grid, several visible rows are highlighted. All of them share the same value of Cement. Within the Rows panel, we see that we have now altered the row state of 76 rows by selecting them.

Look more closely at rows 7 and 8, which are the first two of the selected rows. Both represent seven-part mixtures following the same recipe of the seven materials. Observation #7 was taken at 365 days and observation #8 at 28 days. These observations are not listed chronologically, but rather are a randomized sequence typical of an experimental data table.

Row 16 is the next selected row. This formulation shares several ingredients in the same proportion as rows 7 and 8, but the amounts of other ingredients differ. This is also typical of an experiment. The researcher selected and tested different formulations. Because the investigator, Professor Yeh, manipulated the factor values, we find this type of repetition within the data table. And because Professor Yeh was able to select the factor values deliberately and in a controlled way, he was able to draw conclusions about which mixtures will yield the best compressive strength.

12. Before proceeding to the next section, restore the Distribution report to its original position. Move your cursor over the title bar Concrete – Distribution of Cement, Age. Click, drag, and release near the center of the screen over the Dock Tab drop zone.

Observational Data—An Example

Of course, experimentation is not always practical, ethical, or legal. Medical and pharmaceutical researchers must follow extensive regulations, can experiment with dosages and formulations of medications, and may randomly assign patients to treatment groups, but cannot randomly expose patients to diseases. Investors do not have the ability to manipulate stock prices (if they do and get caught, they go to prison).

Now open the data table called Stock Index Weekly. This data table contains time series data for six different major international stock markets during 2008, the year in which a worldwide financial crisis occurred. A stock index is a weighted average of the prices of a sample of publicly traded companies. In this table, we find the weekly index values for the following stock market indexes, as well as the average number of shares traded per week:

● Nikkei 225 Index – Tokyo

● FTSE100 Index – London

● S&P 500 Index – New York

● Hang Seng Index – Hong Kong

● IGBM Index – Madrid

● TA100 Index – Tel Aviv

The first column is the observation date. Note that the observations are basically every seven days after the second week. The other columns simply record the index and volume values as they occurred. It was not possible to set or otherwise control any of these values.

Survey Data—An Example

Survey research is conducted in the social sciences, in public health, and in business. In a survey, researchers pose carefully constructed questions to respondents, and then the researchers or the respondents themselves record the answers. In survey data, we often find coding schemes where categorical values are assigned numeric codes or other abbreviations. Sometimes continuous variables, like income or age, are converted to ordinal columns.

As an example of a well-designed survey data table, we have a small portion of the responses provided in the 2006 administration of the National Health and Nutrition Examination Survey (NHANES). NHANES is an annual survey of people in the U.S. that looks at their diet, health, and lifestyle choices. The Centers for Disease Control (CDC) posts the raw survey data for public access.2

1. Open the NHANES 2016 data table.

As shown in Figure 2.7, this data table contains some features that we have not seen before. As usual, each column holds observations for a single variable, but these column names are not very informative. Typical of many large-scale surveys, the NHANES website provides both the data and a data dictionary that defines each variable, including information about coding and units of measurement. For our use in this book, we have added some notations to each column.

Figure 2.7: NHANES Data Table

13. In the Columns panel, move your cursor to highlight the column named RIAGENDR and right-click. Select Column info… to open the dialog box shown in Figure 2.8.

Figure 2.8: Customizing Column Metadata

When this dialog box opens, we find that this column holds numeric data, and its modeling type is nominal—which might seem surprising since the rows in the data table all say Male or Female. In fact, the lower portion of the column info dialog box shows us what is going on here. This is an example of coded data: the

Скачать книгу