Robert Carver

Practical Data Analysis with JMP, Third Edition



Within this sample, are the different car models equally favored in the three different metropolitan areas? Discuss your analysis and explain what you have found.

      3. Scenario: High blood pressure continues to be a leading health problem in the U.S. We have a data table (NHANES 2016) containing survey data from nearly 10,000 people in the U.S. in 2017. For this analysis, we will focus on only the following variables:

      - RIAGENDR: Respondent’s gender

      - RIDAGEYR: Respondent’s age in years

      - RIDRETH1: Respondent’s racial or ethnic background

      - BMXWT: Respondent’s weight in kilograms

      - BPXPLS: Respondent’s resting pulse rate

      - BPXSY1: Respondent’s systolic blood pressure (“top” number in BP)

      - BPXD1: Respondent’s diastolic blood pressure (“bottom” number in BP)

      a. Create a scatterplot of systolic blood pressure versus age. Within this sample, what tends to happen to blood pressure as people age?

      b. Compute and report the correlation between systolic and diastolic blood pressure. What does this correlation tell you?

      c. Use a bubble plot to incorporate gender (Color) and pulse rate (Bubble Size) into the graph. Comment on what you see.

      d. Compare the distribution of systolic blood pressure in males and females. Report on what you find.

      e. Compare the distribution of systolic blood pressure by racial/ethnic background. Comment on any noteworthy differences that you find.

      f. Create a scatterplot of systolic blood pressure and pulse rate. One might suspect that higher pulse rate is associated with higher blood pressure. Does the analysis bear out this suspicion?
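For readers who want to see the arithmetic behind the correlation requested in item b, here is a minimal sketch in Python. JMP reports this statistic for you; the code only illustrates how a Pearson correlation coefficient is computed. The function name `pearson_r` and the five pairs of readings are made up for illustration and are not values from the NHANES data table.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length numeric lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up illustrative readings, NOT values from the NHANES table.
systolic = [110, 120, 130, 140, 150]
diastolic = [72, 74, 83, 84, 92]

r = pearson_r(systolic, diastolic)
print(f"r = {r:.3f}")
```

A value of r near +1 or -1 indicates a strong linear association, and a value near 0 indicates a weak one. The coefficient describes only the linear pattern within the sample, so it is worth reading alongside the scatterplot.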

      4. Scenario: Despite well-documented health risks, tobacco is used widely throughout the world. The Tobacco data table provides information about several variables for 133 different nations in 2005, including these:

      - TobaccoUse: Prevalence of tobacco use (%) among adults 18 and older (both sexes)

      - Female: Prevalence of tobacco use among females, 18 and older

      - Male: Prevalence of tobacco use among males, 18 and older

      - CVMort: Age-standardized mortality rate for cardiovascular diseases (per 100,000 population in 2002)

      - CancerMort: Age-standardized mortality rate for cancers (per 100,000 population in 2002)

      a. Compare the prevalence of tobacco use across the regions of the world, and comment on what you see.

      b. Create a scatterplot of cardiovascular mortality versus prevalence of tobacco use (both sexes). Within this sample, describe the relationship, if any, between these two variables.

      c. Create a scatterplot of cancer mortality versus prevalence of tobacco use (both sexes). Within this sample, describe the relationship, if any, between these two variables.

      d. Compute and report the correlation between male and female tobacco use. What does this correlation tell you?

      e. Create a bubble plot that augments your scatterplot from item c above by incorporating region (color) and cardiovascular mortality (bubble size). Comment on what you find in the graph.

      5. Scenario: Since 2003, the U.S. Bureau of Labor Statistics has been conducting the biennial American Time Use Survey. Census workers use telephone interviews and a complex sampling design to measure the amount of time people devote to various ordinary activities. We have some of the survey data in the data table called TimeUse. Our data table contains observations from more than 43,191 different respondents in 2003, 2007, and 2017. For these questions, use the Data Filter to select and include just the 2017 responses.

      a. Create a crosstabulation of employment status by sex, and report on what you find.

      b. Create a crosstabulation of full versus part-time employment status by gender, and report on what you find.

      c. Compare the distribution of time spent sleeping across the employment categories. Report on what you find.

      d. Now change the data filter to include all rows. Compare the distribution of time spent on personal email in 2003, 2007, and 2017. Comment on your findings.
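The crosstabulations in items a and b amount to counting how many respondents fall into each combination of two categories. The following Python sketch shows that counting logic, along with row percentages, under stated assumptions: the field names `sex` and `status` and the seven records are hypothetical and do not come from the TimeUse data table.

```python
from collections import Counter

def crosstab(records, row_field, col_field):
    """Count how many records fall into each (row, column) category pair."""
    return Counter((rec[row_field], rec[col_field]) for rec in records)

def row_percentages(table):
    """Convert cell counts to percentages of each row total."""
    row_totals = Counter()
    for (row, _col), count in table.items():
        row_totals[row] += count
    return {cell: 100.0 * count / row_totals[cell[0]]
            for cell, count in table.items()}

# Made-up records for illustration only; NOT rows from the TimeUse table.
# The field names "sex" and "status" are hypothetical.
respondents = [
    {"sex": "Female", "status": "Employed"},
    {"sex": "Female", "status": "Employed"},
    {"sex": "Female", "status": "Not employed"},
    {"sex": "Male", "status": "Employed"},
    {"sex": "Male", "status": "Not employed"},
    {"sex": "Male", "status": "Not employed"},
    {"sex": "Male", "status": "Not employed"},
]

counts = crosstab(respondents, "sex", "status")
percents = row_percentages(counts)
for cell in sorted(counts):
    print(cell, counts[cell], f"{percents[cell]:.1f}%")
```

Comparing row percentages rather than raw counts makes it easier to judge whether the two groups differ, because the groups rarely have the same number of respondents.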

      6. Scenario: The data table Sleeping Animals contains information about the sizes, life spans, and sleeping habits of various mammal species. The last few columns are ordinal variables classifying the animals according to their comparative risks of being victimized by predators and the degree to which they sleep in the open rather than in enclosed spaces.

      a. Create a crosstabulation of predation index by exposure index, and report on what you find.

      b. Compare the distribution of hours of sleep across values of the danger index. Report on what you find.

      c. Create a scatterplot of total sleep time and life span for these animals. What does the graph tell you?

      d. Compute the correlation between total sleep time and life span for these animals. What does the correlation tell you?

      7. Scenario: Let’s return to the data table FAA Bird Strikes CA. The FAA includes categorical variables pertaining to the number of birds struck, the size of the birds struck, and the general weather conditions.

      a. Create a crosstabulation of number of birds struck versus sky conditions (Sky), and report on what you find.

      b. Create a crosstabulation of number of birds struck versus the precipitation conditions (Precip), and report on what you find.

      c. Investigate the relationship between the number of birds struck and the speed of the aircraft. Write a sentence to describe that relationship.

      d. Investigate the relationship between the number of birds struck and the height of the aircraft. Write a sentence to describe that relationship.

      8. Scenario: Every ten years, the United States conducts a census of the population, gathering considerable data about the nation and its residents. The data table called USA Counties contains demographic, economic, commercial, educational, and other data about each of the 3,143 counties in the United States as of 2010.

      a. Create a scatterplot of median household income (Y) versus percent of the population with a bachelor’s degree. Comment on what you see.

      b. Compute and report the line of best fit for these data. Use that line to estimate the median household income in counties with 25% of the population holding bachelor’s degrees.

      c. Create and report on a scatterplot between the percentage of households where a foreign language is spoken in the home (foreign_spoken_at_home) and the percentage of households with a foreign-born member (foreign_born). How do you explain the distinctive pattern in the graph?

      d. Compute and explain the correlation coefficient for the two variables in item c above.

      e. Estimate the line of best fit using the population as determined by the 2010 US Census as Y and the 2000 population count as X. Think about the slope of this line. What does it tell us about what happened to the average of US counties’ populations between 2000 and 2010?

      f. The point representing Cook County, Illinois, is distinctive in that it lies below the red estimated line (2000 population was 5,194,675). According to this fitted line, what was unusual about Cook County in comparison to other counties of the United States?
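Items b and e ask for a line of best fit and for an estimate read from that line. As a rough illustration of the least-squares arithmetic, here is a short Python sketch. The function name `fit_line` and the five data pairs are invented for illustration; they are not values from the USA Counties data table, and JMP performs this calculation for you.

```python
def fit_line(xs, ys):
    """Least-squares line y = intercept + slope * x for paired numeric lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up illustrative values, NOT data from the USA Counties table.
pct_bachelors = [10, 15, 20, 30, 40]                  # percent with a bachelor's degree
median_income = [35000, 40000, 44000, 56000, 65000]   # dollars

intercept, slope = fit_line(pct_bachelors, median_income)

# Estimate Y at X = 25 by substituting into the fitted equation.
estimate = intercept + slope * 25
print(f"income = {intercept:.0f} + {slope:.0f} * pct; estimate at 25%: {estimate:.0f}")
```

The slope is the estimated change in Y for each one-unit increase in X. In item e, where X and Y are both population counts, a slope above 1 would suggest that county populations grew on average between the two censuses, and a slope below 1 would suggest that they shrank.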

      Chapter 5: Review of Descriptive Statistics

      Overview

      The World Development Indicators

      Questions for Analysis

      Applying an Analytic Framework

      Preparation for Analysis

      Univariate Descriptions

      Explore