Lord Rayleigh's Data
In Exploratory Data Analysis, Tukey (1977) demonstrates the box‐and‐whiskers plot using Lord Rayleigh's data, which measure the weight of nitrogen gas obtained by various means; see Table 1.1. Discrepancies in the results led to Rayleigh's discovery of the element argon. Rayleigh made
Table 1.1 Lord Rayleigh's 24 measurements (sorted) of the weight of a sample of nitrogen. The first 10 came from chemical samples, while the last 14 came from pure air.
2.29816 | 2.29849 | 2.29869 | 2.29889 | 2.29890 |
2.29940 | 2.30054 | 2.30074 | 2.30143 | 2.30182 |
2.30956 | 2.30986 | 2.31001 | 2.31010 | 2.31010 |
2.31012 | 2.31017 | 2.31024 | 2.31024 | 2.31026 |
2.31027 | 2.31028 | 2.31035 | 2.31163 |
Figure 1.3 Displays of Lord Rayleigh's 24 measurements of the weight of nitrogen gas. (Left) Histogram with four bins; (middle) a second histogram; (right) stem‐and‐leaf display using the
In the left frame of Figure 1.3, we display a histogram with four (carefully selected) bins. The histogram is shown on a density scale, rather than a frequency scale, so that the area of the shaded region is 1. We shall see in Problem 1 that this is accomplished by dividing the bin counts by
The first histogram in Figure 1.3 hides the interesting structure contained in this small dataset. The second histogram and the stem‐and‐leaf display show the two clusters quite clearly. Charting data was uncommon before the 1900s, and looking at a table of the data would typically not reveal this feature. It turned out that Lord Rayleigh had combined various sources of the gas with several purifying agents and extraction methods. The samples originating from “pure air” were “contaminated” with argon. For the discovery of argon, Lord Rayleigh was awarded the Nobel Prize in Physics in 1904.
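The effect of bin choice can be checked numerically. The sketch below is illustrative only: it uses the 24 values from Table 1.1, but the bin edges are hypothetical choices and are not claimed to be the ones used in Figure 1.3. The helper name `density_histogram` is likewise an assumption. Heights are put on a density scale by dividing each bin count by the sample size times the bin width, so the bar areas sum to 1.

```python
import numpy as np

# Lord Rayleigh's 24 sorted measurements, copied from Table 1.1.
rayleigh = np.array([
    2.29816, 2.29849, 2.29869, 2.29889, 2.29890,
    2.29940, 2.30054, 2.30074, 2.30143, 2.30182,
    2.30956, 2.30986, 2.31001, 2.31010, 2.31010,
    2.31012, 2.31017, 2.31024, 2.31024, 2.31026,
    2.31027, 2.31028, 2.31035, 2.31163,
])


def density_histogram(x, edges):
    """Return (counts, heights, widths) for a histogram on the density scale."""
    counts, edges = np.histogram(x, bins=edges)
    widths = np.diff(edges)
    # Density scale: bin count divided by (sample size * bin width).
    heights = counts / (x.size * widths)
    return counts, heights, widths


# Hypothetical bin edges (assumptions, not the book's choices).
coarse_edges = np.linspace(2.295, 2.315, 5)    # four bins of width 0.005
fine_edges = np.linspace(2.295, 2.315, 11)     # ten bins of width 0.002

coarse_counts, coarse_heights, coarse_widths = density_histogram(rayleigh, coarse_edges)
fine_counts, fine_heights, fine_widths = density_histogram(rayleigh, fine_edges)

# Total shaded area of each histogram (sum of height * width).
coarse_area = float(np.sum(coarse_heights * coarse_widths))
fine_area = float(np.sum(fine_heights * fine_widths))

# The largest gap between consecutive sorted values marks the split
# between the two clusters.
gaps = np.diff(rayleigh)
split = int(np.argmax(gaps)) + 1   # number of values in the lower cluster

print("coarse counts:", coarse_counts, "area:", coarse_area)
print("fine counts:  ", fine_counts, "area:", fine_area)
print("lower cluster size:", split, "upper cluster size:", rayleigh.size - split)
```

With these particular edges, every one of the four coarse bins is occupied, so the gap between the two groups is not visible, while the finer bins leave a run of empty bins between the clusters. The largest-gap split recovers the 10 chemical and 14 air measurements described in the caption of Table 1.1.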
1.1.3 Discussion
Finding structure in data is a primary goal of data science. Graphical methods are powerful approaches to discovering unexpected or hidden structure. Some of these methods are better suited to small datasets. In a multivariate statistics course, we will learn how to analyze data with more than one variable. Modern genetic datasets often result in more than
1.2 Exploring Prediction Using Data
The second fundamental task of statistics is prediction. Data for this task are typically ordered pairs,
The initial step is to plot a scatter diagram of the
1.2.1 Body and Brain Weights of Land Mammals
In the left frame of Figure 1.4, we plot the brain and body weights of 62 land mammals from the
However, Tukey introduced a power transformation ladder to re‐express a variable
see Problem 3 for an explanation of why
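As a rough illustration of a power ladder, the sketch below assumes the widely used Box-Cox form, in which a positive value x is re‐expressed as (x^λ − 1)/λ for λ ≠ 0 and as log(x) for λ = 0. This form is an assumption made for the example and may differ in detail from the definition used in the text; the function name `power_ladder` and the sample values are hypothetical and are not the mammal data.

```python
import numpy as np


def power_ladder(x, lam):
    """Box-Cox style re-expression of positive data (assumed form).

    Returns (x**lam - 1) / lam for lam != 0 and log(x) for lam == 0.
    """
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("this sketch handles positive values only")
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1.0) / lam


# Illustrative positive values spanning several orders of magnitude.
x = np.array([0.5, 1.0, 10.0, 1000.0])

# Step down the ladder: as lam shrinks toward 0, the re-expressed
# values approach log(x).
for lam in (1.0, 0.5, 0.1, 0.001, 0.0):
    print(f"lambda = {lam:<6}", power_ladder(x, lam))
```

Under this form, λ = 1 merely shifts the data by one unit, while smaller values of λ compress large values more and more strongly, with the logarithm as the limiting case at λ = 0.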
In the right frame of Figure 1.4, we use the log function to dramatic effect. There is clearly a strong relationship that allows highly accurate prediction of the log(brain weight) of a land mammal from its log(body weight). (The body weight