in Europe.
The second task is that of prediction. A bank may wish to understand how credit risk is related to other available information. A mechanical engineer may wish to understand the risk inherent in a new design under extreme conditions. Methods for performing this task underlie many algorithms today, for example, those for language translation and image recognition.
The mathematical backbone of all of our statistical methods is probability theory. Thus we study the basics of probability theory and random variables in the first part of this course. Statistical methods and the basics of statistical decision theory form the core of the middle third of this course. Specific tests and data analysis approaches finish our study.
1.1 Exploring the Distribution of Data
Tukey (1977) introduced a number of data summaries in his book Exploratory Data Analysis. Many are based on quantiles or percentiles of the data vector. Percentiles are particular values selected from the sorted data. The middlemost is the median, or the 50th percentile. As a measure of spread, Tukey focused on the distance from the 25th to the 75th percentiles, the so‐called interquartile range (IQR). A three‐point summary would list these three percentiles. Instead, Tukey popularized the box‐and‐whiskers plot, which is a five‐point summary. The additional two points are intended to capture 99% of the data. These are drawn at a distance of 1.5 × IQR beyond the 25th and 75th percentiles.
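To make the summary concrete, the following minimal sketch computes the quartiles, the IQR, and the whisker ends for a small data vector. It assumes Tukey's conventional multiplier k = 1.5 for the fences and the "inclusive" quantile convention of Python's statistics module; other conventions give slightly different quartiles. The function name tukey_summary and the sample values are hypothetical, chosen only for illustration. The last lines check the 99% claim for normally distributed data.

```python
import statistics
from statistics import NormalDist


def tukey_summary(data, k=1.5):
    """Quartiles, IQR, whisker ends, and outside points for a data vector.

    k is the fence multiplier; 1.5 is assumed here as the usual convention.
    """
    xs = sorted(data)
    # With n=4, quantiles() returns the 25th, 50th, and 75th percentiles.
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - k * iqr, q3 + k * iqr
    # Whiskers end at the most extreme observations inside the fences.
    inside = [x for x in xs if lower_fence <= x <= upper_fence]
    outliers = [x for x in xs if x < lower_fence or x > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }


# Hypothetical data vector, not taken from the text.
values = [2, 4, 4, 5, 6, 7, 8, 9, 30]
summary = tukey_summary(values)
print(summary)

# Fraction of a normal population that falls between the fences.
z = NormalDist()
q3_normal = z.inv_cdf(0.75)                # 75th percentile of N(0, 1)
fence = q3_normal + 1.5 * (2 * q3_normal)  # the IQR of N(0, 1) is 2 * q3_normal
coverage = z.cdf(fence) - z.cdf(-fence)
print(round(coverage, 3))
```

For the sample vector, the single value 30 lies beyond the upper fence and would be plotted individually. For normal data the fences enclose roughly 99.3% of the population, which is the sense in which the whiskers capture about 99% of the data.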
1.1.1 Pearson's Father–Son Height Data
We illustrate these ideas on a set of data collected by Karl Pearson over a century ago. He recorded the heights of
In the middle frame of Figure 1.1, we show Tukey's stem‐and‐leaf plot of the 1078 differences of the heights of each son and his father. The range of the data is
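The construction of a stem‐and‐leaf display can be sketched in a few lines. This is a simplified illustration, not the routine used to produce Figure 1.1: each value is rounded to an integer number of leaf units, the last digit becomes the leaf, and the remaining digits form the stem. Negative values get their own stems, including a separate "-0" row, and empty stems are omitted. The function name stem_and_leaf, the leaf_unit parameter, and the sample differences are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values, leaf_unit=1):
    """Return the rows of a simple stem-and-leaf display as strings."""
    rows = defaultdict(list)
    for v in sorted(values):
        # Round to a whole number of leaf units (Python rounds halves to even).
        scaled = round(v / leaf_unit)
        stem, leaf = divmod(abs(scaled), 10)
        # Key on the sign as well, so that -3 and 3 land in different rows.
        rows[(scaled < 0, stem)].append(leaf)
    # Negative stems run from most negative up to -0, then 0 upward.
    negative = sorted((key for key in rows if key[0]), key=lambda key: -key[1])
    positive = sorted((key for key in rows if not key[0]), key=lambda key: key[1])
    lines = []
    for is_negative, stem in negative + positive:
        label = ("-" if is_negative else "") + str(stem)
        leaves = "".join(str(digit) for digit in rows[(is_negative, stem)])
        lines.append(f"{label:>3} | {leaves}")
    return lines


# Hypothetical differences, not Pearson's data.
diffs = [-12, -3, -1, 0, 2, 5, 8, 11, 14, 14, 23]
lines = stem_and_leaf(diffs)
print("\n".join(lines))
```

Because the values are processed in sorted order, the leaves within each row appear in increasing order of the data, so the display doubles as a sorted listing and a sideways histogram.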
In the right frame of Figure 1.1, we show the frequency counts in a histogram. The histogram uses a parameter, the bin width. The default in hist is Sturges' rule, discussed in Section 9.1.4.3, which chooses 11 bins with The choice of
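As a rough illustration of how the bin width drives a histogram, the sketch below counts observations in bins of a chosen width and evaluates the usual textbook form of Sturges' rule, a bin count of ceil(log2 n) + 1. The function names sturges_bins and histogram_counts, the origin parameter, and the sample values are hypothetical. For n = 1078 the raw formula gives 12; plotting software commonly rounds the break points to convenient values, so the number of bins actually drawn can differ slightly, and this sketch does not attempt to reproduce that rounding.

```python
import math


def sturges_bins(n):
    """Bin count from Sturges' rule in its usual form: ceil(log2(n)) + 1."""
    return math.ceil(math.log2(n)) + 1


def histogram_counts(data, h, origin=0.0):
    """Count observations in bins [origin + k*h, origin + (k+1)*h).

    Returns a dict mapping the bin index k to its frequency count.
    """
    counts = {}
    for x in data:
        k = math.floor((x - origin) / h)
        counts[k] = counts.get(k, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical sample, not Pearson's data.
sample = [-1.5, -0.2, 0.3, 0.9, 1.1, 2.4, 2.6, 3.8]
narrow = histogram_counts(sample, h=1.0)
wide = histogram_counts(sample, h=2.0)
print(sturges_bins(1078), narrow, wide)
```

Doubling the bin width merges neighboring bins and smooths the counts, while halving it exposes more detail along with more noise.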
Figure 1.1 Displays of the father–son height data collected by Karl Pearson: (left) box‐and‐whiskers plot; (middle) stem‐and‐leaf plot; (right) histogram.
Figure 1.2 Histograms of the sons' heights (top row) and fathers' heights (bottom row) using three bin widths:
1.1.2