In the preceding section, we had to make some pretty restrictive assumptions (normality, known mean, known variance) in order to make statistical inferences. We now explore the connection between samples and populations a little more closely so that we can draw conclusions using fewer assumptions.
Recall that the population is the entire collection of objects under consideration, while the sample is a (random) subset of the population. Sometimes we may have a complete listing of the population (a census), but most of the time a census is too expensive and time-consuming to conduct. Moreover, it is seldom necessary to consider an entire population in order to make some fairly strong statistical inferences about it using just a random sample.
We are particularly interested in making statistical inferences not only about values in the population, denoted Y, but also about numerical summary measures of those values, such as the population mean, denoted E(Y); such summary measures are called population parameters.
Next we will see how statistical inference essentially involves estimating population parameters (and assessing the precision of those estimates) using sample statistics. When our sample is a randomly selected subset of the population, statistics calculated from the sample can tell us a great deal about the corresponding population parameters. For example, a sample mean tends to be a good estimate of the population mean, in the following sense. If we were to take random samples over and over again, each time calculating a sample mean, then the mean of all these sample means would be equal to the population mean. (There may seem to be a surfeit of “means” in that last sentence, but if you read it slowly enough it will make sense.) Such an estimate is called unbiased since on average it estimates the correct value. It is not actually necessary to take random samples over and over again to show this—probability theory (beyond the scope of this book) allows us to prove such theorems without the time and expense of collecting a large number of samples.
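Even so, a quick simulation can make this averaging behavior concrete. The sketch below assumes NumPy is available; the gamma-shaped population, the seed, and the sizes are purely illustrative choices, not the home prices data from this chapter.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population of 100,000 values (illustrative only);
# its mean plays the role of the population mean.
population = rng.gamma(shape=2.0, scale=10.0, size=100_000)
print(f"population mean: {population.mean():.3f}")

# Take random samples over and over again, each time calculating
# a sample mean.
n_samples, n = 10_000, 30
sample_means = np.array([
    rng.choice(population, size=n, replace=False).mean()
    for _ in range(n_samples)
])

# Unbiasedness: the mean of all these sample means comes out very
# close to the population mean.
print(f"mean of {n_samples:,} sample means: {sample_means.mean():.3f}")
```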
However, it is not enough to have sample statistics (such as the sample mean) that average out (over a large number of hypothetical samples) to the correct target (i.e., the population mean). We would also like sample statistics that have “low” variability from one hypothetical sample to another. At the very least we need to be able to quantify this variability, known as sampling uncertainty. One way to do this is to consider the sampling distribution of a statistic, that is, the distribution of values of a statistic under repeated (hypothetical) samples. Again, we can use results from probability theory to tell us what these sampling distributions are. So, all we need to do is take a single random sample and calculate a statistic, and we will know the theoretical sampling distribution of that statistic (i.e., we will know what the statistic should average out to over repeated samples, and how much the statistic should vary over repeated samples).
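Sampling uncertainty can also be seen directly in simulation. The sketch below (same illustrative population as above; none of these numbers come from the chapter's example) quantifies how much the sample mean varies across repeated hypothetical samples.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative population and repeated hypothetical samples of size n.
population = rng.gamma(shape=2.0, scale=10.0, size=100_000)
n = 30
sample_means = np.array([
    rng.choice(population, size=n, replace=False).mean()
    for _ in range(10_000)
])

# Sampling uncertainty: how much the statistic varies across samples.
print(f"spread of individual population values: {population.std():.3f}")
print(f"spread of the sample means:             {sample_means.std():.3f}")
# The sample mean varies far less from sample to sample than
# individual values vary within the population.
```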
1.4.1 Central limit theorem—normal version
Suppose that a random sample of n data values, represented by Y1, Y2, ..., Yn, comes from a population that has a mean of E(Y) and a standard deviation of SD(Y). The central limit theorem states that the sampling distribution of the sample mean is approximately normal, with mean E(Y) and standard deviation SD(Y)/√n.
An amazing fact about the central limit theorem is that there is no need for the population itself to be normal (remember that we had to assume this for the calculations in Section 1.3). However, the more symmetric the distribution of the population, the better the normal approximation for the sampling distribution of the sample mean. Also, the approximation tends to be better the larger the sample size, n.
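A small sketch can illustrate this fact, drawing samples from a clearly non-normal population and comparing the sampling distribution of the mean with what the central limit theorem predicts. NumPy and SciPy are assumed available, and the Exponential(1) population is an illustrative choice, not anything from the chapter.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# A clearly non-normal, right-skewed population: Exponential(1),
# for which E(Y) = 1 and SD(Y) = 1.
n, reps = 30, 10_000
pop_mean, pop_sd = 1.0, 1.0

# One sample mean per hypothetical sample of size n.
sample_means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

# Central limit theorem predictions: mean E(Y), sd SD(Y)/sqrt(n).
print(f"predicted mean {pop_mean:.4f}, observed {sample_means.mean():.4f}")
print(f"predicted sd   {pop_sd / np.sqrt(n):.4f}, observed {sample_means.std():.4f}")

# Rough normality check: a central 95% normal interval should capture
# about 95% of the simulated sample means.
lo, hi = stats.norm.interval(0.95, loc=pop_mean, scale=pop_sd / np.sqrt(n))
print(f"coverage: {np.mean((sample_means > lo) & (sample_means < hi)):.3f}")
```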
So, how can we use this information? Well, the central limit theorem by itself will not help us to draw statistical inferences about the population without still having to make some assumptions. However, it is certainly a step in the right direction, so let us see what kind of calculations we can now make for the home prices example. As in Section 1.3, we will assume that the population standard deviation, SD(Y), is known.
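The sketch below shows the kind of calculation the central limit theorem now permits. Only the sample size, n = 30, comes from the example; the population mean and standard deviation used here are placeholder values, not the chapter's actual figures.

```python
import numpy as np
from scipy import stats

# Placeholder values (NOT the chapter's actual figures), in $ thousands:
pop_mean = 280.0   # assumed known population mean, E(Y)
pop_sd = 50.0      # assumed known population standard deviation, SD(Y)
n = 30             # sample size from the home prices example

# By the central limit theorem, the sample mean is approximately normal
# with standard deviation SD(Y)/sqrt(n), often called the standard error.
se = pop_sd / np.sqrt(n)
print(f"standard error of the sample mean: {se:.3f}")

# For example: the probability that a sample mean falls within 10
# (thousand dollars) of the population mean.
z = 10 / se
print(f"P(|sample mean - E(Y)| < 10) = {stats.norm.cdf(z) - stats.norm.cdf(-z):.3f}")

# A range that should contain the sample mean 95% of the time.
lo, hi = stats.norm.interval(0.95, loc=pop_mean, scale=se)
print(f"95% of sample means should fall in ({lo:.1f}, {hi:.1f})")
```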
First, we need to get some notation straight. In this section, we are not thinking about the specific sample mean we got for our actual sample of 30 sale prices. Rather, we are thinking about the sample mean as a random variable: the value it would take for a generic random sample of size 30, which could change from one hypothetical sample to the next.
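In symbols, writing Ȳ for this random variable (a standard notation; the chapter's own symbol may differ), the normal version of the central limit theorem can be written as follows.

```latex
% Approximate sampling distribution of the sample mean for a random
% sample of size n; the second argument is the standard deviation,
% SD(Y)/sqrt(n), not the variance.
\bar{Y} \;\overset{\cdot}{\sim}\;
  \mathrm{N}\!\left( E(Y),\; \frac{SD(Y)}{\sqrt{n}} \right)
```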