Iain Pardoe

Applied Regression Modeling



In the preceding section, we had to make some pretty restrictive assumptions (normality, known mean, known variance) in order to make statistical inferences. We now explore the connection between samples and populations a little more closely so that we can draw conclusions using fewer assumptions.

      Recall that the population is the entire collection of objects under consideration, while the sample is a (random) subset of the population. Sometimes we may have a complete listing of the population (a census), but most of the time a census is too expensive and time consuming to collect. Moreover, it is seldom necessary to consider an entire population in order to make some fairly strong statistical inferences about it using just a random sample.

      We are particularly interested in making statistical inferences not only about values in the population, denoted Y, but also about numerical summary measures such as the population mean, denoted E(Y)—these population summary measures are called parameters. While population parameters are unknown (in the sense that we do not have all the individual population values and so cannot calculate them), we can calculate similar quantities in the sample, such as the sample mean—these sample summary measures are called statistics. (Note the dual use of the term “statistics.” Up until now it has represented the notion of a general methodology for analyzing data based on probability theory, and just now it was used to represent a collection of summary measures calculated from sample data.)
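The parameter/statistic distinction can be made concrete with a small simulation sketch. The population below is a hypothetical set of 10,000 values invented for illustration (the size, and the mean 280 and standard deviation 50, are assumptions, chosen to echo the sale price example later in this section); the parameter is a summary of the whole population, while the statistic is the analogous summary of a random sample of 30 values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "population" of 10,000 values (assumed for illustration).
population = rng.normal(loc=280, scale=50, size=10_000)

# Parameter: a summary measure of the entire population
# (in practice unknown, since we rarely observe the whole population).
population_mean = population.mean()

# Statistic: the same kind of summary measure, but computed
# from a random sample drawn from that population.
sample = rng.choice(population, size=30, replace=False)
sample_mean = sample.mean()

print(population_mean)  # close to 280
print(sample_mean)      # close to, but not equal to, the parameter
```

In practice we only ever see the second number; the rest of this section is about how far from the parameter such a statistic is likely to be.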

      However, it is not enough to just have sample statistics (such as the sample mean) that average out (over a large number of hypothetical samples) to the correct target (i.e., the population mean). We would also like sample statistics that would have “low” variability from one hypothetical sample to another. At the very least we need to be able to quantify this variability, known as sampling uncertainty. One way to do this is to consider the sampling distribution of a statistic, that is, the distribution of values of a statistic under repeated (hypothetical) samples. Again, we can use results from probability theory to tell us what these sampling distributions are. So, all we need to do is take a single random sample, calculate a statistic, and we will know the theoretical sampling distribution of that statistic (i.e., we will know what the statistic should average out to over repeated samples, and how much the statistic should vary over repeated samples).
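The idea of repeated hypothetical samples can be simulated directly. The sketch below (with an arbitrary assumed population, mean 100, standard deviation 15, and sample size 25) draws 10,000 samples, computes the mean of each, and shows that the sample means cluster around the population mean with much less variability than individual values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population parameters and sample size (assumed for illustration).
pop_mean, pop_sd, n = 100.0, 15.0, 25

# 10,000 hypothetical samples of n values each; one sample mean per row.
sample_means = rng.normal(pop_mean, pop_sd, size=(10_000, n)).mean(axis=1)

# The sample means average out to the population mean...
print(sample_means.mean())  # close to 100
# ...and their variability (the sampling uncertainty) is much smaller than
# the population standard deviation: close to 15/sqrt(25) = 3, not 15.
print(sample_means.std())   # close to 3
```

The second printed number quantifies exactly the "sampling uncertainty" described above: how much the statistic varies from one hypothetical sample to another.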

      1.4.1 Central limit theorem—normal version

      Suppose that a random sample of n data values, represented by Y1, Y2, ..., Yn, comes from a population that has a mean of E(Y) and a standard deviation of SD(Y). The sample mean, mY, is a pretty good estimate of the population mean, E(Y). This textbook uses mY for the sample mean of Y rather than the traditional Ȳ (“Y‐bar”), which, in the author's experience, is unfamiliar and awkward for many students. The very famous sampling distribution of this statistic derives from the central limit theorem. This theorem states that under very general conditions, the sample mean has an approximate normal distribution with mean E(Y) and standard deviation SD(Y)/√n (under repeated sampling). In other words, if we were to take a large number of random samples of n data values and calculate the mean for each sample, the distribution of these sample means would be a normal distribution with mean E(Y) and standard deviation SD(Y)/√n. Since the mean of this sampling distribution is E(Y), mY is an unbiased estimate of E(Y).
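The three claims of the theorem (mean E(Y), standard deviation SD(Y)/√n, and approximate normality) can each be checked by simulation. The sketch below uses E(Y) = 280, SD(Y) = 50, and n = 30, matching the sale price setting of this section; the choice of 20,000 repeated samples is an arbitrary assumption for the simulation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Population mean E(Y), standard deviation SD(Y), and sample size n.
e_y, sd_y, n = 280.0, 50.0, 30

# 20,000 hypothetical samples of n values each; one sample mean per row.
means = rng.normal(e_y, sd_y, size=(20_000, n)).mean(axis=1)

# Theoretical standard deviation of the sample mean under the CLT.
se = sd_y / np.sqrt(n)

print(means.mean())  # close to E(Y) = 280
print(means.std())   # close to SD(Y)/sqrt(n) = 50/sqrt(30), about 9.13

# Rough normality check: for a normal distribution, about 95% of sample
# means should fall within 2 standard deviations of E(Y).
within = np.mean(np.abs(means - e_y) < 2 * se)
print(within)        # close to 0.95
```

Note that the simulation never needs more than one line of theory: the empirical mean, spread, and shape of the 20,000 sample means all match what the central limit theorem predicts.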

      An amazing fact about the central limit theorem is that there is no need for the population itself to be normal (remember that we had to assume this for the calculations in Section 1.3). However, the more symmetric the distribution of the population, the better is the normal approximation for the sampling distribution of the sample mean. Also, the approximation tends to be better the larger the sample size n.
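To illustrate that the population itself need not be normal, the sketch below draws repeated samples from a strongly right‐skewed exponential population (the mean of 10 and sample size of 40 are assumed values for illustration). The individual values are far from normal, yet the sample means come out roughly symmetric around the population mean.

```python
import numpy as np

rng = np.random.default_rng(7)

# Right-skewed population: exponential with mean 10 (skewness = 2).
pop_mean, n = 10.0, 40

# 20,000 hypothetical samples of n values each; one sample mean per row.
means = rng.exponential(scale=pop_mean, size=(20_000, n)).mean(axis=1)

print(means.mean())  # close to the population mean, 10

# Skewness of the sample means, computed as the standardized third moment.
# For the means it shrinks toward 0 (theory: 2/sqrt(n), about 0.32 here),
# far smaller than the population skewness of 2.
z = (means - means.mean()) / means.std()
print((z ** 3).mean())
```

Increasing n shrinks this residual skewness further, which is the "approximation tends to be better the larger the sample size" point made above.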

      First, we need to get some notation straight. In this section, we are not thinking about the specific sample mean we got for our actual sample of 30 sale prices, mY. Rather we are imagining a list of potential sample means from a population distribution with mean 280 and standard deviation 50—we will call a potential sample mean in this list MY. From the central limit theorem, the sampling distribution of