Iain Pardoe

Applied Regression Modeling



Percentiles: 25th = 241.3750; 50th = 278.9500; 75th = 325.8750

      There are many other methods—numerical and graphical—for summarizing data. For example, another popular graph besides the histogram is the boxplot; see Chapter 6 (www.wiley.com/go/pardoe/AppliedRegressionModeling3e) for some examples of boxplots used in case studies.

      While the methods of the preceding section are useful for describing and displaying sample data, the real power of statistics is revealed when we use samples to give us information about populations. In this context, a population is the entire collection of objects of interest, for example, the sale prices for all single‐family homes in the housing market represented by our dataset. We would like to know more about this population to help us make a decision about which home to buy, but the only data we have is a random sample of 30 sale prices.

      Nevertheless, we can employ “statistical thinking” to draw inferences about the population of interest by analyzing the sample data. In particular, we use the notion of a model—a mathematical abstraction of the real world—which we fit to the sample data. If this model provides a reasonable fit to the data, that is, if it can approximate the manner in which the data vary, then we assume it can also approximate the behavior of the population. The model then provides the basis for making decisions about the population, by, for example, identifying patterns, explaining variation, and predicting future values. Of course, this process can work only if the sample data can be considered representative of the population. One way to address this is to randomly select the sample from the population. There are other more complex sampling methods that are used to select representative samples, and there are also ways to make adjustments to models to account for known nonrandom sampling. However, we do not consider these here—any good sampling textbook should cover these issues.

      Since the real world can be extremely complicated (in the way that data values vary or interact together), models are useful because they simplify problems so that we can better understand them (and then make more effective decisions). On the one hand, we therefore need models to be simple enough that we can easily use them to make decisions, but on the other hand, we need models that are flexible enough to provide good approximations to complex situations. Fortunately, many statistical models have been developed over the years that provide an effective balance between these two criteria. One such model, which provides a good starting point for the more complicated models we consider later, is the normal distribution.
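      As a rough illustration of using a random sample to learn about a population, the following sketch simulates a population of 1,000 sale prices from a normal distribution and then draws a random sample of 30 from it. The population mean of 280 and standard deviation of 50 (in thousands of dollars) are hypothetical values chosen only for this illustration, as are the variable names and the random seed; they are not taken from the dataset.

```python
import random
import statistics

# Hypothetical population parameters (thousands of dollars); assumed for illustration only.
POP_MEAN = 280.0
POP_SD = 50.0

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Simulated population of 1,000 sale prices drawn from a normal distribution.
population = [rng.gauss(POP_MEAN, POP_SD) for _ in range(1000)]

# Simple random sample of 30 prices, selected without replacement.
sample = rng.sample(population, 30)

# Compare sample summaries with the corresponding population summaries.
print(f"population mean: {statistics.mean(population):.1f}")
print(f"sample mean:     {statistics.mean(sample):.1f}")
print(f"population sd:   {statistics.pstdev(population):.1f}")
print(f"sample sd:       {statistics.stdev(sample):.1f}")
```

      Under these assumptions, the sample mean and standard deviation will typically land near, but not exactly on, the population values, which is the sense in which a random sample can be considered representative.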

Figure: Histogram for a simulated population of 1,000 sale prices, together with a normal density curve.

Figure: Standard normal density curve together with a shaded area of 0.475 between a=0 and b=1.96, which represents the probability that a standard normal random variable lies between 0 and 1.96.
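
The area of 0.475 between 0 and 1.96 under the standard normal density curve can be checked numerically. The sketch below uses the standard relationship between the normal cumulative distribution function and the error function; the function names are hypothetical helpers introduced here for illustration.

```python
import math

def standard_normal_cdf(z: float) -> float:
    """Return P(Z <= z) for a standard normal Z, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_between(a: float, b: float) -> float:
    """Return P(a < Z < b), the area under the density curve between a and b."""
    return standard_normal_cdf(b) - standard_normal_cdf(a)

area = prob_between(0.0, 1.96)
print(round(area, 3))  # 0.475
```

By symmetry, the area between -1.96 and 1.96 is twice this value, or approximately 0.95.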