Samprit Chatterjee

Handbook of Regression Analysis With Applications in R


Скачать книгу

slope coefficients are very similar to those from the model using all predictors (which is not surprising given the low collinearity in the data), so the interpretations of the estimated coefficients on page 17 still hold to a large extent. A plot of the residuals versus the fitted values and a normal plot of the residuals (Figure 2.2) look fine, and similar to those for the model using all six predictors in Figure 1.5; plots of the residuals versus each of the predictors in the model are similar to those in Figure 1.6, so they are not repeated here.

Image described by caption.

      Identifying and correcting for this uncertainty is a difficult problem, and an active area of research, and will be discussed further in Chapter 14. There are, however, a few things practitioners can do. First, it is not appropriate to emphasize too strongly the single “best” model; any model that has similar criteria values (such as images or images) to those of the best model should be recognized as being one that could easily have been chosen as best based on a different sample from the same population, and any implications of such a model should be viewed as being as valid as those from the best model. Further, one should expect that images‐values for the predictors included in a chosen model are potentially smaller than they should be, so taking a conservative attitude regarding statistical significance is appropriate. Thus, for the chosen three‐predictor model summarized on page 35, number of bathrooms and living area are likely to correspond to real effects, but the reality of the year built effect is more questionable.

      There is a straightforward way to get a sense of the predictive power of a chosen model if enough data are available. This can be evaluated by holding out some data from the analysis (a holdout or validation sample), applying the selected model from the original data to the holdout sample (based on the previously estimated parameters, not estimates based on the new data), and then examining the predictive performance of the model. If, for example, the standard deviation of the errors from this prediction is not very different from the standard error of the estimate in the original regression, chances are making inferences based on the chosen model will not be misleading. Similarly, if a (say) images prediction interval does not include roughly images of the new observations, that indicates poorer‐than‐expected predictive performance on new data.

Image described by caption.

      where images is based on the chosen “best” model, and images is the number of predictors in the most complex model examined, in the sense of most predictors (Ye, 1998). Clearly, if very complex models are included among the set of candidate models, images can be much larger than the standard error of the estimate from the chosen model, with correspondingly wider prediction intervals. This reinforces the benefit of limiting the set of candidate models (and the complexity of the models in that set) from the start. In this case images, so the effect is not that pronounced.