Iain Pardoe

Applied Regression Modeling



319.0 319.9 324.5 330.0 336.0 339.0 340.0 355.0 359.9 359.9

       1 | 6
       2 | 0011344
       2 | 5666777899
       3 | 002223444
       3 | 666

      In this plot, the decimal point is two digits to the right of the stem. So, the “1” in the stem and the “6” in the leaf represent 160 or, because of rounding, any number between 155 and 164.9. In particular, this pair represents the lowest price in the dataset of 155.5 (thousand dollars). The next part of the graph shows two prices between 195 and 204.9, two prices between 205 and 214.9, one price between 225 and 234.9, two prices between 235 and 244.9, and so on. A stem‐and‐leaf plot can easily be constructed by hand for small datasets such as this, or it can be constructed automatically using statistical software. The appearance of the plot can depend on the type of statistical software used; this particular plot was constructed using R statistical software (as are all the plots in this book). Instructions for constructing stem‐and‐leaf plots are available as computer help #13 in the software information files available from the book website at www.wiley.com/go/pardoe/AppliedRegressionModeling3e.
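      As a rough illustration of the construction just described, the following Python sketch rounds each price to the nearest 10 and splits each stem into a row for leaves 0 to 4 and a row for leaves 5 to 9. The book's own plot was produced with R; the use of Python, the function name `stem_and_leaf`, and its `leaf_unit` parameter are assumptions made for this example. Only the ten largest prices listed above are used as input, so the output reproduces just the tail of the plot.

```python
from collections import defaultdict


def stem_and_leaf(values, leaf_unit=10):
    """Return stem-and-leaf rows, splitting each stem into leaves 0-4 and 5-9.

    Hypothetical helper for illustration; assumes all values are positive.
    """
    rows = defaultdict(list)
    for v in sorted(values):
        # Round half up to the nearest multiple of leaf_unit (155 -> 16, 164.9 -> 16).
        rounded = int(v / leaf_unit + 0.5)
        stem, leaf = divmod(rounded, 10)
        # Key on (stem, upper-half flag) so each stem can occupy two rows.
        rows[(stem, leaf >= 5)].append(str(leaf))
    return [f"{stem} | {''.join(leaves)}" for (stem, _), leaves in sorted(rows.items())]


# The ten largest sample prices (thousand dollars) listed in the text.
prices = [319.0, 319.9, 324.5, 330.0, 336.0, 339.0, 340.0, 355.0, 359.9, 359.9]

for row in stem_and_leaf(prices):
    print(row)
```

      For these ten values the sketch prints the rows `3 | 2223444` and `3 | 666`, which agree with the corresponding leaves in the last two rows of the plot above.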

      The overall impression from this graph is that the sample prices range from the mid‐150s to the mid‐350s, with some suggestion of clustering around the high 200s. Perhaps the sample represents quite a range of moderately priced homes, but with no very cheap or very expensive homes. This type of observation often arises throughout a data analysis: the data begin to tell a story and suggest possible explanations. A good analysis is usually not the end of the story since it will frequently lead to other analyses and investigations. For example, in this case, we might surmise that we would be unlikely to find a home priced at much less than $150,000 in this market, but perhaps a realtor might know of a nearby market with more affordable housing.

Graph depicts the histogram for home prices example.

      Histograms can convey very different impressions depending on the bin width, start point, and so on. Ideally, we want a large enough bin size to avoid excessive sampling “noise” (a histogram with many bins that looks very wiggly), but not so large that it is hard to see the underlying distribution (a histogram with few bins that looks too blocky). A reasonable pragmatic approach is to use the default settings in whichever software package we are using, and then perhaps to create a few more histograms with different settings to check that we are not missing anything. There are more sophisticated methods, but for the purposes of the methods in this book, this should suffice.
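      To see how the bin width and start point can change the impression a histogram gives, the following Python sketch counts the same values under two different settings. The function name `bin_counts` and the particular start points and widths are assumptions chosen for illustration, and the input is only the ten largest prices listed earlier rather than the full sample.

```python
def bin_counts(values, start, width):
    """Count values in consecutive bins [start + k*width, start + (k+1)*width).

    Hypothetical helper for illustration; assumes every value is >= start.
    """
    indices = [int((v - start) // width) for v in values]
    counts = [0] * (max(indices) + 1)
    for i in indices:
        counts[i] += 1
    return counts


# The ten largest sample prices (thousand dollars) listed in the text.
prices = [319.0, 319.9, 324.5, 330.0, 336.0, 339.0, 340.0, 355.0, 359.9, 359.9]

narrow = bin_counts(prices, start=310, width=10)  # five bins of width 10
wide = bin_counts(prices, start=300, width=25)    # three bins of width 25

print("width 10:", narrow)
print("width 25:", wide)
```

      With the narrower bins the counts are 2, 1, 3, 1, 3, which looks fairly wiggly, while the wider bins give 3, 4, 3, which looks smoother but hides detail.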

       The sample mean, $m_Y$, is a measure of the “central tendency” of the data $Y$‐values.

       The sample standard deviation, $s_Y$, is a measure of the spread or variation in the data $Y$‐values.

      We will not bother here with the formulas for these sample statistics. Since almost all of the calculations necessary for learning the material covered by this book will be performed by statistical software, the book only contains formulas when they are helpful in understanding a particular concept or provide additional insight to interested readers.

      We can calculate sample standardized $Z$‐values from the data $Y$‐values by subtracting the sample mean, $m_Y$, and dividing by the sample standard deviation, $s_Y$:

$$Z = \frac{Y - m_Y}{s_Y}$$

      Sometimes, it is useful to work with sample standardized $Z$‐values rather than the original data $Y$‐values since sample standardized $Z$‐values have a sample mean of 0 and a sample standard deviation of 1. Try using statistical software to calculate sample standardized $Z$‐values for the home prices data, and then check that the mean and standard deviation of the $Z$‐values are 0 and 1, respectively.
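      One way to carry out this check is sketched below using Python's standard `statistics` module, whose `stdev` function computes the sample standard deviation (with divisor one less than the sample size). The choice of Python and the variable names are assumptions for this example, and the input is only the ten largest prices listed earlier, so the mean and standard deviation printed first will not match those of the full sample.

```python
import statistics

# The ten largest sample prices (thousand dollars) listed in the text.
prices = [319.0, 319.9, 324.5, 330.0, 336.0, 339.0, 340.0, 355.0, 359.9, 359.9]

sample_mean = statistics.mean(prices)
sample_sd = statistics.stdev(prices)  # sample standard deviation (divisor n - 1)

# Standardize: subtract the sample mean, then divide by the sample standard deviation.
z_values = [(y - sample_mean) / sample_sd for y in prices]

print("mean of prices:", sample_mean)
print("sd of prices:", sample_sd)
print("mean of standardized values:", round(statistics.mean(z_values), 10))
print("sd of standardized values:", round(statistics.stdev(z_values), 10))
```

      Up to floating-point rounding, the standardized values have a mean of 0 and a standard deviation of 1 regardless of which data values are supplied.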

      Statistical software can also calculate additional sample statistics, such as:

       the median (another measure of central tendency, but which is less sensitive than the sample mean to very small or very large values in the data)—half the dataset values are smaller than this quantity and half are larger;

       the minimum and maximum;

       percentiles or quantiles such as the 25th percentile—this is the smallest value that is larger than 25% of the values in the dataset (i.e., 25% of the dataset values are smaller than the 25th percentile, while 75% of the dataset values are larger).
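      These additional statistics can be obtained along the following lines with Python's standard `statistics` module. This is only a sketch: the choice of Python is an assumption, the input is the ten largest prices listed earlier rather than the full sample, and software packages differ in exactly how they interpolate percentiles, so the 25th percentile from `statistics.quantiles` (which uses its "exclusive" method by default) may differ slightly from the value reported by other software.

```python
import statistics

# The ten largest sample prices (thousand dollars) listed in the text.
prices = [319.0, 319.9, 324.5, 330.0, 336.0, 339.0, 340.0, 355.0, 359.9, 359.9]

median = statistics.median(prices)  # middle value (average of the two middle values here)
smallest = min(prices)
largest = max(prices)

# quantiles(..., n=4) returns the three quartile cut points;
# the first one is the 25th percentile.
q1, q2, q3 = statistics.quantiles(prices, n=4)

print("median:", median)
print("minimum:", smallest)
print("maximum:", largest)
print("25th percentile:", q1)
```

      For these ten values the median is 337.5, the average of the fifth and sixth ordered values (336.0 and 339.0).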



Sample size, $n$     Valid          30
                     Missing         0
Mean                          278.6033
Median                        278.9500
Standard deviation             53.8656
Minimum                       155.5000
Maximum                       359.9000