Ron Cody

A Gentle Introduction to Statistics Using SAS Studio



have produced with Lauree, and she is a delight to work with. Thank you, Lauree!

      Because this is a book about statistics and SAS, I had a team of technical reviewers who had expertise in either statistics, SAS, or both. Once again I need to give a huge shout-out to Paul Grant. I’m pretty sure Paul has reviewed every book I have published with SAS Press. Not only does he carefully read every word, he also runs all of the programs to be sure the programs that you can download from my SAS author site match the programs printed in the book. That is an amazing amount of work, and I don’t understand why he keeps coming back for more. I have known Jeff Smith for over 40 years and co-authored two books with him. Jeff makes sure that my sometimes loose discussion of statistical topics will not upset “real” statisticians. Holly Sweeney is both a statistician and SAS expert. I was so fortunate to have her carefully read every word of this book. Her critiques and comments really helped make this a better book. My last technical reviewer, Amy Peters, is a developer in the SAS Studio group and helped me in so many ways. It’s such a pleasure to have an expert like Amy ready to assist anytime I call or email. So, a hearty thanks to Paul, Jeff, Holly, and Amy!

      There is a team made up of Denise Jones (technical publishing specialist), Robert Harris (graphics designer), Suzanne Morgen (copy editor), and Melissa Hannah (digital marketing specialist) who all played key roles in getting this book to press (or e-Book). It really takes a team to produce a book like this, especially when it needs to be available in print form and several different electronic media. Thank you all!

      I already mentioned Robert Harris (graphics designer) but he needs special thanks for creating three different cover designs for me to choose from. I liked all three, but the one you see here was my favorite (and my wife’s favorite also).

      Speaking of wives, thank you Jan for your support and for making me take a break once in a while. You even took my picture for the back cover!

      Chapter 1: Descriptive and Inferential Statistics

       Overview

       Descriptive Statistics

       Inferential Statistics

       Summary of Statistical Terms

      Many people have a misunderstanding of what statistics entails. The trouble stems from the fact that the word “statistics” has several different meanings. One meaning relates to numbers such as batting averages and political polls. When I tell people that I’m a statistician, they assume that I’m good with numbers. Actually, without a computer I would be lost.

      The other meaning, the topic of this book, is to describe collections of numbers such as test scores and to describe properties of these numbers. This subset of statistics is known as descriptive statistics. Another subset of statistics, inferential statistics, takes up a major portion of this book. One of the goals of inferential statistics is to determine whether your experimental results are “statistically significant.” In other words, what is the probability that the result that you obtained could have occurred by chance, rather than reflecting an actual effect?

      I am sure every reader of this book is already familiar with some aspects of descriptive statistics. From early in your education, you were assigned a grade in a course, based on your average. Averages (there are several types) describe what statisticians refer to as measures of location or measures of central tendency. Most basic statistics books describe three indicators of location: the mean, median, and mode. To compute a mean, you add up all the numbers and divide by how many numbers you have. For example, if you took five tests and your scores were 80, 82, 90, 96 and 96, the mean would be (80 + 82 + 90 + 96 + 96)/5 or 88.8. To compute a median, you arrange the numbers in order from lowest to highest and then find the middle—half the numbers will be below the median and half of the numbers will be above the median. In the example of the five test scores (notice that they are already in order from lowest to highest), the median is 90. If you have an even number of numbers, one method of computing the median is to average the two numbers in the middle. The last measure of central tendency is called the mode. It is defined as the most common number. In this example, the mode is 96 because it occurs more than any other number. If all the numbers are different, the mode is not defined.
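      The three measures of location described above can be checked with a few lines of code. The following is a minimal sketch in Python, not the SAS Studio workflow used in this book; the variable names are chosen here for illustration only.

```python
import statistics

# The five test scores from the example in the text
scores = [80, 82, 90, 96, 96]

mean_score = statistics.mean(scores)      # add up the scores, divide by how many
median_score = statistics.median(scores)  # middle value once the scores are sorted
mode_score = statistics.mode(scores)      # most common value

print(mean_score, median_score, mode_score)  # 88.8 90 96

# With an even number of values, the two middle values are averaged
print(statistics.median([80, 82, 90, 96]))  # 86.0
```

      One caution: when every value is different, Python's statistics.mode (version 3.8 and later) returns the first value rather than reporting that the mode is undefined, so it does not follow the convention described above in that case.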

      Besides knowing the mean or median (the mode is rarely used), you can also compute several measures of dispersion. Dispersion describes how spread out the numbers are. One very simple measure of dispersion is the range, defined as the difference between the highest and lowest value. In the test score example, the range is 96 – 80 = 16. This is not a very good indicator of dispersion because it is computed using only two numbers—the highest and lowest value.

      The most common measure of dispersion is called the standard deviation. The computation is a bit complicated, but a good way to think about the standard deviation is that it is similar to the average amount each of the numbers differs from the mean, treating each of the differences as a positive number. The actual computation of a standard deviation is to take the difference of each number from the mean, square all the differences (that makes all the values positive), add up all the squared differences, divide by one less than the number of values, and then take the square root of this value. Because this calculation is a lot of work, we will let the computer do the calculation rather than doing it by hand.
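      To make the steps of the calculation concrete, here is a minimal sketch in Python (again, not SAS) that follows the recipe above on the five test scores. The function name sample_std_dev is made up for this illustration; the result is compared against Python's built-in statistics.stdev as a check.

```python
import math
import statistics

scores = [80, 82, 90, 96, 96]

def sample_std_dev(values):
    """Standard deviation computed step by step, as described in the text."""
    mean = sum(values) / len(values)
    # Difference of each number from the mean, squared so all are positive
    squared_diffs = [(v - mean) ** 2 for v in values]
    # Add up the squared differences and divide by one less than the count
    variance = sum(squared_diffs) / (len(values) - 1)
    # The square root of that value is the standard deviation
    return math.sqrt(variance)

std_dev = sample_std_dev(scores)
print(round(std_dev, 3))                   # 7.563
print(round(statistics.stdev(scores), 3))  # 7.563, the library agrees
```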

      Figure 1.1 below shows part of the output from SAS when you ask it to compute descriptive statistics on the five test scores:

      Figure 1.1: Example of Output from SAS Studio


      This shows three measures of location and several measures of dispersion (labeled Variability in the output). The value labeled “Std Deviation” is the standard deviation described previously, and the range is the same value that you calculated. The variance is the standard deviation squared, and it is used in many of the statistical tests that we discuss in this book.

      Descriptive statistics includes many graphical techniques such as histograms and scatter plots that you will learn about in the chapter on SAS Studio descriptive statistics.

      Let’s imagine an experiment where you want to test if drinking regular coffee has an effect on heart rate. You want to do this experiment because you believe caffeine might increase heart rate, but you are not sure. To start, you get some volunteers who are willing to drink regular coffee or decaf coffee and have their heart rates measured. The reason for including decaf coffee in the experiment is so that you can separate the placebo effect from a possible real effect. Because some of the volunteers may have a preconceived notion that coffee will increase their heart rate, their heart rate might increase because of a psychological reason, rather than the chemical effect of caffeine in the coffee.

      You divide your 20 volunteers into two groups, one to drink regular coffee and one to drink decaf. This is done in a random fashion, and neither the volunteers nor the person measuring the heart rates knows which type of coffee each volunteer is drinking. This type of experiment is referred to as a double-blind, placebo-controlled, clinical trial. We will discuss this design and several others in the next chapter.
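      Random assignment is easy to sketch in code. The Python example below is an illustration only: the volunteer labels, the function name assign_groups, and the seed value are all hypothetical, and the sketch covers only the random split into two groups of 10, not the blinding.

```python
import random

def assign_groups(volunteers, seed=None):
    """Shuffle the volunteers and split them into two equal-sized groups."""
    rng = random.Random(seed)    # a seed makes the assignment reproducible
    shuffled = list(volunteers)  # copy so the original list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical labels for the 20 volunteers: V01 through V20
volunteers = [f"V{i:02d}" for i in range(1, 21)]
regular, decaf = assign_groups(volunteers, seed=2024)

print(len(regular), len(decaf))  # 10 10
```

      Because the order is shuffled before the split, every volunteer has the same chance of ending up in either group.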

      Suppose the mean heart rate in the regular coffee group is 76 and the mean heart rate in the decaf (placebo) group is 72. Can you conclude that caffeine increases heart rate? The answer is “maybe.” Why is that? Suppose that caffeine had no effect on heart rate (this is called the null hypothesis). If that were true, and you measured the mean heart rate in two groups of 10 subjects, you would still expect the two means to differ somewhat due to chance or natural variation. What a statistical test does is to compute the probability that you would obtain a difference as large as or larger than the one you measured (4 points in this example) by chance alone if the null hypothesis were true.
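      One way to get a feel for this idea is to simulate the null hypothesis directly. The Python sketch below is not a formal statistical test and is not how SAS computes its results. It assumes, purely for illustration, that heart rates in both groups come from the same normal distribution with a mean of 74 and a standard deviation of 8 (both values are invented here, not taken from any data), and it counts how often two groups of 10 differ by 4 points or more in either direction.

```python
import random

def chance_of_difference(observed_diff, n_per_group=10, mean=74.0, sd=8.0,
                         trials=10_000, seed=1):
    """Estimate how often two same-population groups differ by at least
    observed_diff (in either direction) when there is no real effect.

    mean and sd are hypothetical values chosen only for illustration.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Both groups are drawn from the SAME distribution: the null hypothesis
        group_a = [rng.gauss(mean, sd) for _ in range(n_per_group)]
        group_b = [rng.gauss(mean, sd) for _ in range(n_per_group)]
        diff = sum(group_a) / n_per_group - sum(group_b) / n_per_group
        if abs(diff) >= observed_diff:
            hits += 1
    return hits / trials

p_four = chance_of_difference(4)   # the 4-point difference from the example
p_eight = chance_of_difference(8)  # a larger difference, for comparison
print(p_four, p_eight)
```

      Under these assumed values, a larger observed difference turns up by chance less often than a smaller one, which is why a bigger gap between the group means is stronger evidence against the null hypothesis. The estimated probabilities depend entirely on the assumed standard deviation and group size.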