Ron Cody, EdD

SAS Statistics by Example



Computing a 95% Confidence Interval and the Standard Error

      A 95% confidence interval for the mean (often abbreviated as 95% CI) is useful in helping you decide how well your sample mean estimates the mean of the population from which you took your sample. Another measure, the standard error, is also useful for the same reason. This program shows how to compute both:

title "Computing a 95% Confidence Interval and the Standard Error";
proc means data=example.Blood_Pressure n mean clm stderr maxdec=3;
   class Drug;
   var SBP DBP;
run;

      In this example, some of the options that were used previously have been omitted to reduce the size of the output. This program also uses the option CLM (confidence limit for the mean) to request the interval. SAS uses this option because the upper and lower bounds on a confidence interval are also referred to as confidence limits. The option STDERR requests that the standard error also be listed in the output, which follows:

Image492.png
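
      To see how the CLM and STDERR values relate to each other, the following sketch recomputes the 95% limits by hand as the mean plus or minus a t critical value times the standard error. It is an illustration only and not part of the original program: the data set names Stats and Check_CI and the variable names N, Mean, StdErr, Lower_CLM, Upper_CLM, T_Crit, Lower_Hand, and Upper_Hand are made up for this example, and only SBP is analyzed.

```sas
* Save N, mean, standard error, and the CLM limits for each Drug group;
proc means data=example.Blood_Pressure noprint nway;
   class Drug;
   var SBP;
   output out=Stats n=N mean=Mean stderr=StdErr
          lclm=Lower_CLM uclm=Upper_CLM;
run;

data Check_CI;
   set Stats;
   * Two-sided 95% critical value from the t distribution with N-1 df;
   T_Crit = tinv(0.975, N - 1);
   Lower_Hand = Mean - T_Crit*StdErr;
   Upper_Hand = Mean + T_Crit*StdErr;
run;

title "Checking the 95% Confidence Limits by Hand";
proc print data=Check_CI noobs;
   var Drug N Mean StdErr Lower_CLM Lower_Hand Upper_CLM Upper_Hand;
run;
```

      For each drug group, Lower_Hand and Upper_Hand should agree with Lower_CLM and Upper_CLM, which shows that the confidence limits are built directly from the standard error.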

      Another SAS procedure, PROC UNIVARIATE, produces output that is similar to the output from PROC MEANS. However, PROC UNIVARIATE provides additional statements that produce histograms and probability plots.

      The following program demonstrates these features of PROC UNIVARIATE:

title "Demonstrating PROC UNIVARIATE";
proc univariate data=example.Blood_Pressure;
   id Subj;
   var SBP DBP;
   histogram;
   probplot / normal(mu=est sigma=est);
run;

      Program 2.5 demonstrates a typical use of PROC UNIVARIATE—to produce descriptive statistics and some graphical output. Note that in order to generate the histogram and probability plots, you need to have SAS/GRAPH installed.

      The ID statement is not necessary, but it is particularly useful with PROC UNIVARIATE. With this statement, you can specify a variable that identifies each observation. In this example, Subj is the ID variable.

      The VAR statement works with PROC UNIVARIATE in the same way that it works with PROC MEANS—it enables you to list the variables that you want to analyze.

      The HISTOGRAM statement requests histograms. You can follow the HISTOGRAM statement with a list of variables. If you omit this list of variables, the procedure produces a histogram for every variable that you listed on the VAR statement.
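
      As a small variation on the program above (a sketch, not taken from the original text), you could list only SBP on the HISTOGRAM statement. DBP would still appear in the printed statistics, but no histogram would be drawn for it.

```sas
title "Requesting a Histogram for SBP Only";
proc univariate data=example.Blood_Pressure;
   var SBP DBP;
   * Only SBP is listed here so no histogram is drawn for DBP;
   histogram SBP;
run;
```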

      Finally, the PROBPLOT statement requests a probability plot. This plot shows percentiles from a theoretical distribution on the x-axis and data values on the y-axis. This example program selects the normal distribution using the NORMAL option after the forward slash. If your data values are normally distributed, the points on this plot will form a straight line. To make it easier to see deviations from normality, the option NORMAL also produces a reference line where your data values would fall if they came from a normal distribution. When you use the NORMAL option, you also need to specify a mean and standard deviation. Specify these by using the keyword MU= to specify the mean and the keyword SIGMA= to specify a standard deviation. The keyword EST tells the procedure to use the data values to estimate the mean and standard deviation, instead of some theoretical value.
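
      If you would rather compare your data to a normal distribution with a specific mean and standard deviation, you can supply numbers in place of EST. The sketch below assumes a reference distribution with a mean of 120 and a standard deviation of 10; these two values are arbitrary and chosen only for illustration.

```sas
title "Probability Plot with a Specified Reference Line";
proc univariate data=example.Blood_Pressure;
   var SBP;
   * MU= and SIGMA= values below are arbitrary illustrations, not estimates;
   probplot SBP / normal(mu=120 sigma=10);
run;
```

      With fixed values, the reference line no longer adapts to your sample, so points can fall systematically above or below the line even when the data values are normally distributed.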

      Notice the slash between the word PROBPLOT and NORMAL. Using a slash here follows standard SAS syntax: if you want to specify options for any statement in a PROC step, follow the statement keyword with a slash. (Note: It took the author several years to figure this out for himself.)

      To save space, the following output shows only the results for the variable SBP. Each section is presented separately, with a discussion following each section.

Image500.png Image509.png Image518.png

      Circle1.png The first section of the output contains some useful and some not-so-useful values. For example, you see the number of nonmissing values that were used to compute the statistics (N), the mean, and the standard deviation.

      Also in this section, you see skewness and kurtosis, measures that show deviations from normality. A skewness value of 0 indicates a symmetric distribution about the mean; positive skewness values indicate a right-skewed distribution, and negative values indicate a left-skewed distribution. Left and right refer to the direction in which the elongated tail points. The value -.145 in this listing is very close to 0 and shows that neither tail of the SBP distribution is noticeably elongated.

      Kurtosis values indicate whether the distribution is more peaked than or flatter than a normal distribution. The value that SAS computes for kurtosis (also known as relative kurtosis) is scaled so that you get the value 0 for a normal distribution. Positive values for kurtosis indicate both that the distribution is too peaked (leptokurtic) and that the tails are too heavy. Negative values for kurtosis indicate that the distribution is too flat (platykurtic) and that the tails are too light. The kurtosis value for SBP (-.535) indicates that the distribution of SBP is reasonably consistent with a normal distribution.
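
      To get a feel for the sign of these two statistics, you can compute them for simulated data whose shape you already know. The sketch below is illustrative only: the data set Skew_Demo, its variables Symmetric and Right_Skewed, the seed, and the sample size are all made up and have nothing to do with the Blood_Pressure data.

```sas
* Simulated data for illustration only - not the Blood_Pressure data;
data Skew_Demo;
   call streaminit(12345);
   do i = 1 to 1000;
      Symmetric    = rand('normal');
      Right_Skewed = rand('exponential');
      output;
   end;
   drop i;
run;

title "Skewness and Kurtosis for Simulated Data";
proc means data=Skew_Demo n mean skewness kurtosis maxdec=3;
   var Symmetric Right_Skewed;
run;
```

      In theory, skewness and relative kurtosis are both 0 for a normal distribution and are 2 and 6 for an exponential distribution, so you should see values near 0 for Symmetric and clearly positive values for Right_Skewed. Sample values will not match the theoretical ones exactly.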

      The coefficient of variation (often abbreviated CV) expresses the standard deviation as a percent of the mean. This output shows that the standard deviation is about 8.38% of the mean. Finally, the value at the bottom right of this section is the standard error of the mean (1.46), which gives you an estimate of how accurately this sample has estimated the population mean.
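
      Written as formulas, these two statistics are as follows. The symbols are notation introduced here for illustration: s is the sample standard deviation, x-bar is the sample mean, and n is the number of nonmissing values.

```latex
\[
\mathrm{CV} = 100 \times \frac{s}{\bar{x}},
\qquad
\mathrm{SE}(\bar{x}) = \frac{s}{\sqrt{n}}
\]
```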

      The remaining values in the section are less useful. This author believes that they were originally included so that you could use them in hand calculations of other statistics that were not computed by SAS. The sum of weights is useful only if you use a WEIGHT statement with PROC UNIVARIATE; with a WEIGHT statement you select a variable that weights the SBP values. In this example, because you did not specify any weights, the sum of weights is equal to the number of observations (all the weights are equal to 1). The uncorrected SS is the sum of squares of all the data values. To compute the corrected SS, you subtract the mean from each value, square each difference, and then add up the squares. This value is the same as the numerator of the sample variance.
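
      The relationship among these quantities can be summarized as follows, assuming all weights are equal to 1. The abbreviations USS and CSS and the symbols x_i (a data value), x-bar (the mean), n (the number of observations), and s-squared (the sample variance) are notation chosen for this sketch.

```latex
\[
\mathrm{USS} = \sum_{i=1}^{n} x_i^{2},
\qquad
\mathrm{CSS} = \sum_{i=1}^{n} (x_i - \bar{x})^{2} = \mathrm{USS} - n\,\bar{x}^{2},
\qquad
s^{2} = \frac{\mathrm{CSS}}{n - 1}
\]
```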

      Circle2.png The values listed in this section are somewhat redundant. They are grouped here for convenience as measures of location (mean, median, and mode) and measures of variability (standard deviation, variance, range, and interquartile range).

      Circle3.png This section displays a number of statistical tests that determine whether various measures of central location are significantly different from a theoretical value (mu0). The default value is mu0=0. You can change the default to another value by using the procedure option MU0=n, where n is the nonzero value of your choice.
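
      As a sketch of how you might change the theoretical value, the program below tests whether the location of SBP differs from 120. The value 120 is arbitrary and used only for illustration. Note that in the PROC UNIVARIATE statement this option is written MU0=.

```sas
title "Testing Whether Mean SBP Differs from 120";
proc univariate data=example.Blood_Pressure mu0=120;
   var SBP;
run;
```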

      The tests listed in this section are a one-sample Student’s t-test, a sign test, and a signed-rank test (also known as the Wilcoxon signed-rank test). These statistics are discussed in Chapter 5 (one-sample t-test) and Chapter 12 (the sign and Wilcoxon tests).

Image526.png

      Circle4.png Continuing the examination of the PROC UNIVARIATE output, you see a list of commonly used quantiles. The most useful values are the lowest value (0% Min), first quartile (25% Q1), median (50% Median), third quartile (75% Q3), and the maximum value (100% Max). If you supply PROC UNIVARIATE with some options, it can compute