observations, mean, standard deviation, minimum, and maximum.
Output 2.4.1: Default Statistics and Behavior for PROC MEANS
Variable | N | Mean | Std Dev | Minimum | Maximum |
SERIAL | 1159062 | 621592.24 | 359865.41 | 2.0000000 | 1245246.00 |
COUNTYFIPS | 1159062 | 42.2062901 | 78.9543285 | 0 | 810.0000000 |
METRO | 1159062 | 2.5245354 | 1.3085302 | 0 | 4.0000000 |
CITYPOP | 1159062 | 2916.66 | 12316.27 | 0 | 79561.00 |
MortgagePayment | 1159062 | 500.2042634 | 737.9885592 | 0 | 7900.00 |
HHIncome | 1159062 | 63679.84 | 66295.97 | -29997.00 | 1739770.00 |
HomeValue | 1159062 | 2793526.49 | 4294777.18 | 5000.00 | 9999999.00 |
SAS differentiates variable types as numeric and character only; therefore, variables stored as numeric that are not quantitative are summarized even if those summaries do not make sense. Here, the Serial, CountyFIPS, and Metro variables are stored as numbers, but their means and standard deviations are meaningless because the variables are nominal. It is therefore important to understand the true role and level of measurement (for instance, nominal versus ratio) of each variable in the data set being analyzed.
To select the variables for analysis, the MEANS procedure includes the VAR statement. Any variable listed in the VAR statement must be numeric and should also be appropriate for quantitative summary statistics. As in the previous example, the summary for each variable is listed in its own row in the output table. (If only one variable is provided, it is named in the header above the table instead of in the first column.) Program 2.4.2 modifies Program 2.4.1 to summarize only the truly quantitative variables from BookData.IPUMS2005Basic, with the results shown in Output 2.4.2.
Program 2.4.2: Selecting Analysis Variables Using the VAR Statement in MEANS
proc means data=BookData.IPUMS2005Basic;
var Citypop MortgagePayment HHIncome HomeValue;
run;
Output 2.4.2: Selecting Analysis Variables Using the VAR Statement in MEANS
Variable | N | Mean | Std Dev | Minimum | Maximum |
CITYPOP | 1159062 | 2916.66 | 12316.27 | 0 | 79561.00 |
MortgagePayment | 1159062 | 500.2042634 | 737.9885592 | 0 | 7900.00 |
HHIncome | 1159062 | 63679.84 | 66295.97 | -29997.00 | 1739770.00 |
HomeValue | 1159062 | 2793526.49 | 4294777.18 | 5000.00 | 9999999.00 |
The default summary statistics for PROC MEANS can be changed by including statistic keywords as options in the PROC MEANS statement. The full set of available keywords is listed in the SAS Documentation, and any subset of them may be used. The keywords replace the default statistic set, and the order in which they are listed determines the order of the statistic columns in the table. One common set of statistics is the five-number summary (minimum, first quartile, median, third quartile, and maximum), and Program 2.4.3 provides a way to generate these statistics for the four variables summarized in the previous example.
Program 2.4.3: Setting the Statistics to the Five-Number Summary in MEANS
proc means data=BookData.IPUMS2005Basic min q1 median q3 max;
var Citypop MortgagePayment HHIncome HomeValue;
run;
Output 2.4.3: Setting the Statistics to the Five-Number Summary in MEANS
Variable | Minimum | Lower Quartile | Median | Upper Quartile | Maximum |
CITYPOP | 0 | 0 | 0 | 0 | 79561.00 |
MortgagePayment | 0 | 0 | 0 | 830.0000000 | 7900.00 |
HHIncome | -29997.00 | 24000.00 | 47200.00 | 80900.00 | 1739770.00 |
HomeValue | 5000.00 | 112500.00 | 225000.00 | 9999999.00 | 9999999.00 |
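Because the keyword list fully replaces the default set, any default statistic that is still wanted must be requested explicitly. The following is a minimal sketch, not one of the numbered programs in this chapter, that keeps N and the mean alongside the median and adds NMISS (the count of missing values). The particular keyword selection is an assumption chosen for illustration.

```sas
/* Illustrative sketch: the keyword selection is an assumption, not from the text. */
/* Statistic columns appear in the order the keywords are listed.                  */
proc means data=BookData.IPUMS2005Basic n nmiss mean median std;
  var Citypop MortgagePayment HHIncome HomeValue;
run;
```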
Confidence limits for the mean are included in the keyword set, both as a pair with the CLM keyword and separately with LCLM and UCLM. The default confidence level is 95%, but it can be changed by setting the error rate with the ALPHA= option. Consider Program 2.4.4, which constructs 99% confidence intervals for the means and places the estimated mean between the lower and upper limits.
Program 2.4.4: Using the ALPHA= Option to Modify Confidence Levels
proc means data=BookData.IPUMS2005Basic lclm mean uclm alpha=0.01;
var Citypop MortgagePayment HHIncome HomeValue;
run;
Output 2.4.4: Using the ALPHA= Option to Modify Confidence Levels
Variable | Lower 99%CL for Mean | Mean | Upper 99%CL for Mean |
CITYPOP | 2887.19 | 2916.66 | 2946.12 |
MortgagePayment | 498.4385749 | 500.2042634 | 501.9699520 |
HHIncome | 63521.22 | 63679.84 | 63838.46 |
HomeValue | 2783250.94 | 2793526.49 | 2803802.04 |
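As a check on how these limits arise, the sketch below recomputes the 99% limits for CITYPOP by hand, using the t-based interval for a mean (mean ± t × Std Dev / √N) and the rounded values from Output 2.4.2 as inputs. The DATA step and its variable names are illustrative and are not part of the chapter's programs. Because the inputs are rounded, the results should be expected to agree with Output 2.4.4 only up to rounding in the second decimal place.

```sas
/* Hypothetical check: inputs are the rounded CITYPOP values from Output 2.4.2. */
data _null_;
  n     = 1159062;
  mean  = 2916.66;
  std   = 12316.27;
  alpha = 0.01;                       /* error rate, matching ALPHA=0.01       */
  t     = tinv(1 - alpha/2, n - 1);   /* t critical value with n-1 df          */
  half  = t * std / sqrt(n);          /* half-width of the confidence interval */
  lower = mean - half;
  upper = mean + half;
  put lower= 12.2 upper= 12.2;        /* results are written to the SAS log    */
run;
```

Changing alpha to 0.05 in this sketch reproduces the default 95% level.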
Options are also available for controlling the column display; for example, the MAXDEC= option sets the maximum number of decimal places shown. Program 2.4.5 modifies the previous example to report the statistics to a single decimal place.
Program 2.4.5: Using MAXDEC= to Control Precision of Results
proc means data=BookData.IPUMS2005Basic lclm mean uclm alpha=0.01 maxdec=1;
var Citypop MortgagePayment HHIncome HomeValue;
run;
Output 2.4.5: Using MAXDEC= to Control Precision of Results
Variable | Lower 99%CL for Mean | Mean | Upper 99%CL for Mean |
CITYPOP | 2887.2 | 2916.7 | 2946.1 |
MortgagePayment | 498.4 | 500.2 | 502.0 |
HHIncome | 63521.2 | 63679.8 | 63838.5 |
HomeValue | 2783250.9 | 2793526.5 | 2803802.0 |
MAXDEC= is limited in that it sets the precision for all columns. Also, no direct formatting of the statistics is available. The REPORT procedure, introduced in Chapter 4 and discussed in detail in Chapters 6 and 7, provides much more control over the displayed table at the cost of increased complexity of the syntax.
2.4.2 Using the CLASS Statement in PROC MEANS
In many instances, it is desirable to split an analysis across a set of categories and, if those categories are defined by a variable in the data set, PROC MEANS can separate those analyses using a CLASS statement. The CLASS statement accepts either numeric or character variables; however, the role assigned to class variables by SAS is special. Any variable included in the CLASS statement (regardless of type) is treated as categorical, so each distinct value of the variable defines its own category. Therefore, variables used in the CLASS statement should provide useful groupings or, as shown in Section 2.5, be formatted into a set of desired groups. Two examples follow: the first (Program 2.4.6) illustrates a reasonable class variable, and the second (Program 2.4.7) shows a poor choice.
Program 2.4.6: Setting a Class Variable in PROC MEANS
proc means data=BookData.IPUMS2005Basic;
class MortgageStatus;
var HHIncome;
run;
Output 2.4.6: Setting a Class Variable in PROC MEANS
Analysis Variable : HHIncome
MortgageStatus | N Obs | N | Mean | Std Dev | Minimum | Maximum |
N/A | 303342 | 303342 | 37180.59 | 39475.13 | -19998.00 | 1070000.00 |
No, owned free and clear | 300349 | 300349 | 53569.08 | 63690.40 | -22298.00 | 1739770.00 |
Yes, contract to purchase | 9756 | 9756 | 51068.50 | 46069.11 | -7599.00 | 834000.00 |
Yes, mortgaged/ deed of trust or similar debt | 545615 | 545615 | 84203.70 | 72997.92 | -29997.00 | 1407000.00 |
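The CLASS statement can be combined with the statistic keywords and options shown earlier in this section. As a sketch (the keyword set and MAXDEC= value are assumptions chosen for illustration, and this is not one of the chapter's numbered programs), the following reports the count, mean, and median of HHIncome within each MortgageStatus category, rounded to whole numbers.

```sas
/* Illustrative sketch: the keyword set and MAXDEC=0 are assumptions. */
proc means data=BookData.IPUMS2005Basic n mean median maxdec=0;
  class MortgageStatus;   /* one row of statistics per distinct value */
  var HHIncome;
run;
```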
In these data, MortgageStatus provides a clear set of distinct categories and is potentially useful for subsetting the summarization of the data. In Program 2.4.7, Serial is used as an extreme example of a poor choice: because Serial is unique to each household, nearly every category it defines contains only a single household's records.
Program 2.4.7: A Poor Choice for a Class Variable
proc means data=BookData.IPUMS2005Basic;
class Serial;
var HHIncome;
run;
Output 2.4.7: A Poor Choice for a Class Variable (Partial Table Shown)
Analysis
|