James Blum

Fundamentals of Programming in SAS


Скачать книгу

outstanding mortgage.

      Program 2.6.1: Conditioning on a Variable Not Used in the Analysis

      proc means data=BookData.IPUMS2005Basic nonobs maxdec=0;

      class Metro;

      var HHIncome;

      format Metro Metro.;

      where Metro in (2,3,4)

      and

      MortgageStatus in

      (‘Yes, contract to purchase’,

      ‘Yes, mortgaged/ deed of trust or similar debt’);

      run;

      Output 2.6.1: Conditioning on a Variable Not Used in the Analysis

Analysis Variable : HHIncome
METRONMeanStd DevMinimumMaximum
Metro, Inside City578818627782749-199981361000
Metro, Outside City1910219631980292-299971266000
Metro, City Status Unknown1673598387972010-199981407000

      The condition on the MortgageStatus variable is a bit daunting, particularly noting that matching character values is a precise operation. Seemingly simple differences like casing or spacing lead to values that are non-matching. Therefore, the literals used in Program 2.6.1 are specified to be an exact match for the data. In Section 3.9, functions are introduced that are useful in creating consistency among character values, along with others that allow for extraction and use of relevant portions of a string. However, the WHERE statement provides some special operators, shown in Table 2.6.4, that allow for simplification in these types of cases without the need to intervene with a function.

      Table 2.6.4: Operators for General Comparisons

SymbolMnemonicLogic
?CONTAINSTrue result if the specified value is contained in the data value (character only).
LIKETrue result if data value matches the specified value which may include wildcards. _ is any single character, % is any set of characters.

      Program 2.6.2 offers two methods for simplifying the condition on MortgageStatus, one using CONTAINS, the other using LIKE. Either reproduces Output 2.6.1.

      Program 2.6.2: Conditioning on a Variable Using General Comparison Operators

      proc means data=BookData.IPUMS2005Basic nonobs maxdec=0;

      class Metro;

      var HHIncome;

      format Metro Metro.;

      where Metro in (2,3,4) and MortgageStatus contains ’Yes’;

      run;

      proc means data=BookData.IPUMS2005Basic nonobs maxdec=0;

      class Metro;

      var HHIncome;

      format Metro Metro.;

      where Metro in (2,3,4) and MortgageStatus like ’%Yes%’;

      run;

       CONTAINS checks to see if the data value contains the string Yes; again, note that the casing must be correct to ensure a match. Also, ensure single or double quotation marks enclose the value to search for—in this case, without the quotation marks, Yes forms a legal variable name and is interpreted by the compiler as a reference to a variable.

       LIKE allows for the use of wildcards as substitutes for non-essential character values. Here the % wildcard before and after Yes results in a true condition if Yes appears anywhere in the string and is thus logically equivalent to the CONTAINS conditioning above.

      To produce tables of frequencies and relative frequencies (percentages) like those shown for the case study in Outputs 2.2.3 and 2.2.4, the FREQ procedure is the tool of choice, and this section covers its fundamentals.

      As in previous sections, the examples here use the IPUMS2005Basic SAS data set, so make sure the BookData library is assigned. As a first step, enter and submit Program 2.7.1. (Note that the use of labels has been re-established in the OPTIONS statement.)

      Program 2.7.1: PROC FREQ with Variables Listed Individually in the TABLE Statement

      options label;

      proc freq data=BookData.IPUMS2005Basic;

      table metro mortgageStatus;

      run;

      The TABLE statement allows for specification of the variables to summarize, and a space-delimited list of variables produces a one-way frequency table for each, as shown in Output 2.7.1.

      Output 2.7.1: PROC FREQ with Variables Listed Individually in the TABLE Statement

Metropolitan status
METROFrequencyPercentCumulativeFrequencyCumulativePercent
0920287.94920287.94
123077519.9132280327.85
215436813.3247717141.17
334098229.4281815370.59
434090929.411159062100.00
MortgageStatusFrequencyPercentCumulativeFrequencyCumulativePercent
N/A30334226.1730334226.17
No, owned free and clear30034925.9160369152.08
Yes, contract to purchase97560.8461344752.93
Yes, mortgaged/ deed of trust or similar debt54561547.071159062100.00

      The TABLE statement is not required; however, in that case, the default behavior produces a one-way frequency table for every variable in the data set. Therefore, both types of SAS variables, character or numeric, are legal in the TABLE statement. Given that variables listed in the TABLE statement are treated as categorical (in the same manner as variables listed in the CLASS statement in PROC MEANS), it is best to have the summary variables be categorical or be formatted into a set of categories.

      The default summaries in a one-way frequency table are: frequency (count), percent, cumulative frequency, and cumulative percent. Of course, the cumulative statistics only make sense if the categories are ordinal, which these are not. Many options are available in the table statement to control what is displayed, and one is given in Program 2.7.2 to remove the cumulative statistics.

      Program 2.7.2: PROC FREQ Option for Removing Cumulative Statistics

      proc freq data=BookData.IPUMS2005Basic;

      table metro mortgageStatus / nocum;

      run;

      As with the CLASS statement in the MEANS procedure, variables listed in the TABLE statement in PROC FREQ use the format provided with the variable to construct the categories. Program 2.7.3 uses a format defined in Program 2.5.6 to bin the MortgagePayment variable into categories and, as this is an ordinal set, the cumulative statistics are appropriate.

      Program 2.7.3: Using a Format to Control Categories for a Variable in the TABLE Statement

      proc format;

      value Mort

      0=’None’

      1-350=”$350 and Below”

      351-1000=”$351 to $1000”

      1001-1600=”$1001 to $1600”

      1601-high=”Over $1600”

      ;

      run;

      proc freq data=BookData.IPUMS2005Basic;

      table MortgagePayment;

      format