For this case, a simplified format that distinguishes metro, non-metro, and non-identifiable observations may be desired. Program 2.5.3 contains two approaches to this, the first being clearly the most efficient.
Program 2.5.3: Assigning Multiple Values to the Same Formatted Value
proc format;
value MetroB
0 = “Not Identifiable”
1 = “Not in Metro Area”
2,3,4 = “In a Metro Area”
;
value MetroC
0 = “Not Identifiable”
1 = “Not in Metro Area”
2 = “In a Metro Area”
3 = “In a Metro Area”
4 = “In a Metro Area”
;
run;
A comma-separated list of values is legal on the left side of each assignment, which assigns the formatted value to each listed data value.
This format accomplishes the same result; however, it is important that the literal values on the right side of the assignment are exactly the same. Differences in even simple items like spacing or casing results in different formatted values.
Either format given in Program 2.5.3 can replace the Metro format in Program 2.5.2 to create the result in Output 2.5.3.
Output 2.5.3: Assigning Multiple Values to the Same Formatted Value
Analysis Variable : HHIncome | |||||
METRO | N | Mean | Std Dev | Minimum | Maximum |
Not Identifiable | 92028 | 54800 | 52333 | -19998 | 1076000 |
Not in Metro Area | 230775 | 47856 | 45547 | -29997 | 1050000 |
In a Metro Area | 836259 | 69024 | 71495 | -29997 | 1739770 |
It is also possible to use the dash character as an operator in the form of ValueA-ValueB to define a range on the left side of any assignment, which assigns the formatted value to every data value between ValueA and ValueB, inclusive. Program 2.5.4 gives an alternate strategy to constructing the formats given in Program 2.5.3 and that format can also be placed into Program 2.5.2 to produce Output 2.5.3.
Program 2.5.4: Assigning a Range of Values to a Single Formatted Value
proc format;
value MetroD
0 = “Not Identifiable”
1 = “Not in Metro Area”
2-4 = “In a Metro Area”
;
run;
Certain keywords are also available for use on the left side of an assignment, one of which is OTHER. OTHER applies the assigned format to any value not listed on the left side of an assignment elsewhere in the format definition. Program 2.5.5 uses OTHER to give another method for creating a format that can be used to generate Output 2.5.3. It is important to note that using OTHER often requires significant knowledge of exactly what values are present in the data set.
Program 2.5.5: Assigning a Range of Values to a Single Formatted Value
proc format;
value MetroE
0 = “Not Identifiable”
1 = “Not in Metro Area”
other = “In a Metro Area”
;
run;
In general, value ranges should be non-overlapping, and the < symbol—called an exclusion operator in this context—can be used at either end (or both ends) of the dash to indicate the value should not be included in the range. Overlapping ranges are discussed in Chapter Note 5 in Section 2.12. Using exclusion operators to create non-overlapping ranges allows for the categorization of a quantitative variable without having to know the precision of measurement. Program 2.5.6 gives two variations on creating bins for the MortgagePayment data and uses those bins as classes in PROC MEANS, with the results shown in Output 2.5.6A and Output 2.5.6B.
Program 2.5.6: Binning a Quantitative Variable Using a Format
proc format;
value Mort
0=’None’
1-350=”$350 and Below”
351-1000=”$351 to $1000”
1001-1600=”$1001 to $1600”
1601-high=”Over $1600”
;
value MortB
0=’None’
1-350=”$350 and Below”
350<-1000=”Over $350, up to $1000”
1000<-1600=”Over $1000, up to $1600”
1600<-high=”Over $1600”
;
run;
proc means data=BookData.IPUMS2005Basic nonobs maxdec=0;
class MortgagePayment;
var HHIncome;
format MortgagePayment Mort.;
run;
proc means data=BookData.IPUMS2005Basic nonobs maxdec=0;
class MortgagePayment;
var HHIncome;
format MortgagePayment MortB.;
run;
The keywords LOW and HIGH are available so that the maximum and minimum values need not be known. When applied to character data, LOW and HIGH refer to the sorted alphanumeric values. Note that the LOW keyword excludes missing values for numeric variables but includes missing values for character variables.
In these value ranges, the values used exploit the fact that the mortgage payments are reported to the nearest dollar.
Using the < symbol to not include the starting ranges allows the bins to be mutually exclusive and exhaustive irrespective of the precision of the data values. The exclusion operator, <, omits the adjacent value from the range so that 350<-1000 omits only 350, 350-<1000 omits only 1000, and 350<-<1000 omits both 350 and 1000.
When a format is present for a class variable, the format is used to construct the unique values for each category, and this behavior persists in most cases where SAS treats a variable as categorical.
Output 2.5.6A: Binning a Quantitative Variable Using the Mort Format
Analysis Variable : HHIncome | |||||
MortgagePayment | N | Mean | Std Dev | Minimum | Maximum |
None | 603691 | 45334 | 53557 | -22298 | 1739770 |
$350 and Below | 59856 | 47851 | 42062 | -16897 | 841000 |
$351 to $1000 | 283111 | 64992 | 45107 | -19998 | 1060000 |
$1001 to $1600 | 128801 | 96107 | 63008 | -29997 | 1125000 |
Over $1600 | 83603 | 153085 | 117134 | -29997 | 1407000 |
Output 2.5.6B: Binning a Quantitative Variable Using the MortB Format
Analysis Variable : HHIncome | |||||
MortgagePayment | N | Mean | Std Dev | Minimum | Maximum |
None | 603691 | 45334 | 53557 | -22298 | 1739770 |
$350 and Below | 59856 | 47851 | 42062 | -16897 | 841000 |
Over
|