James Blum

Fundamentals of Programming in SAS



the name as DATAn, where n is the smallest whole number (1, 2, 3, …) that makes the data set name unique.

       The INFILE statement specifies the location of the file via a full path specification to the file—this path must be completed to reflect the location of the raw file for the code to execute successfully.

       The INPUT statement names each variable read from the raw file specified in the INFILE statement, with those names following the conventions outlined in Section 1.6.2. By default, SAS assumes the incoming variables are numeric. One way to indicate character data is shown here: place a dollar sign after each character variable.

       Good programming practice dictates that all steps end with an explicit step boundary, including the DATA step.

       The OBS= option selects the last observation for processing. Because procedures start with the first observation by default, this step uses the first five observations from the Utility2001 data set, as shown in Output 2.8.1.

      Output 2.8.1: Reading the Utility 2001 Data (Partial Listing)

Obs  Serial  Electric   Gas  Water  Fuel
  1       1      1800  9998   9998  9998
  2       2       480  1440   9998  9998
  3       3      2040   360    100  9998
  4       4      3000  9998    360  9998
  5       5       840  1320     90  9998

      In Program 2.8.1, Serial is read as a character variable; however, it contains only digits and therefore can be stored as numeric. The major advantage in storing Serial as character is size: its maximum value is six digits long and therefore requires six bytes of storage as character, while all numeric variables have a default size of eight bytes. The major disadvantage to storing Serial as character is ordering: for example, as a character value, 11 comes before 2. While the other four variables can also be read as character, doing so is a very poor choice because no mathematical or statistical operations can be done on character values. For examples in subsequent sections, Serial is read as numeric.
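      The ordering issue can be seen in a small sketch. The data set names (SerialDemo, ByChar, ByNum), the variable names, and the in-stream values below are hypothetical and are not part of the book's files; the same three serial numbers are read once as character and once as numeric, then sorted each way.

```sas
/* Hypothetical in-stream data: each record holds the same value twice, */
/* read first as character (SerialChar) and then as numeric (SerialNum) */
data work.SerialDemo;
  input SerialChar $ SerialNum;
  datalines;
2 2
11 11
100 100
;
run;

/* Character sort compares one character at a time: 100, 11, 2 */
proc sort data=work.SerialDemo out=work.ByChar;
  by SerialChar;
run;

/* Numeric sort compares magnitudes: 2, 11, 100 */
proc sort data=work.SerialDemo out=work.ByNum;
  by SerialNum;
run;

proc print data=work.ByChar;
run;

proc print data=work.ByNum;
run;
```

      Comparing the two listings shows why a variable that is used for ordering is usually easier to work with when stored as numeric.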

      In Program 2.8.1, the INFILE statement is used to specify the raw data file that the DATA step reads. In general, the INFILE statement may include references to a single file or to multiple files, with each reference provided one of the following ways:

       A physical path to the files. Physical paths can be either relative or absolute.

       A file reference created via the FILENAME statement.

      Program 2.8.1 is set up to use the first method, with either an absolute or relative path chosen. An absolute path starts with a drive letter or name, while any other specification is a relative path. All relative paths are built from the current working directory. (Refer to Section 1.5 for a discussion of the working directory and setting its value.) It is often more efficient to use a FILENAME statement to build references to external files or folders. Programs 2.8.2 and 2.8.3 demonstrate these uses of the FILENAME statement, producing the same data set as Program 2.8.1.
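      Before turning to the FILENAME statement, the difference between the two kinds of physical path can be sketched as follows. The folder name C:\HypotheticalFolder and the data set names are placeholders chosen only for illustration; they must be replaced with a real location, and the relative form assumes the current working directory is the folder that holds the raw file.

```sas
/* Absolute path: starts with a drive letter (the folder name is hypothetical) */
data work.UtilityAbs;
  infile "C:\HypotheticalFolder\Utility 2001.prn";
  input Serial$ Electric Gas Water Fuel;
run;

/* Relative path: built from the current working directory, so this */
/* assumes the working directory is the folder containing the file  */
data work.UtilityRel;
  infile "Utility 2001.prn";
  input Serial$ Electric Gas Water Fuel;
run;
```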

      Program 2.8.2: Using the FILENAME Statement to Point to an Individual File

      filename Util2001 "--insert path here--\Utility 2001.prn";

      data work.Utility2001A;
        infile Util2001;
        input Serial$ Electric Gas Water Fuel;
      run;

       The FILENAME statement creates a file reference, called a fileref, named Util2001. Naming conventions for a fileref are the same as those for a libref.

       The path specified, which can be relative or absolute as in Program 2.8.1, includes the file name. SAS assigns the fileref Util2001 to this file.

       The INFILE statement now references the fileref Util2001 rather than the path or file name. Note, quotation marks are not used on Util2001 since it is to be interpreted as a fileref and not a file name or path.

      Program 2.8.3: Associating the FILENAME Statement with a Folder

      filename RawData '--insert path to folder here--';

      data work.Utility2001B;
        infile RawData("Utility 2001.prn");
        input Serial$ Electric Gas Water Fuel;
      run;

       It is assumed here that the path, either relative or absolute, points to a folder and not a specific file. In that case, the FILENAME statement associates a folder with the fileref RawData. The path specified should be to the folder containing the raw files downloaded from the author page, much like the BookData library was assigned to the folder containing the SAS data sets.

       The INFILE statement references both the fileref and the file name. Although the file reference can be made without the quotation marks in certain cases, good programming practice includes the quotation marks.

      Since Programs 2.8.2 and 2.8.3 each generate the same result as Program 2.8.1 but actually require slightly more code, the benefits of using the FILENAME statement may not be obvious. The form of the FILENAME statement in Program 2.8.2 is useful if a single file needs to be read repeatedly under different conditions, allowing the multiple references to that file to be shortened. More commonly, the form used in Program 2.8.3 is more efficient when reading multiple files from a common location. Again, if the path specified is to the folder containing the raw files downloaded from the author page, the fileref RawData refers to the location for all non-SAS data sets used in examples for Chapters 2 through 7.
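      As a minimal sketch of that benefit, a folder-level fileref such as RawData can be assigned once and then reused in several DATA steps. The data set names Utility2001C and Utility2001First10 are hypothetical, and the INFILE option OBS=10 is used here only as one assumed example of reading the same file under a different condition.

```sas
/* Assign the folder once; the path placeholder must still be completed */
filename RawData '--insert path to folder here--';

/* Read the full utility file, with Serial read as numeric */
data work.Utility2001C;
  infile RawData("Utility 2001.prn");
  input Serial Electric Gas Water Fuel;
run;

/* Read the same file again, stopping after the tenth record */
data work.Utility2001First10;
  infile RawData("Utility 2001.prn") obs=10;
  input Serial Electric Gas Water Fuel;
run;
```

      If the raw files are later moved, only the FILENAME statement needs to change; each INFILE statement continues to work as written.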

      Input Data 2.8.4 includes a partial representation of the first five records from a comma-delimited file (IPUMS2005Basic.csv). Due to the width of the file, Input Data 2.8.4 truncates the third and fifth records.

      Input Data 2.8.4: Comma Delimited Raw File (Partial Listing)

----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+
2,Alabama,Not in identifiable city (or size group),0,4,73,Rented,N/A,0,12000,9999999
3,Alabama,Not in identifiable city (or size group),0,1,0,Rented,N/A,0,17800,9999999
4,Alabama,Not in identifiable city (or size group),0,4,73,Owned,"Yes, mortgaged/ deed
5,Alabama,Not in identifiable city (or size group),0,1,0,Rented,N/A,0,2000,9999999
6,Alabama,Not in identifiable city (or size group),0,3,97,Owned,"No, owned free and

      Not only is this file delimited by commas, but the eighth field on the third and fifth rows also includes data values containing a comma, with those values embedded in quotation marks. (Recall these records are truncated in the text due to their length so the final quote is not shown for these two records.) To successfully read this file, the DATA step must recognize the delimiter as a comma, but also that commas embedded in quoted values are not delimiters. The DSD option is introduced in Program 2.8.4 to read such a file.

      Program 2.8.4: Reading the 2005 Basic IPUMS CPS Data

      data work.Ipums2005Basic;
        infile RawData("IPUMS2005basic.csv") dsd;
        input Serial State $ City $ CityPop Metro
              CountyFIPS Ownership $ MortgageStatus $
              MortgagePayment HHIncome HomeValue;
      run;

      proc print data = work.Ipums2005Basic (obs=5);
      run;

       The DSD option included in the INFILE statement modifies the delimiter and some additional default behavior as listed below.

       Again, the INPUT statement names each of the variables read from the raw file in the INFILE statement and sets their types. By default, SAS assumes the incoming variables are numeric; however, State, City, Ownership, and MortgageStatus must be read as character values.
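      The handling of quoted commas can be tried on a small scale without the raw file. The sketch below uses in-stream records; the data set name DsdDemo and the data values are hypothetical, patterned loosely on the IPUMS fields, and only four variables are read.

```sas
/* Hypothetical in-stream records patterned after the IPUMS file */
data work.DsdDemo;
  /* DSD: commas delimit fields, except commas inside quoted values */
  infile datalines dsd;
  input Serial Ownership $ MortgageStatus $ MortgagePayment;
  datalines;
4,Owned,"Yes, mortgaged",900
5,Rented,N/A,0
;
run;

proc print data=work.DsdDemo;
run;
```

      With DSD in effect, the comma inside the quoted value on the first record is read as part of MortgageStatus and not as a delimiter, and the surrounding quotation marks are not stored. Because no lengths are set for the character variables, any value longer than eight characters is truncated in the resulting data set.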

      Output 2.8.4 shows that, while Program 2.8.4 executes successfully, the resulting