Programs 2.8.1 through 2.8.7 use simple list input for every variable, and Programs 2.8.8 and 2.8.9 use column input for every variable. However, the choice is not always between one style and the other. If a file contains some delimited fields while other fields occupy fixed positions, it is necessary to use multiple input styles simultaneously. This process, called mixed input, requires mastery of two other input methods covered in Chapter 3, modified list input and formatted input, along with a substantial understanding of how the DATA step processes raw data. For a discussion of the fifth and final input style, named input, see the SAS Documentation.
2.9 Details of the DATA Step Process
This section provides further details about how the DATA step functions. While this material can initially be considered optional for many readers, understanding it makes writing high-quality code easier by providing a foundation for how certain coding decisions lead to particular outcomes. This material is also essential for successful completion of the base certification exam.
2.9.1 Introduction to the Compilation and Execution Phases
SAS processes every step in Base SAS, including the DATA step, in two phases: compilation and execution. The DATA steps seen so far in this text have several elements in common: each reads data from one or more sources (for example, a SAS data set or a raw data file), and each creates a data set as a result. For DATA steps such as these, the flowchart in Figure 2.9.1 provides a high-level overview of the actions taken by SAS upon submission of a DATA step. Details about the individual actions are included in this section, in the Chapter Notes in Section 2.12, and in subsequent chapters.
Figure 2.9.1: Flowchart of Default DATA Step Actions
Compilation Phase
During the compilation phase, SAS begins by tokenizing the submitted code and sending complete statements to the compiler. (For more details, see Chapter Note 8 in Section 2.12.) Once a complete statement is sent to the compiler, the compiler checks the statement for syntax errors. If there is a syntax error, SAS attempts to make a substitution that creates legal syntax and prints a warning to the SAS log indicating the substitution made. For example, misspelling the keyword DATA as DAAT produces the following warning.
WARNING 14-169: Assuming the symbol DATA was misspelled as daat.
Be sure to review these warnings and correct the syntax even if SAS makes an appropriate substitution. If there is a syntax error and SAS cannot make a substitution, then an error message is printed to the log, and the current step is not executed. For example, misspelling the keyword DATA as DSTS results in the following error.
ERROR 180-322: Statement is not valid or it is used out of proper order.
If there is not a syntax error, or if SAS can make a substitution to correct a syntax error, then the compilation phase continues to the next statement, tokenizes it, and checks it for syntax errors. This process continues until SAS compiles all statements in the current DATA step.
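As a minimal sketch of the substitution behavior described above, the following step deliberately misspells DATA as DAAT. The data set name work.Demo and the variable x are hypothetical names used only for illustration. Based on the description above, SAS should substitute the correct keyword, write WARNING 14-169 to the log, and continue compiling the step.

```sas
/* Deliberate misspelling: DAAT instead of DATA.                 */
/* work.Demo and x are hypothetical names used for illustration. */
daat work.Demo;
  x = 1;
run;
```

Even when the step runs, the misspelling should be corrected in the program rather than left for SAS to repair on every submission.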
When reading raw data, SAS creates an input buffer to load individual records from the raw data and creates the program data vector to assign the parsed values to variables for later delivery to the SAS data set. During this process, SAS also creates the shell for the descriptor portion, or metadata, for the data set, which is accessible via procedures such as the CONTENTS procedure from Chapter 1. Of course, not all elements of the descriptor portion, such as the number of observations, are known during the compilation phase. Once the compilation phase ends, SAS enters the execution phase where the compiled code is executed. At the conclusion of the execution phase, SAS populates any such remaining elements of the descriptor portion.
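One way to inspect the descriptor portion is sketched below, assuming a small hypothetical data set named work.Demo with two made-up variables. The CONTENTS procedure reports attributes fixed during compilation (variable names, types, and lengths) alongside elements that are only known after execution, such as the number of observations.

```sas
/* work.Demo, Name, and Count are hypothetical and exist only for this sketch. */
data work.Demo;
  length Name $ 12;   /* length is recorded in the descriptor portion */
  Name  = 'Example';
  Count = 1;
run;

/* Display the descriptor portion (metadata) of the data set. */
proc contents data=work.Demo;
run;
```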
Execution Phase
The compilation phase creates the input buffer (when reading from a raw data source) and creates the program data vector; however, it is the execution phase that populates them. SAS begins by initializing the variables in the program data vector based on data type (character or numeric) and variable origin (for example, a raw data file or a SAS data set). SAS then executes the programming statements included in the DATA step. Certain statements, such as the LENGTH or FORMAT statements shown earlier in this chapter, are considered compile-time statements because SAS completes their actions during the compilation phase, and their effects cannot be altered during the execution phase. Statements active during the execution phase are referred to as execution-time statements.
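The distinction between compile-time and execution-time statements can be illustrated with a short sketch. The data set work.CompileDemo and its variables are hypothetical. Because LENGTH and FORMAT are handled during compilation, the FORMAT statement applies to Amount even though it appears after the assignment, and the assignment to Code cannot override the length already set for it.

```sas
/* All names here are hypothetical and chosen for illustration only. */
data work.CompileDemo;
  length Code $ 3;           /* compile-time: fixes the storage length of Code     */
  Code   = 'ABCDEF';         /* execution-time: the stored value is truncated      */
                             /* to the three characters the length allows          */
  Amount = 1234.5;           /* execution-time: assignment                         */
  format Amount dollar10.2;  /* compile-time: applies despite following the        */
                             /* assignment to Amount                               */
run;

proc print data=work.CompileDemo;
run;
```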
Finally, when SAS encounters the RUN statement (or any other step boundary), the default actions are as follows:
1. output the current values of user-selected variables to the data set
2. return to the top of the DATA step
3. reset the values in the input buffer and program data vector
At this point, the input buffer (if it exists) is empty, and the program data vector variables are incremented/reinitialized as appropriate so that the execution phase can continue processing the incoming data set. For more information about step boundaries, see Chapter Note 9 in Section 2.12.
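The default actions listed above can be observed with a sketch like the following, which writes the contents of the program data vector to the log at the top of each iteration and again after the INPUT statement. The data set name, variable names, and data lines are made up for illustration. The log should show Name and Score as missing at the top of every iteration and the automatic variable _N_ increasing by one each time.

```sas
/* work.LoopDemo, Name, and Score are hypothetical names. */
data work.LoopDemo;
  put 'Top of step: ' _all_;   /* PDV contents before any data are read  */
  input Name $ Score;          /* load a record and parse it into the PDV */
  put 'After INPUT: ' _all_;   /* PDV contents after parsing              */
  datalines;
Ann 90
Bob 85
;
run;
```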
When reading in data from various sources, the execution phase ends when it is determined that no more data can or should be read, based on the programming statements in the DATA step. Because there are multiple factors that affect this, an in-depth discussion is not provided here. Instead, as each new technique for reading and combining data is presented, a review of when the DATA step execution phase ends is included. This chapter includes examples of reading a single raw data file using an INFILE statement and, in this case, the execution phase ends when SAS encounters an end-of-file (EOF) marker in the incoming data source. For plain text files, the EOF marker is a non-printable character that lets software reading the file know that the file contains no further information. At the conclusion of the execution phase, SAS completes the content portion of the data set, which contains the data values, and finalizes the descriptor portion.
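To see the end-of-file behavior for a single raw data file, the sketch below first writes three records to a temporary file and then reads them back. The fileref rawtmp, the data set name, and the records are all hypothetical. Because the PUT statement precedes the INPUT statement, the log should show one more iteration starting than there are records: the final iteration begins, the INPUT statement finds no further data, and the execution phase ends without writing another observation.

```sas
/* rawtmp is a hypothetical fileref pointing to a temporary file. */
filename rawtmp temp;

/* Write three made-up records to the temporary file. */
data _null_;
  file rawtmp;
  put 'Ann 90';
  put 'Bob 85';
  put 'Cat 77';
run;

/* Read the file back; the step stops when INPUT reaches end-of-file. */
data work.EofDemo;
  put 'Iteration started: ' _N_=;
  infile rawtmp;
  input Name $ Score;
run;
```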
2.9.2 Building Blocks of a Data Set: Input Buffers and Program Data Vectors
Input Buffer
When reading raw data, SAS needs to parse the characters from the plain text in order to determine the values to place in the data set. Parsing involves dividing the text into groups of characters and interpreting each group as a value for a variable. To facilitate this, the default is for SAS to read a single line of text from the raw file and place it into the input buffer—a section of logical memory. In the input buffer, SAS places each character into its own column and uses a column pointer to keep track of the column the INPUT statement is currently reading.
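The contents of the input buffer can be written to the log with the _INFILE_ specification in a PUT statement, as in the following sketch. The names and data lines are hypothetical, and column input is used so that the column positions read by the INPUT statement are explicit: Name occupies columns 1 through 5 and Score occupies columns 7 and 8.

```sas
/* work.BufferDemo, Name, and Score are hypothetical names. */
data work.BufferDemo;
  input Name $ 1-5 Score 7-8;       /* read fixed columns from the input buffer */
  put 'Input buffer:  ' _infile_;   /* the raw record as currently loaded       */
  put 'Parsed values: ' Name= Score=;
  datalines;
Ann   90
Bob   85
;
run;
```

Comparing the two log lines for each record shows the raw text alongside the values SAS parsed from it.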
Program Data Vector
Regardless of the data source used in a DATA step (raw data files or SAS data sets), a program data vector (PDV) is created. Like the input buffer, the PDV is a section of logical memory; but, instead of storing raw, unstructured data, the PDV is where SAS stores variable values. SAS determines these values in potentially many ways: by parsing information in the input buffer, by reading values from structured sources such as Excel spreadsheets or SAS data sets, or by executing programming statements in the DATA step. Just as the input buffer holds a single line of raw text, the PDV holds only the values of each variable for a single record.
In addition to user-defined variables, SAS places automatic variables into the PDV. Two automatic variables, _N_ and _ERROR_, are present in every PDV. By default, the DATA step acts as a loop that repeatedly processes any executable statements and builds the final data set one record at a time. Each pass through this loop is referred to as an iteration, and iterations are tracked by the automatic variable _N_. _N_ is a counter that keeps track of the number of DATA step iterations (how many times the DATA statement has executed) and is initialized to one at invocation of the DATA step. _N_ is not necessarily the same as the number of observations in the data set, since programming elements are available to selectively output records to the final data set. Similarly, certain statements and options are available to select only a subset of the variables for the final data set.
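The difference between _N_ and the number of observations can be sketched as follows, using hypothetical names and data. Automatic variables are not written to the output data set, so the value of _N_ is copied into an ordinary variable named Iteration. Only records with a Score of at least 80 are output, so the resulting data set contains fewer observations than the DATA step has iterations.

```sas
/* work.HighScores, Name, Score, and Iteration are hypothetical names. */
data work.HighScores;
  input Name $ Score;
  Iteration = _N_;               /* keep a copy of the iteration counter    */
  if Score >= 80 then output;    /* write only selected records             */
  datalines;
Ann 90
Bob 65
Cat 85
;
run;

/* The second record is read but never output, so Iteration skips a value. */
proc print data=work.HighScores;
run;
```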
The second automatic variable, _ERROR_, is an indicator