file. Following the keyword INFILE, you place the filename in single or double quotes. The LENGTH statement tells SAS that the variable Gender is character (the dollar sign indicates this) and that you want to store Gender in 1 byte (the 1 indicates this). The INPUT statement lists the variable names in the same order as the values in the text file. Because you already told SAS that Gender is a character variable, the dollar sign following the name Gender on the INPUT statement is not necessary. If you had not included a LENGTH statement, the dollar sign following Gender on the INPUT statement would have been necessary. SAS assumes variables are numeric unless you tell it otherwise.
The RUN statement ends the program. Because this program starts with the keyword DATA, it is called a DATA step. The previous two programs demonstrated PROC steps. SAS programs are typically made up of DATA and PROC steps. Each step ends with a RUN statement.
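To see the role of the dollar sign when there is no LENGTH statement, here is a minimal sketch that reads the same values from in-stream data instead of an external file. The data set name Sample_NoLength and the use of DATALINES are assumptions made for illustration; they are not part of Program 1.3.

```sas
* Sketch only: list input with no LENGTH statement;
data Sample_NoLength;
   /* The dollar sign is required here; without it SAS would
      treat Gender as numeric and the values M and F would be invalid */
   input ID Age Gender $;
datalines;
1 23 M
2 33 F
3 18 F
4 45 M
5 41 M
6 . F
;
run;
```

Because there is no LENGTH statement in this version, SAS stores Gender in the default length of 8 bytes rather than 1.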
As you did earlier, you can use PROC PRINT to list the observations in the Sample2 data set (as shown in Program 1.4):
Program 1.4: Using PROC PRINT to List the Observations in Data Set Sample2
title "Listing of Data Set Sample2";
proc print data=Sample2;
run;
Here is the listing:
Reading CSV Files
You can make a very small change to Program 1.3 to read the same data from a CSV file. Following is a listing of such a file:
A CSV Text File: c:\books\Statistics by Example\comma.csv
1,23,M
2,33,F
3,18,F
4,45,M
5,41,M
6,,F
Notice that you no longer need the period in subject 6 because, in the tradition of CSV files, two commas in a row indicate a missing value.
The only change you need to make to Program 1.3 is to use an option called DSD on the INFILE statement. The DSD option specifies that two consecutive commas represent a missing value and that the default delimiter is a comma. Here is the modified program:
Program 1.5: Reading a CSV File
data Sample2;
   infile 'c:\books\statistics by example\comma.csv' dsd;
   length Gender $ 1;
   input ID Age Gender $;
run;
This program produces a SAS data set identical to the one created by Program 1.3.
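If you want to try the DSD option without creating an external file, the following sketch reads comma-separated values from in-stream data and lists the result. The data set name CsvDemo and the use of INFILE DATALINES are assumptions made for illustration, not part of Program 1.5.

```sas
* Sketch only: DSD applied to in-stream, comma-separated data;
data CsvDemo;
   infile datalines dsd;   /* comma delimiter; two commas in a row mean missing */
   length Gender $ 1;
   input ID Age Gender;    /* no dollar sign needed after the LENGTH statement */
datalines;
1,23,M
2,33,F
6,,F
;
run;

title "Listing of Data Set CsvDemo";
proc print data=CsvDemo;
run;
```

In the listing, the Age value for subject 6 should appear as a period, which is how SAS displays a missing numeric value by default.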
If your CSV file contains variable names in the first row, then the Import Wizard uses these variable names when it creates the SAS data set. Actually, you can use the Import Wizard even if the first row does not contain variable names. If you do, SAS will name the variables F1, F2, etc. This approach is not recommended.
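If you would rather write code than step through the wizard, PROC IMPORT can do a similar job. The sketch below is not the wizard itself. The file name names.csv and the data set name Sample3 are hypothetical, and the file is assumed to have variable names in its first row.

```sas
* Sketch only: importing a CSV file whose first row holds variable names;
* The file name and output data set name are hypothetical;
proc import datafile="c:\books\statistics by example\names.csv"
            out=Sample3
            dbms=csv
            replace;      /* overwrite Sample3 if it already exists */
   getnames=yes;          /* take variable names from the first row */
run;
```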
Data Values in Fixed Columns
You might have a raw text file in which the value for each variable is in a fixed column. SAS has two methods for reading this type of data: column input and formatted input. For column input, you follow each variable name on the INPUT statement with the starting and ending columns for that value. If you want to create a character variable, you place a dollar sign between the variable name and the column specifications.
For example, if you have ID data in columns 1–3, Age in columns 4–6, and Gender in column 7 of your raw data file, your input statement might look like this:
input ID $ 1-3 Age 4-6 Gender $ 7;
Stylistically, you might prefer to write this statement on three lines, like this (so that the variable names line up):
input ID $ 1-3
Age 4-6
Gender $ 7;
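Placed in a complete DATA step, the column input statement above might be used as in the following sketch. The data set name ColumnDemo and the in-stream data lines are invented for illustration. Each line follows the layout described above: ID in columns 1–3, Age in columns 4–6, and Gender in column 7.

```sas
* Sketch only: column input with invented in-stream data;
data ColumnDemo;
   input ID     $ 1-3
         Age      4-6
         Gender $ 7;
datalines;
001 23M
002 33F
003   F
;
run;
```

In the third data line, columns 4–6 are blank, so SAS assigns a missing value to Age for that observation.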
For formatted input, you specify the starting column for the variable using an at sign (@) (called a column pointer) followed by the starting column number. Next, you put your variable name, followed by a SAS informat—a specification of how to read and interpret the next n columns. An equivalent statement to read the same data for ID, Age, and Gender using formatted input is:
input @1 ID $3.
@4 Age 3.
@7 Gender $1.;
The informat $3. tells SAS to read three columns of character data; the 3. informat says to read three columns of numeric data; the $1. informat says to read one column of character data. In general, the informats n. and $n. read n columns of numeric and character data, respectively.
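A complete DATA step using the formatted input statement above might look like the following sketch. The data set name FormattedDemo and the in-stream data lines are invented for illustration and use the same fixed-column layout: ID in columns 1–3, Age in columns 4–6, and Gender in column 7.

```sas
* Sketch only: formatted input with invented in-stream data;
data FormattedDemo;
   input @1 ID     $3.    /* three columns of character data */
         @4 Age     3.    /* three columns of numeric data   */
         @7 Gender $1.;   /* one column of character data    */
datalines;
001 23M
002 33F
;
run;
```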
The INPUT statement is actually quite powerful and enables you to read both simple and complex data structures. For a complete description of how the INPUT statement works, see Learning SAS by Example: A Programmer’s Guide or one of the other publications available from SAS Press.
Excel Files with Invalid SAS Variable Names
What if your Excel file contains variable names in the first row that are not valid SAS names? Take a look at the following spreadsheet:
Three of the four variable names are not valid SAS variable names because they contain either blanks or invalid characters (percent sign and dashes). What happens when you use the Import Wizard to convert this spreadsheet into a SAS data set? SAS substitutes an underscore character in place of each invalid character in the name. A SAS data set created from this spreadsheet would contain the variables ID, Ht_in_Inches, _Fat, and Wt_in_Lbs.
It is possible to use SAS variable names that contain invalid characters. To include such variables, you need to set a system option called VALIDVARNAME and refer to the variable names using a special notation called a name literal. Using such variables is not recommended, however, because doing so creates added complications.
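For readers who are curious, the following sketch shows roughly what this looks like. The variable names, data values, and data set name are all invented for illustration; they are not taken from the spreadsheet. A name literal is the name in quotes followed immediately by the letter n.

```sas
* Sketch only: names and values below are invented;
options validvarname=any;   /* allow blanks and special characters in names */

data NameLiteralDemo;
   input ID 'Ht in Inches'n 'Wt in Lbs'n;
datalines;
1 68 155
2 63 102
;
run;

proc means data=NameLiteralDemo;
   var 'Ht in Inches'n 'Wt in Lbs'n;   /* each reference needs the quotes and the n */
run;

options validvarname=v7;    /* return to the standard naming rules */
```

Every reference to such a variable requires the quotes and the trailing n, which is one of the complications mentioned above.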
Other Sources of Data
The bottom line is that SAS can read data from just about anywhere. Using the Import Wizard, for example, you can read Excel, Access, CSV, tab-delimited, dBASE, JMP (a SAS product), Lotus, SPSS, Stata, and Paradox files. In addition, SAS can read data from most of the major mainframe database systems such as Oracle and DB2.
Conclusions
You now know how to use the Display Manager or another editor to write your SAS programs, and you know how to read your data from a variety of sources. Now you are ready to start using SAS procedures to analyze your data. In the remaining chapters of this book, you will learn how to create descriptive statistics and how to run most of the commonly used inferential statistical tasks.
Chapter 2 Descriptive Statistics – Continuous Variables
Computing Descriptive Statistics Using PROC MEANS
Descriptive Statistics Broken Down by a Classification Variable
Computing a 95% Confidence Interval and the Standard Error
Producing Descriptive Statistics, Histograms, and Probability Plots
Changing the Midpoint Values on the Histogram