Hands-On SAS for Data Analysis

By Harish Gulati
    Advance your knowledge in tech with a Packt subscription

  • Instant online access to over 7,500+ books and videos
  • Constantly updated with 100+ new titles each month
  • Breadth and depth in over 1,000+ technologies
  1. Introduction to SAS Programming

About this book

SAS is one of the leading enterprise tools in the world today when it comes to data management and analysis. It enables the fast and easy processing of data and helps you gain valuable business insights for effective decision-making. This book will serve as a comprehensive guide that will prepare you for the SAS certification exam.

After a quick overview of the SAS architecture and components, the book will take you through the different approaches to importing and reading data from different sources using SAS. You will then cover SAS Base and 4GL, understanding data management and analysis, along with exploring SAS functions for data manipulation and transformation. Next, you'll discover SQL procedures and get up to speed on creating and validating queries. In the concluding chapters, you'll learn all about data visualization, right from creating bar charts and sample geographic maps through to assigning patterns and formats. In addition to this, the book will focus on macro programming and its advanced aspects.

By the end of this book, you will be well versed in SAS programming and have the skills you need to easily handle and manage your data-related problems in SAS.

Publication date:
September 2019
Publisher
Packt
Pages
346
ISBN
9781788839822

 

Introduction to SAS Programming

In this chapter, we will learn and master basic SAS programming techniques. For the uninitiated of you, this chapter should be a stepping stone to SAS programming. For experienced SAS programmers, this chapter will help you revise some behind the scenes functionalities and tricks of SAS. In either case, this chapter will lay the foundation for how good an advanced SAS programmer you can be. As we progress through this chapter, we will cover the following topics:

  • SAS dataset fundamentals
  • SAS programming language—basic syntax
  • SAS LOG
  • Dataset options
  • SAS operators
  • Formats
  • Subsetting datasets
 

SAS dataset fundamentals

The SAS dataset contains values that are organized as rows and columns that can be processed (read/written) by SAS. The dataset can be a data file (table) or view. Either way, a dataset is typically rectangular in format. The dataset has a descriptor portion and data portion. While in the following table, we can only see the column/variable names, the descriptor portion holds further information such as the number of rows (more commonly referred to as observations) in the dataset, date and time of creation, and the operating environment in which it was created. This section is called the data portion, which holds all the data values:

The maximum number of observations that can be counted for a SAS dataset is determined by the long integer data type size for the operating environment. In operating environments with a 32-bit long integer, the maximum number is 231-1 or approximately 2 billion observations (2,147,483,647). In operating environments with a 64-bit long integer, the maximum number is 263-1 or approximately 9.2 quintillion observations. Operating machines with a 32-bit long integer are likely to reach the maximum observation count of 2 billion observations in some real-world scenarios. However, the 64-big long integer machines are unlikely to ever reach the upper limit of observations permitted.

While dealing with SAS data, we are less concerned with the exact number of observations. It doesn't matter whether they are 5 million or 6 million observations. However, it should be much faster to query a 500-observation table compared to one with 5 million observations. The observations merely help in estimating the processing time. Throughout the book, we will learn about programming techniques that will help speed up processing. In this chapter, we will learn about compression.

The aspect more important than observations is the number of records per ID variable. In the cost of living table, we have 15 observations. Each observation is a record of a different city. In this case, the variable city has become an ID variable. In a transactional table of retail sales, you may have hundreds of records for each loyalty card. The multiple records may represent the basket of goods that have been purchased over a period of time. All the records would be linked to a single loyalty card number.

Data is seldom fully populated for each variable. For example, a data table constructed using responses from a questionnaire may have missing responses from a few respondents if the question that's being asked isn't mandatory. This information may not be available for each variable and such instances would be set to missing values in the table. A period (.) represents a missing numeric record, whereas a space (" ") represents a missing character record.

Please remember that a 0 value and missing values aren't the same.

Creating an SAS table

The task that SAS performs to create a table can be categorized into two phases:

  • Compile
  • Execute

The following flowchart shows us the detailed compilation and execution process in table creation:

Compile phase

The compile phase is one that is often not well understood by the users as this is the backend processing the output that was generated. Tasks within the compile phase include syntax check, input buffer (not created if reading an existing dataset), program data vector (PDV), and descriptor information:

  • Syntax check: In the syntax check task, SAS checks whether the code syntax is correct and then converts the programming statements into machine code to help execute the code. Only if the syntax is correct does SAS proceed with other tasks in the compile phase.
  • Input buffer: The input buffer is a logical area in memory in which SAS reads each record of data from a raw data file where the program executes. In the case when a dataset is created from another SAS dataset, an input buffer is not created.
  • PDV: This is a logical area of memory where SAS builds the dataset by writing each observation one at a time. Data is read from the input buffer. Values are assigned to the variables in the PDV. The values are written to the dataset as a single observation. There are two automatic variables created in PDV, namely _N_ and _ERROR_. Both these variables are not part of the output dataset that's created. The _N_ variable signifies the number of iterations of the data step. The _ERROR_ variable captures the number of instances in each data step when an error occurs.
  • Descriptor information: This contains information about the dataset and includes both the dataset and variable attributes.

Execution phase

In this phase, SAS writes the PDV values to the output dataset for the current observation. The values of the PDV are set to missing. If there are any more records to read, then the program goes back to the top of the data step and executes it again. The next observation is built and stored in the output dataset. This process goes on until there are no more records to read. After this, the dataset is then closed and SAS goes to the next DATA or PROC step (if available) in the program file.

Dataset creation example

Let's look at the steps in the compile and execution phase while creating the cost of living dataset we showcased in the preceding screenshot. We will run the following program to create the dataset:

DATA COST_LIVING;
INPUT City $12. Index Prev_yr_index Housing Food Travel Utility Education Leisure Other;
DATALINES;
Adelaide 85 83 35 10 10 9 14 10 12
Beijing 90 92 40 10 15 10 18 5 2
Copenhagen 65 64 25 15 10 10 12 12 16
Doha 56 50 30 15 5 10 10 20 10
Dubai 75 76 30 16 14 10 20 8 2
Dublin 45 43 30 10 8 12 10 15 15
Hong Kong 83 88 45 5 10 15 15 9 1
Johannesburg 35 40 45 5 5 15 15 10 5
Manila 41 42 25 10 15 15 20 10 5
Moscow 48 53 40 20 5 5 10 10 10
Mumbai 83 85 40 10 15 15 10 9 1
Munich 65 64 35 10 10 10 10 10 15
New York 89 85 40 10 15 10 20 5 5
Oslo 60 58 25 15 5 5 15 20 15
Paris 70 70 30 10 5 10 10 20 15
Seoul 73 75 30 10 10 10 15 15 10
Singapore 75 74 35 15 10 10 20 5 5
Tokyo 87 85 40 15 10 5 15 14 1
Zurich 63 61 30 10 10 15 10 10 15
;
RUN;

In its current form, the program will execute without errors. This is because the first phase of compile that checks for syntax errors will not come across any coding errors. For illustration purposes, we can try and remove the ; after the DATALINES command. The following error will be encountered when we try to run the modified code and no output table will be generated:

Let's review the steps in SAS processing for the preceding date creation program to understand how the PDV is generated. After the syntax check is done, the input buffer and the PDV are created. The PDV contains all the variables that are declared in the input statement. Initially, all the variable values are set to missing. The automatic _N_ and _ERROR_ variables are both set to 0:

Input Buffer
----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+-

PDV:

City

Index

Prev_yr_index

Housing

Food

Travel

Utility

Education

Leisure

Other

.

.

.

.

.

.

.

.

.

 

The City variable has been declared as a character variable and the rest of the variables are numeric. The missing character values are written as blanks and the missing numeric values are written as periods. After this stage, the data step executes and the data values are first assigned to the input buffer before being written to the PDV:

Input Buffer
----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+-
Adelaide 85 83 35 10 10 9 14 10 12

PDV:

City

Index

Prev_yr_index

Housing

Food

Travel

Utility

Education

Leisure

Other

Adelaide

85

83

35

10

10

9

14

10

12

 

At this point, SAS writes the data values in the PDV to the output dataset. The _N_ variable is set to 1 and _ERROR_ is set to 0. The PDV is set to missing values. Since we have more lines of data to read, the program will keep executing:

Input Buffer
----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+-
Adelaide 85 83 35 10 10 9 14 10 12

PDV:

City

Index

Prev_yr_index

Housing

Food

Travel

Utility

Education

Leisure

Other

 

.

.

.

.

.

.

.

.

.

 

For the second line of data, the following values in the input buffer and PDV will be written:

Input Buffer
----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+-
Beijing 90 92 40 10 15 10 18 5 2

PDV:

City

Index

Prev_yr_index

Housing

Food

Travel

Utility

Education

Leisure

Other

Beijing

90

92

40

10

15

10

18

5

2

 

The observations will now be written to the dataset and the variable _N_ will be incremented by 1 to take its value to 2. The _ERROR_ variable will again be reset to 0 as no errors have been encountered. This process will continue until the time the last data observations have been read by the program and sent to the output dataset.

 

SAS programming language – basic syntax

We used code in the first program to create an output dataset. The dataset was created by what is known as data steps. Data steps are a collection of statements that can help create, modify, and control the output. SAS also leverages Structured Query Language (SQL). Let's review the basic syntax for data steps and SQL within SAS. We will continue to explore more advanced versions of code throughout this book.

Data step

The following code represents one of the simplest forms of a data step:

DATA WORK.Air;
SET SASHELP.Air;
RUN;

Using the SET statement in this data step, we have specified the dataset that we want to refer to. There are no conditional statements in the program that are selecting a proportion of records from the Air dataset in the SASHELP library. As a result, all the contents of the dataset in the set statement will get copied over to the dataset specified in the data statement. In the set statement, it is necessary to specify a dataset. However, if you specify _LAST_ in the SET statement, the dataset that was last created in the SAS session will be used instead. Finally, the Run command is specified to execute the program. The only instance when the command isn't needed in a data statement is when creating a dataset using an INPUT statement (as shown in the following code block).

The use of DATA signifies that we are using the data step. In this statement, we specify the output dataset. WORK is what is known as a library in SAS. A library is like a Windows folder that stores files and various other things such as formats and catalogs. Every SAS session (each instance of a SAS software invocation is a separate session) is assigned its own temporary workspace. The temporary workspace is known as the Work library. At the end of the session, the temporary work session is cleared and unless saved in a permanent library, all the contents of the Work library are deleted. If the Work library is not specified, the dataset will be created in the temporary workspace by default.

A permanent library is one that can be assigned using a physical path. In BASE SAS, this can be a physical folder located in the computer drives. In the case of SAS Enterprise Guide Studio and other software, this may be a space on the server. The dataset name consists of two parts—the library name followed by the dataset name. Both are separated by a period. For the creation of datasets in the Work library, users only need to specify the dataset name. If no dataset name is specified, SAS names the dataset as D1, D2, and so on:

DATA;
INPUT Id;
DATALINES;
1
2
;
RUN;

Since this is the first program in our SAS session without a dataset name specified, the name D1 will be assigned to the dataset:

We will explore the data step options in further detail throughout this book.

Proc SQL

SAS leverages SQL through a built-in procedure. While we aren’t going to focus on SQL in this book, let's look at the basic structure of a SQL query in SAS:

PROC SQL;
CREATE TABLE Table_Name AS
SELECT
FROM
WHERE
GROUP BY;
QUIT;

PROC is the command that's used to specify a built-in procedure in SAS. In the case of the preceding program, we are referring to the SQL procedure. Just like in the data step, we start off by specifying the table name that we are creating. We then list the variables that we want to select from another dataset. The name of the dataset that has been selected is named in the FROM statement. The WHERE clause is used to sub-select the data. The GROUP BY clause is used for summary functions. Finally, we end the procedure by specifying the QUIT argument.

 

SAS LOG

The SAS LOG section of your coding environment is where all the actions performed that have been by the user in the current session are stored. These include instances of program submission and also messages about any programs that you might have terminated while they were executing. Apart from these, the SAS LOG also contains system-generated messages. These are of two types. The first instance is where the SAS version and a few other details about your system are written to the LOG. The second is when responses to the user code are generated and written to the LOG. The response could state that the program has run successfully, failed, or has some syntax issues that have been ignored. The responses are categorized in NOTE, INFO, WARNING, and ERROR categories. Program submission messages can be easily identified in the LOG as they have a line number associated with them.

Let's examine the LOG that's generated after running the first program from the preceding Data step section:

On the left-hand side of the LOG is the line number. Line number 1 contains the default settings that are in place for this SAS session. The log for our program starts getting generated in line 73. From line 73 to line 75, the program that has been specified in the program editor window is replicated in the log. There are no errors being produced in the log, unlike the one shown in the Data creation example section. The note that's produced mentions the number of observations read from the input dataset. It also contains the number of observations and variables in the output dataset. The time it took to execute the query is also mentioned at the end of the notes.

By reading the LOG, the user can review the program executed and notes, warnings, or errors produced, review the summary produced about the input and output dataset, and check query execution time. After the first run, the user may want to modify the program. This could be because the output is not in sync with the intended requirement or an error has been generated. In any case, understanding of the log is required before we can edit the program.

Naming conventions in SAS

Some of the frequently used functionalities in SAS where naming conventions need to be followed are variables, datasets, formats or informats that are user-created, arrays, labels, macro variables, library names, and file references.

The general rules for SAS names are as follows:

  • Variable names can be up to 32 characters, whereas some other names such as library names can be up to 8 characters.
  • SAS isn't case sensitive in name specification, unlike some other programming languages or other statistical packages.
  • The name cannot start with a number. It needs to start with a letter or an underscore. The second character can be a number or an underscore. No special characters apart from underscore are allowed in the name. Underscores are frequently used by programmers for variable names where multiple words are involved, for example, Order_Date, Payment_Date, or Delivery_Date. A variable name cannot contain a blank. Hence, the underscore becomes an important way to make the variable names more legible for users of your code and data.
  • In some instances of SAS names for filerefs, some special characters are allowed.
  • Some names are reserved for SAS functions and keywords that are used by the system. For instance, you cannot specify a library name that is the default SAS library associated with your SAS installation. These include SASHELP, SASUSER, and WORK.

SAS already has a macro variable called sysdate. Users shouldn't attempt to create a macro variable with the same name.

The maximum length of arrays, labels, variables, numeric formats, macros, and datasets is 32. Character formats have a length of 31. Character and numeric informats have a length of 30 and 31, respectively. File references and library names have a maximum length of 8.
 

Dataset options

There are many built-in SAS options that apply to the dataset. The broad purpose of these options is to help us do the following:

  • Rename variables
  • Select variables for subsetting
  • Retain select variables
  • Specify the password for a dataset, compress it, and encrypt it

Throughout this book, we will be looking at various dataset options. We will begin by exploring the compress, encrypt, and index options.

Compression

Compression reduces the number of bytes that are required to store each observation. The advantage of compression is that it requires less storage, owing to the fact that fewer bytes are required and fewer I/O operations are necessary to read or write to the data during processing.

The biggest disadvantage of compression is that more CPU resources are required to read a compressed file and there are situations where the resulting file size might increase rather than decrease the time that's required to execute SAS statements. This is one of the reasons why many SAS users end up not compressing datasets and miss out on the advantages. Users need to be aware that some SAS administrators are known to make compression the default option on a server level.

The COMPRESS=YES dataset helps compress a file. This option works with a SAS data file and not a view. This effects only the dataset specified in the output statement and not the dataset in the SET statement. The best way to uncompress a file is to create a new file. Alternatively, you can use COMPRESS=NO while recreating the dataset. Remember that the YES option remains activated for the rest of your SAS session unless the NO option is specified.

The benefit of compression can be gauged by looking at the SAS LOG after compression. A message is generated in the LOG stating the percentage reduction in size of the compressed file compared to the size that would have been in an uncompressed state:

DATA WORK.Air (COMPRESSION = YES);
SET SASHELP.Air;
RUN;

The following note is written to the SAS log, specifying that compression was disabled as it isn’t advantageous:

NOTE: Compression was disabled for data set WORK.AIR because compression overhead would increase the size of the data set.

Here, we know the dataset was produced without utilizing the compress option.

Encryption

While encrypting a dataset, you have to use the READ or the PW dataset option. If the password is lost, then only the SAS helpdesk can resolve the situation. The only other way to change the password is to recreate the dataset. We will attempt to encrypt the Air dataset using the following code block:

DATA WORK.Air (ENCRYPT = YES READ=CHAPTER2);
SET SASHELP.Air;
RUN;

PROC PRINT DATA = WORK.Air(READ=CHAPTER2);
RUN;

We will have to specify the password to be able to read the dataset using the print procedure.

Indexing

As the name suggests, indexing is a way to tell SAS how things have been arranged. Typically, the SAS dataset is stored in pages. There is no way for SAS to know on which page or sequence the information is stored unless the dataset is indexed. The index stores values in ascending value order for a specific variable or variables and includes information as to the location of those values within observations in the data file. In other words, an index allows you to locate an observation by value.

Indexes can be done on one variable or multiple variables. If done on multiple variables, they are called composite indexes. Indexes needed to be stored somewhere for the subsequent programs using the indexed data to be able to leverage them. They are stored as a separated data file and all indexes of a dataset are stored in a single file. Indexes aren’t automatically transferred to the dataset that is created using an indexed input dataset. However, if you add or delete observations in the indexed dataset, the index is automatically updated. The good thing about indexes is that they can be created at the stage of creating a dataset or at any time afterwards. The dataset doesn’t need to be compressed for indexing.

The pros of creating an index are as follows:

  • For WHERE processing, an index can provide faster and more efficient access to a subset of data.
Note that to process a WHERE expression, SAS by default decides whether to use an index or to read the data file sequentially. Please note that the IF condition never leverages the index.
  • For BY processing, an index returns observations in the index order, which is in ascending value order, without using the SORT procedure, even when the data file is not stored in that order.

The biggest disadvantage of index creation is the cost associated with it. Storage of the runtime and the time to create the index are big overheads. By default, indexes should never be created. These should only be created if the dataset in question is going to be queried repeatedly. Another clear instance when an index should be created is if its presence is significantly reducing the runtime associated with the analytical tasks the users are performing. The user should be prudent about which variables in the dataset need an index.

The syntax for index creation is the following:

INDEX CREATE index-specification-1 <...index-specification-n>
</ <NOMISS> <UNIQUE> <UPDATECENTILES= ALWAYS | NEVER | integer>>;

Arguments for the syntax are index-specification(s).

It can be one or both of the following forms:

  • variable: Creates a simple index on the specified variable
  • index=(variables): Creates a composite index

The optional arguments are the following:

  • NOMISS: Excludes all observations with missing values for all index variables from the index 
  • UNIQUE: Specifies that the combination of values of the index variables must be unique
  • UPDATECENTILES=ALWAYS | NEVER | integer: Specifies when centiles are to be updated

Let's create a simple index using the following code block:

PROC DATASETS LIBRARY=WORK;
MODIFY Cost_Living;
INDEX CREATE City;
RUN;

On running, the following messages are written to the log:

NOTE: Simple index City has been defined.
NOTE: MODIFY was successful for WORK.COST_LIVING.DATA.

To create a composite index on the City and Index variables, use the following statement in PROC DATASETS:

INDEX CREATE City_and_Index = (City Index);

The key difference in syntax between single and composite indexes is that you have to specify a unique name for the composite index.

 

SAS operators

An SAS operator is a symbol that represents a comparison, arithmetic calculation, or logical operation. We will look into the various operator in the following sections.

Arithmetic operators

For performing arithmetic operations, the following can be used in SAS:

Symbol

Definition

+

Addition

/

Division

**

Exponentiation

*

Multiplication

-

Subtraction

Comparison operators

Comparison operators set up a comparison, operation, or calculation with two variables, constants, or expressions. If the comparison is true, the result is 1. If the comparison is false, the result is 0. The following table show the symbol, mnemonic equivalent, and definition of each comparison operators in SAS:

Symbol

Mnemonic equivalent

Definition

=

EQ

Equal to

^= (or ¬=, ~=)

NE

Not equal to

>

GT

Greater than

<

LT

Less than

>=

GE

Greater than or equal

<=

LE

Less than or equal

 

IN

Equal to one of a list

Logical operators

Logical operators, also called Boolean operators, are usually used in expressions to link sequences of comparisons. The logical operators are shown in the following table:

Symbol

Mnemonic equivalent

&

AND

| (or !, ¦)

OR

¬ (or ˆ, ~)

NOT

 

Formats

There are usually three scenarios when formats are used in SAS:

  • To format the data available to make it more readable
  • To assign a specific format to make the data more meaningful
  • To alter the data type to make calculations, derivations, and so on

Formatting to make the data readable

SAS stores all dates as single unique numbers in a numeric format. All dates are stored as the number of dates from January 1, 1960. If you think of a number line, all dates prior to January 1, 1960 are negative and all dates after that are positive. It’s cumbersome to decode the numeric dates. An easier alternative is to get SAS to display a date using a date format. Let's look at our cost of living dataset with the addition of a new date variable:

DATA COST_LIVING;
INPUT City $12. Index Prev_yr_index Housing Food Travel Utility Education Leisure Other Updated MMDDYY6.;
DATALINES;
Adelaide 85 83 35 10 10 9 14 10 12 010118
Beijing 90 92 40 10 15 10 18 5 2 010118
Copenhagen 65 64 25 15 10 10 12 12 16 020118
Doha 56 50 30 15 5 10 10 20 10 030118
Dubai 75 76 30 16 14 10 20 8 2 040118
.
.
.
;

The input data is available in a month, date, and year format represented by six numbers. To ensure that SAS can store this date as a number, we need to specify what is known as an informat. Here, MMDDYY6. is an informat with the length of the variable as six digits:

In the preceding table, we can see that the date of index value in Updated for Adelaide is 21185. But is this the value that we get when we do January 1, 2018 – January 1, 1960?

A leap year occurs every 4 years, except for years that are divisible by 100 and not divisible by 400.

So, the mean length of the Gregorian calendar year is as follows:

1 mean year = (365+1/4-1/100+1/400) days = 365.2425 days

There are 58 years between the update for Adelaide and the start of the SAS date value. This gives us a number of 58 x 365.2425 = 21184.065. This is close to the value of 21185 we got. In fact, if we multiply the number of years (58) by the sidereal year (the time it takes for Earth to do a single rotation around the Sun) value of 365.25636, we get a value of 21184.86888. This value, when rounded up, gives us the numeric value of 21185 for the updated variable for the first observation.

If you had the date value of January 1, 1960, SAS will store it as 0. January 2, 1960, will be stored as 1. However, December 31, 1959 will be stored as -1. Please don’t get confused when you see negative date values. All you need to do is specify a date format to make the value readable.

We will now add the following Format statement between the INPUT and DATALINES arguments to the preceding code:

Format Updated Date9.;

This produces an easily readable date value. Remember to add a date format when dealing with dates. An input format is also required to ensure SAS correctly interprets the date:

Specifying a format to make it meaningful

Consider the example of a management information (MI) reporting analyst based in California. The analyst will have to extract information about the sales performance of each region from the central data warehouse to present a global sales performance MI report.

Should the analyst be expected to intuitively know in which currency the revenue of each region is stored in the warehouse? It would be helpful if formats were assigned to the revenue to signify whether the local or a commonly used currency (such as the US Dollar, Yen, Euro, or British Pound) has been used to store the regional revenue figures.

The following statement, when added between the DATA and SET statements, converts the numerical index value into dollars:

FORMAT Index dollar6.2;

In the preceding snippet, the .2 represents the two decimal spaces that have been specified. 6 is the length of the format, which includes the dollar symbol. If we had specified 5.2 as the length of the format, then the output value would have been in the format 85.00 instead of $85.00. The syntax of the format is w.d, where w is the width of the format and d is the number of decimal places required. Specifying the format allows for adequate space to write the number along with the decimal specification, a period, and a negative sign. The width of the format can take up any number from 1-32. The decimal specification should be less than the width. If you don't specify a decimal specification, then the value will be written without the decimal point by default.

The following is the output that's produced after specifying the dollar format:

The syntax for some of the other currency formats is as follows:

FORMAT Index Euro7.2;/*Euro*/
FORMAT Index NLMNLGBP9.2; /*British Pound*/
FORMAT Index NLMNLJPY9.2; /*Japanese Yen*/
FORMAT Index NLMNLAUD9.2; /*Australian Dollar*/
FORMAT Index NLMNLNZD9.2; /*New Zealand Dollar*/
/*This is a comment*/;
In SAS, comments are written between '/* */ ' or, at the start of a statement, a comment can be specified by prefixing an asterisk before a statement. If prefixing an asterisk, remember to end the sentence with a semicolon, otherwise else the next statement will also be commented by default and will not execute.

Altering the data type

While processing data, you will frequently need to change data types. Even if the data is of good quality, where numbers have been stored as numerical and all date fields have been specified with the correct date format, transformations or derivations from them may be required to perform analysis.

Some of the commonly used transformations and derivations for a data analyst where the data type is altered are the following:

  • Converting from numerical into character
  • Converting from character into numerical
  • Using a date or datetime format
  • Extracting information from data

To convert a number into a character, use the PUT function. The syntax is as follows:

Character_variable = put(numeric_variable, informat.);

To convert the index variable in the cost of living dataset into a character, use the following code:

DATA Num_to_Char;
SET Cost_Living;
Index_char = PUT(Index, 3.);
RUN;

First, we converted a variable from numeric into character format and then added a new variable to the dataset.

To convert from character to numeric, we use the INPUT function. The syntax is as follows:

Numeric_variable = input(character_variable, informat.);

Let’s run the following code to create a dataset that will help us understand the character into numeric conversion:

DATA Convert;
INPUT Id_Char $4. Turnover $7. Turnover_w_Currency $8. Source_Mixed $3.;
DATALINES;
0001 20,000 $20,000 A1
0002 10,000 $10,000 2
;

This gives us the following dataset:

To convert the Id_char variable to numeric, we can use the following statement in a data step:

Id_Num = INPUT(Id_Char, 5.);

This will remove the leading zeros and give us the values 1 and 2 for the variable. But what if we wanted to preserve the leading zeros? In that case, we should use the following code:

Id_Num_Leading_Zero = INPUT(Id_Char, $5.);

To convert the Turnover_w_currency variable to numeric data type, we need to use the correct informat:

Turnover_Num = INPUT(Turnover_w_currency, dollar8.);

The Source_mixed field has alphanumeric data. When we use the following statement to convert it into a numeric variable, we get missing data for observation 1. Users should be careful while deploying numeric conversion using automated scripts. Missing data can be deemed acceptable in some circumstances. However, losing the alphanumeric data might negatively impact data quality in some instances:

Source_numeric = INPUT(source_mixed, 3.);

We have looked at date values and how SAS interprets them as numbers. The date variable with time information is also stored as a number in SAS. The informat and format used for interpretation are different from those used for a date variable. We will try and understand handling date-time values using the following example:

DATA DateTime;
INPUT Id Date_Time Datetime20.;
DATALINES;
1 01aug19:09:10:05.2
2 01aug20:19:20:10.4
;

DATA Convert_DateTime;
SET DateTime;
FORMAT Orig_Date Datetime.;
Orig_Date = Date_Time;
FORMAT Orig_Date_1 Datetime7.;
Orig_Date_1 = Date_Time;
FORMAT Orig_Date_2 Datetime12.;
Orig_Date_2 = Date_Time;
RUN;

The datetime values must be in the following form: ddmmmyy or ddmmmyyyy, followed by a blank or special character, followed by hh:mm:ss.ss (the time), where hh is the number of hours ranging from 00 through 23, mm is the number of minutes ranging from 00 through 59, and ss.ss is the number of seconds ranging from 00 through 59 with the fraction of a second following the decimal point. Only the ss.ss portion is optional in the datetime value.

We get the following output when we create the DateTime table using informats:

Without the time variable, the values for IDs 1 and 2 (dates August 01, 2019 and August 01, 2020) will be stored in SAS as 21762 and 22128. The addition of the time component to the date changes the value of the number stored in SAS significantly.

To make the datetime information stored in SAS readable, we used different formats in DATA Convert_DateTime and got the following output:

For date and time values, we have looked at how to read and output them in SAS. We have also managed to format these dates in a way that we only see them in a manner relevant to us. On occasions, we might just want to extract some information from the date or the datetime variable. This information could be stored in a non-date/datetime variable such as the numeric variable.

We can use the following statements to extract the Year, Month, and Date values from a Date variable with an informat of mmddyy:

Year = YEAR (Date);
Month = MONTH (Date);
Day = DAY (Date);

For the date August 1, 2019, we will get the year as 2019, month as 8, and day as 1 when we use the preceding statements in a data step. The Year, Month, and Day variables will be created as numeric variables.

If we have a Date_Time variable, we need to extract the date part from it before converting it into a Year, Month, and Date variable. We can use the following statements:

Year = YEAR (DATEPART(Date_Time));
Month = MONTH (DATEPART(Date_Time));
Day = DAY (DATEPART(Date_Time));

For the datetime value 01aug19:09:10:05.2, we will get the values 2019, 8, and 1 for the year, month, and day variables respectively.

 

Subsetting datasets

In many instances, we will need to select only a proportion of the overall dataset. This selection or subsetting of the dataset can be done in three ways:

  • Use of the WHERE and IF statements
  • Using the SAS dataset OPTIONS
  • Using the DROP and KEEP statements

WHERE and IF statements

Let's only select the data where the records have been updated in 2019:

DATA Updated_2019;
SET Cost_Living;
WHERE Year(Updated) = 2019;
RUN;

The following output will be generated:

In this instance, we could have also used the IF statement to generate similar output. We can also use WHERE and IF statements in the same data step:

DATA Updated_2019;
SET Cost_Living;
WHERE Year(Updated) = 2019;
IF Index >= 80;
RUN;

This further subsets the data and produces the following output:

While subsetting, we should be aware of whether the WHERE or IF statement will be applicable in a data step. The WHERE statement executes at the PDV level, that is, only the selected/subset data will be read in the PDV. This can considerably reduce the time to execute a query. The IF statement can only be executed after all the data has been read into the PDV. The WHERE statement should be preferred over the IF statement if the dataset being read is indexed. The WHERE statement allows SAS to directly retrieve the specified value from the indexed table.

Because the WHERE statement executes at the PDV level, it cannot leverage any new variables that are created in the data step. The following code block is an example of the incorrect use of the WHERE statement:

DATA Known_Components;
SET Index;
Known_Component_Index = Index–Other;
WHERE Known_Component_Index >= 80;
RUN;

This produces the following error in the Log:

We can instead replace the WHERE statement with the following IF statement:

IF Known_Component_Index >= 80;

The program executes without any errors and the output dataset is created with five observations, where known_component_index is greater than or equal to 80.

Using OPTIONS

We can restrict the number of records being read using the OBS and the FIRSTOBS options:

PROC PRINT DATA = Cost_Living (FIRSTOBS=4 OBS=5);
RUN;

This will give us the following output:

The option can also be specified outside the data step:

OPTIONS OBS = 2;

Using OPTIONS in SAS is effective for subsetting but only when the criterion is the number of records and when the starting position is observations is known or doesn’t matter.

DROP or KEEP options

There will be countless occasions when a user will need to use the DROP or KEEP options. These options help retain the records that are needed in the output dataset. Using certain programming statements can help reduce the execution time of the program by subsetting the records from the input dataset. Let's look at the following example:

DATA Keep_and_Drop (DROP = Prev_yr_index);
SET Cost_Living (KEEP = City Index Prev_yr_index);
WHERE Index < Prev_yr_index;
RUN;

This produces the following output:

The KEEP option helped restrict the input dataset being read to include only the City, Index, and Prev_yr_index variables. After this, the WHERE statement was executed. Finally, using the DROP statement, we removed the Prev_yr_index variable from the output dataset. Another way to use the KEEP option can be as a separate statement within a data step:

KEEP City Index Prev_yr_index;

Viewing properties

The content procedure shows the details of the datasets and prints the directory of the SAS library. The following is the basic form of the contents procedure:

PROC DATASETS Library=Work;
CONTENTS DATA=Cost_Living;
RUN;

It produces a fairly descriptive output:

The DATASETS procedure produces details of only the dataset specified in the program:

The engine/host-dependent information can be leveraged for optimizing the ability to store a dataset. For our purposes, this information isn’t relevant as we are dealing with a small dataset:

The list of variables provides the type of the variable, its length, and its format information:

 

Dictionary tables

Some SAS software packages require minimal coding experience to get your work done. However, once you want to do anything advanced, it gets restrictive to use just the predefined options available in the package. At that moment, most users tend to write code. Chances are that if you want to go beyond what SAS already offers as predefined procedures and functionalities, you want to write advanced programs/macros. While we will delve into macros later on in this book, there is an aspect of SAS that is being leveraged less by some users. These are the dictionaries. But what are they and how can they be useful?

Some reasons for using dictionary tables are as follows:

  • They hold all the information about SAS libraries, datasets, macros, and external files that are being used in the SAS session.
  • It is a read-only view, so there are no chances of it being compromised.
  • You can get the most up-to-date information about your session as every time you access the table, SAS determines the current state and shares the most up-to-date information.
  • The information in the dictionary tables can help in writing advanced code or can be used as a basis to select observations, join tables, and so on.

The dictionary tables can be accessed easily using PROC SQL. They can be also accessed via the data steps but that entails referring to the PROC SQL view of the table in the SASHELP library. While we aren’t going to focus on PROC SQL in the book, for ease of demonstrating the power of dictionary tables, I am going to use SQL.

In case you have doubts about using dictionary tables to view information about your own SAS session, think again. SAS sessions can get complex quickly. You may have imported files, thousands of lines of codes, or tens of process nodes in a SAS Enterprise Guide sort of package. It all stacks up pretty quickly and this is where dictionary tables come in handy. If you want to build any application on the back of your SAS session, the information that's held in these tables is useful.

The first step is to understand the structure of the indexes that are created in the dictionary tables. Run the following command:

PROC SQL;
Describe Table Dictionary.Indexes;

We get the following notes in the logs:

NOTE: SQL table DICTIONARY.INDEXES was created like:

create table DICTIONARY.INDEXES
(
libname char(8) label='Library Name',
memname char(32) label='Member Name',
memtype char(8) label='Member Type',
name char(32) label='Column Name',
idxusage char(9) label='Column Index Type',
indxname char(32) label='Index Name',
indxpos num label='Position of Column in
Concatenated Key',
nomiss char(3) label='Nomiss Option',
unique char(3) label='Unique Option',
diagnostic char(256) label='Diagnostic Message from File Open Attempt'
);

Understanding the preceding information is critical if we want to be able to utilize dictionary tables. The notes are in the following format:

  • The first word on each line is the column (or variable) name, that is, the name that you need to use when writing a SAS statement that refers to the column (or variable).
  • Following the column name is the specification for the type of variable and the width of the column.
  • The name that follows label= is the column (or variable) label.

Now, we will run a dictionary table query using the given notes. The following query has been run on a new SAS session that has no imported files, programs that have been run, or indexes that have been created:

PROC SQL;
Select * From Dictionary.Tables;
QUIT;

Let's look at the output:

The preceding table is a snapshot of the dictionary table output.

We got hundreds of rows of output when we requested the contents of dictionary tables for our SAS session. This is because there is a lot of information stored in the SAS backend about the SAS help files and settings even before any code is run in the session. Let's leverage the information we have so far: we've learned about the structure of the dictionary tables to reduce the information we want to see. We will now run the following code in the session after creating the COST_LIVING dataset:

PROC SQL;
Select * From Dictionary.Tables
Where Libname eq "WORK"
And Memname eq “COST_LIVING”;

Select * From Dictionary.Columns
Where Libname eq "WORK"
And Memname eq “COST_LIVING”;
QUIT;

We get the following output, which is a snapshot of the Work library:

The column information provides a lot of details about the variable type, length, the column number in the table, formats, index, and so on.

The following are the most commonly used dictionary tables and their purpose. For a full list of dictionary tables and views and their use, please refer to the SAS installation documentation:

Dictionary table

Purpose

COLUMNS

Information related to columns.

FORMATS

Lists all the formats defined.

INDEXES

Informs whether indexes have been created.

LIBNAMES

Lists all the library names. Usually, this is the first port of call if you are unaware of the various libraries.

MACROS

Lists any defined macros.

OPTIONS

All SAS system options in place are listed.

TABLES

Only information on currently defined tables is provided.

TITLES

Currently defined titles and footnotes are listed.

VIEWS

Currently defined data views are listed.

Role of _ALL_ and _IN_

The _ALL_ functionality can be used for notes in the SAS log or to call all the variables that have been specified in the data step. Let's illustrate the first use by modifying the dataset we used while understanding the usage of WHERE and IF:

Data Updated_2019;
Set Cost_Living;
Where Year(Updated) = 2019;
If Index >= 80;
Put _All_;
Run;

This will give us the following output:

City=Mumbai Index=83 Prev_yr_index=85 Housing=40 Food=10 Travel=15 Utility=15 Education=10 Leisure=9 Other=1 Updated=02JAN2019
_ERROR_=0 _N_=3
City=New York Index=89 Prev_yr_index=85 Housing=40 Food=10
Travel=15 Utility=10 Education=20 Leisure=5 Other=5 Updated=02JAN2019
_ERROR_=0 _N_=5
City=Tokyo Index=87 Prev_yr_index=85 Housing=40 Food=15 Travel=10 Utility=5 Education=15 Leisure=14 Other=1 Updated=02JAN2019
_ERROR_=0 _N_=10
NOTE: There were 11 observations read from the data set WORK.COST_LIVING.
WHERE YEAR(Updated)=2019;
NOTE: The data set WORK.UPDATED_2019 has 3 observations and 11 variables.

We can see that this particular usage of _All_ has led to all the variables in the dataset to be listed.

Use the following code block to list the variable names in the PROC PRINT data step; we just asked for all the variables to be printed using _ALL_:

PROC PRINT DATA = Updated_2019;
VAR Updated;
RUN;

Title '_ALL_ in a Data Step';
PROC PRINT DATA = Updated_2019;
VAR _ALL_;
RUN;

This will give us the following output:

As we know, _N_ is initially set to 1. Each time the DATA step loops past the DATA statement, the _N_ variable increments by 1. The value of _N_ represents the number of times the DATA step has iterated. Let's look at selecting observations based on _N_:

Data Test;
Set Updated_2019;
If 1 < _N_ <10;
Run;

This will give us the following output:

Using the _N_ automatic counter, we have been able to exclude the first observation and print only the remaining observations. If you compare this with the earlier output of the table, you will see that the observation where the city is Mumbai has not been output.

But be careful when using _N_. I will demonstrate this by trying to create a variable based on _N_:

Data Automatic;
Input A $ B;
Counter = _N_;
Datalines;
X 1
Y 2
Z 3
;
Run;

Data Automatic_Challenge;
Input A $ B;
Retain Counter 2;
_N_ = Counter+1;
Test_N = _N_;
Datalines;
X 1
Y 2
Z 3
;
Run;

This will give us the following output:

In the first program, we can see that the automatic variable _N_ has a value between 1 and 3 and this corresponds to the observation count. In the second program, we have used the RETAIN function to assign the value 2 to the variable counter. We have created a derived variable called _N_. When we assign a new variable and try to point it to the automatic variable _N_, we get the value of the derived variable. We have, in this instance, confused SAS about which _N_ we are referring to, that is, the automatic variable or the derived one. Please don't assume that just because _N_ is an automatic variable if you use _N_ without caution, then SAS will still understand that you want the automatic variable's value to be output.

 

Summary

In this chapter, we covered the basics of SAS dataset fundamentals and how to create a table, compile it, and then execute it. We also looked into the basic syntax of the SAS programming language. Then, we learned how to compress, encrypt, and index a dataset. Finally, we learned the various operators in SAS, and how to format and subset the dataset.

In the next chapter, we will learn how to manipulate, sanitize, and transform data.

About the Author

  • Harish Gulati

    Harish Gulati is a consultant, analyst, modeler, and trainer based in London. He has 16 years of financial, consulting, and project management experience across leading banks, management consultancies, and media hubs. He enjoys demystifying his complex line of work in his spare time. This has led him to be an author and orator at analytical forums. His published books include SAS for Finance by Packt and Role of a Data Analyst, published by the British Chartered Institute of IT (BCS). He has an MBA in brand communications and a degree in psychology.

    Browse publications by this author
Book Title
Unlock this book and the full library for only $5/m
Access now