
SAS for R Users

A Book for Data Scientists


Ajay Ohri





Delhi, IN









This book is dedicated to my students and my family, my son Kush Ohri, members of my church, and my God Jesus Christ.

Preface

I would like to thank the SAS Institute and its employees for their generosity in providing SAS On Demand for Academics for free; without them this book would not exist. In addition, I want to thank the baristas from Starbucks Gurgaon. These are the people who downvote my questions on Stack Overflow. You inspire me, guys.

SAS for R Users is aimed at entry‐level data scientists. It is not aimed at researchers in academia, nor is it aimed at high‐end data scientists working on Big Data, deep learning, or machine learning. In short, it is merely aimed at humans learning business analytics (or data science, as it is now called).

Both SAS and R are widely used languages, yet they are very different. SAS is a programming language designed in the 1960s that is broadly divided into Data Steps and a wide variety of Procedure (PROC) steps, while R is an object‐oriented, mostly functional language designed in the 1990s.

There are many, many books covering either but only very few books covering both.

Why then write the book? After all, I have written two books on R, and one on Python for R users. The SAS language remains the most widely used language in enterprises, contributing directly to the brand name and profitability of one of the largest private software companies, which invests hugely in its own research instead of borrowing research in the name of open source. A statistics student who knows Python (especially machine learning), R, SAS, Big Data (especially Spark ML), and data visualization (using Tableau) is a mythical unicorn unavailable to recruiters, who often have to settle for candidates with a few of these skills and then train them in house.

As a teacher, I want my students to have jobs – there is no ideological tilt to open source or any company here. The probability of students getting jobs from campus greatly increases if they know BOTH SAS and R not just one of them. That is why this book has been written.

Scope

This book is designed for professionals and students; people who want to enter data science and who have a coding background with some basics of statistical information. It is not aimed at researchers or people who like giraffes and do not read the book from the beginning.

1
About SAS and R

Here is a brief introduction to R and SAS, instructions for installing them, and a broad high‐level comparison.

1.1 About SAS

SAS, which used to stand for Statistical Analysis System, is a software suite developed by the SAS Institute for advanced analytics, business intelligence, data management, and predictive analytics. It was developed at North Carolina State University from 1966 until 1976, when the SAS Institute was incorporated. It was then further developed in the 1980s and 1990s with additional statistical procedures and components. SAS is a language, a software suite, and a company created by Anthony James Barr and James Goodnight along with two others. For the purposes of this book, we will use SAS to mean the SAS computer language.

  • SAS also provides a graphical point and click user interface for non‐technical users.

While a graduate student in statistics at North Carolina State University, James Goodnight wrote a computer program for analyzing agricultural data. After a few years, James's application had attracted a diverse and loyal following among its users, and the program's data management and reporting capabilities had expanded beyond James's original intentions.

In 1976, he decided to work at developing and marketing his product on a full‐time basis, and the SAS Institute was founded. Since its beginning, a distinguishing feature of the company has been its attentiveness to users of the software. Today, the SAS Institute is the world's largest privately‐held software company, and Dr. James Goodnight is its CEO. He continues to be actively involved as a developer of SAS System software as well as being one of the most widely respected CEOs in the community.

The SAS System has more than 200 components, including:

  • Base SAS – Basic procedures and data management
  • SAS/STAT – Statistical analysis
  • SAS/GRAPH – Graphics and presentation
  • SAS/OR – Operations research
  • SAS/ETS – Econometrics and Time Series Analysis
  • SAS/IML – Interactive matrix language

The SAS University Edition includes the SAS products Base SAS®, SAS/STAT®, SAS/IML®, SAS/ACCESS® Interface to PC Files, and SAS Studio. SAS charges an annual license fee, and almost 98% of customers return to SAS every year, voting with their chequebooks. All these products are Copyright © SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513, USA. (https://decisionstats.com/2009/08/20/the-top-decisionstats-articles-part-1-analytics/ and https://en.wikipedia.org/wiki/SAS_(software))

1.1.1 Installation

While SAS software for enterprises is sold under an annual license, students, researchers, and learners can choose between the SAS University Edition (a virtual machine) at https://www.sas.com/en_in/software/university-edition.html and SAS on Demand at https://odamid.oda.sas.com/SASLogon/login (software as a service that runs SAS in the browser).

To install the SAS University Edition in your virtual machine software, you can follow these steps (I am using VMware Workstation for this):

  • Start your virtual machine software and click on File.
  • Click Open and select the SAS University Edition file (the extension of the file should be .ova). You can provide a new name and storage path for your new virtual machine and then import it.
  • Now run the virtual machine and use the link shown in the VM to connect to the SAS University Edition in your browser.

1.2 About R

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. R was initially written by Robert Gentleman and Ross Ihaka.

1.2.1 The R Environment

According to https://www.r-project.org/about.html, R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes:

  • an effective data handling and storage facility,
  • a suite of operators for calculations on arrays, in particular matrices,
  • a large, coherent, integrated collection of intermediate tools for data analysis,
  • graphical facilities for data analysis and display either on‐screen or on hardcopy, and
  • a well‐developed, simple and effective programming language which includes conditionals, loops, user‐defined recursive functions and input and output facilities.

There are 14 000+ packages in R (https://www.rdocumentation.org). You can also look at specific views of packages at https://cran.r-project.org/web/views (a task view is a bundle or cluster of packages with similar usage, e.g. econometrics). For computationally‐intensive tasks, C, C++ and Fortran code can be linked and called at run time. Advanced users can write C code to manipulate R objects directly.

1.2.2 Installation of R

You can download and install R from https://www.r-project.org (or specifically from https://cloud.r-project.org for your operating system). You can then download and install the RStudio IDE from https://www.rstudio.com/products/rstudio/download/#download. Lastly, you can install any of 12 000+ packages (see https://cran.r-project.org/web/views and https://www.rdocumentation.org) using install.packages("PACKAGENAME") from within R. These packages are downloaded from CRAN (the Comprehensive R Archive Network).
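As a minimal sketch of the install.packages() step, the helper below installs a package only when it is not already available. The function name install_if_missing and the choice of the cloud CRAN mirror as the default repository are assumptions made for illustration; they are not part of R itself.

```r
# Hypothetical helper (not part of base R): install a package only if missing.
install_if_missing <- function(pkg, repos = "https://cloud.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)
  }
  # Report (invisibly) whether the package can now be loaded.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call does not download anything.
install_if_missing("stats")
```

Calling install_if_missing("PACKAGENAME") with the name of a CRAN package would download it on first use and do nothing on later calls.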

According to https://www.datacamp.com/community/tutorials/r-packages-guide, R packages are collections of functions and datasets developed by the community. They increase the power of R either by improving existing base R functionalities or by adding new ones. For example, you can use the sqldf package to run SQL within R and the RODBC package to connect to RDBMS databases.
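The sqldf idea can be sketched as follows, assuming the sqldf package has already been installed with install.packages("sqldf"). The data frame and its values are made up for illustration, and the query only runs when the package is available.

```r
# Made-up data for illustration only.
ajay <- data.frame(
  var1 = c("a", "b", "a", "b"),
  var2 = c(10, 20, 30, 40)
)

# Assumption: sqldf is installed. If it is not, the query is simply skipped.
if (requireNamespace("sqldf", quietly = TRUE)) {
  library(sqldf)
  # sqldf() lets an SQL query refer to a data frame as if it were a table.
  result <- sqldf(
    "SELECT var1, SUM(var2) AS total FROM ajay GROUP BY var1 ORDER BY var1"
  )
  print(result)
}
```

This plays roughly the role that PROC SQL plays in SAS: the query groups the rows by var1 and sums var2 within each group.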

In addition, an excellent resource for R users who want to learn SAS is the course offered by the SAS Institute itself:

https://support.sas.com/edu/schedules.html?ctry=us&crs=SP4R

The e‐learning course is free as of October 2018. The course teaches the following:

  • how to read and write SAS programs
  • import various forms of data
  • subset and merge data tables including using SQL (by the Proc SQL procedure)
  • carry out iterative processing and simulate new data
  • create new variables and functions
  • create and enhance plots of all types
  • apply descriptive and inferential procedures, including regression, logistic regression, analysis of variance, stepwise model selection, and mixed models
  • conduct matrix algebra and statistical simulations in the interactive matrix language (IML)
  • call R from SAS to use as a complementary resource.

1.3 Notable Points in SAS and R Languages

  1. Each statement in SAS ends with a semicolon (;). R has no such requirement.
  2. SAS is case insensitive – ozone and OZone are the same thing. R is case sensitive.
  3. In SAS, comments are placed within /* */ (press Ctrl + /). In R, comments follow #.
  4. SAS has two kinds of steps:
    1. The Data Step, which deals with input and manipulation of data, for example:
      data ajay;
      set input;
      run;

    2. The Proc Step, which covers the procedural steps for analysis and output.

    R has functions, and packages that bundle together functions for similar tasks.

  5. SAS needs a license for extra functionality (e.g. for Time Series you need a SAS/ETS license) while R is free and extensible (e.g. the forecast package for Time Series).
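The R side of these points can be sketched in a few lines. The object names and values below are made up for illustration, and the data frame named input is only a stand-in for the SAS data set of the same name in the Data Step above.

```r
# Comments in R start with a hash sign; statements need no trailing semicolon.
ozone <- c(41, 36, 12)   # illustrative values only

# R is case sensitive: 'ozone' exists, 'OZone' is a different (undefined) name.
lower_exists <- exists("ozone")
mixed_exists <- exists("OZone")

# Rough R counterpart of the Data Step above (data ajay; set input; run;):
# copy one data frame into a new one. 'input' is a made-up data frame.
input <- data.frame(id = 1:3, ozone = ozone)
ajay <- input
```

In SAS the two spellings would refer to the same variable, whereas here only the lowercase name has been defined.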

1.4 Some Important Functions with Comparative Comparisons Respectively

A Proc by Proc comparison in SAS language with R language functions is shown below. It will be explained in greater detail in later chapters. Some people consider R's smaller syntax helpful in coding while others consider SAS to be easier to learn and focus on analysis instead.

Function | SAS | R
--- | --- | ---
Import data | proc import | read_csv (readr package)
Print data | proc print data=ajay; run; | ajay
Structure of data object | proc contents data=ajay; run; | str(ajay)
Frequency of categorical variables (cross tabulation) | proc freq data=ajay; tables var1*var2; run; | table(ajay$var1, ajay$var2)
Analysis of numerical variables | proc means data=ajay; var var1 var2; run; | summary(ajay[, c("var1", "var2")])
Analysis of numerical variables grouped by another variable | proc means data=ajay; var var1 var2; class grp1; run; | library(Hmisc); summarize(ajay$var1, ajay$grp1, summary)
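To try the R side of this comparison, here is a minimal sketch on a small made-up data frame. All values are invented for illustration, and the grouped summary uses base R's aggregate() rather than Hmisc so that no extra package is needed; that substitution is a choice made for this sketch.

```r
# Toy data standing in for the 'ajay' data set; all values are made up.
ajay <- data.frame(
  var1 = c(1.5, 2.0, 3.5, 4.0, 5.0, 6.0),
  var2 = c(10, 20, 30, 40, 50, 60),
  grp1 = c("x", "x", "x", "y", "y", "y")
)

print(ajay)   # like PROC PRINT (typing ajay at the console does the same)
str(ajay)     # like PROC CONTENTS

# Like PROC FREQ with TABLES a*b: cross-tabulate group against a condition.
freq <- table(ajay$grp1, ajay$var2 > 25)

# Like PROC MEANS with a VAR statement: summarise the numeric columns.
overall <- summary(ajay[, c("var1", "var2")])

# Like PROC MEANS with a CLASS statement: means of var1 and var2 per group.
by_group <- aggregate(cbind(var1, var2) ~ grp1, data = ajay, FUN = mean)
print(by_group)
```

The by_group result has one row per level of grp1, which is the closest base R analogue here to the grouped PROC MEANS output.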

1.5 Summary

In this chapter we have introduced R and SAS languages, and briefly compared their main functions/syntax.

1.6 Quiz Questions

  1. Who is the CEO of SAS?
  2. When was SAS founded?
  3. Where was SAS founded?
  4. Who designed R?
  5. When was R founded?
  6. Where was R founded?
  7. Which of the two languages has better documentation and customer support?
  8. TASK: Suppose you know SQL. Can you identify functions or packages you can use in SAS and R respectively to run SQL commands?

Quiz Answers

  1. JAMES GOODNIGHT
  2. 1976
  3. NORTH CAROLINA STATE UNIVERSITY
  4. ROSS IHAKA AND ROBERT GENTLEMAN
  5. 1993
  6. UNIVERSITY OF AUCKLAND
  7. SAS
  8. IN SAS: use PROC SQL, IN R: use the sqldf package.