This edition first published 2020
© 2020 John Wiley & Sons, Inc.
The right of Ajay Ohri to be identified as the author of this work has been asserted in accordance with law.
Registered Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
Editorial Office
111 River Street, Hoboken, NJ 07030, USA
Library of Congress Cataloging‐in‐Publication Data
Names: Ohri, (Ajay), author.
Title: SAS for R users : a book for data scientists / Ajay Ohri.
Description: First edition. | Hoboken, NJ : John Wiley & Sons, Inc., 2020. | Includes bibliographical references and index.
Identifiers: LCCN 2019021408 (print) | ISBN 9781119256410 (pbk.)
Subjects: LCSH: SAS (Computer program language) | R (Computer program language) | Statistics–Data processing.
Classification: LCC QA76.73.S27 O44 2020 (print) | LCC QA76.73.S27 (ebook) | DDC 005.5/5–dc23
LC record available at https://lccn.loc.gov/2019021408
LC ebook record available at https://lccn.loc.gov/2019980765
Cover Design: Wiley
Cover Image: © DmitriyRazinkov/Shutterstock
This book is dedicated to my students and my family, my son Kush Ohri, members of my church, and my God Jesus Christ.
I would like to thank the SAS Institute and its employees for generously providing SAS OnDemand for Academics for free; without them this book would not exist. In addition, I also want to thank the baristas from Starbucks Gurgaon. These are the people who downvote my questions on Stack Overflow. You inspire me, guys.
SAS for R users is aimed at entry‐level data scientists. It is not aimed at researchers in academia, nor is it aimed at high‐end data scientists working on Big Data, deep learning, or machine learning. In short, it is merely aimed at human learning business analytics (or data science as it is now called).
Both SAS and R are widely used languages, yet they are very different. SAS is a programming language designed in the 1960s; its programs are broadly divided into DATA steps and a wide variety of procedure (PROC) steps. R is an object‐oriented, mostly functional language designed in the 1990s.
There are many books covering one or the other, but very few covering both.
Why then write this book? After all, I have already written two books on R and one on Python for R users. The SAS language remains the most widely used language in enterprises, contributing directly to the brand name and profitability of one of the largest private software companies, one that invests heavily in its own research instead of borrowing research in the name of open source. A statistics student who knows Python (especially machine learning), R, SAS, Big Data (especially Spark ML), and data visualization (using Tableau) is a mythical unicorn, unavailable to recruiters, who often have to settle for a candidate with a few of these skills and then train them in‐house.
As a teacher, I want my students to have jobs – there is no ideological tilt to open source or any company here. The probability of students getting jobs from campus greatly increases if they know BOTH SAS and R not just one of them. That is why this book has been written.
This book is designed for professionals and students: people who want to enter data science and who have a coding background and some basic knowledge of statistics. It is not aimed at researchers or people who like giraffes and do not read the book from the beginning.
Here is a brief introduction to R and SAS, instructions for installation, and a broad high‐level comparison.
SAS, originally short for Statistical Analysis System, is a software suite developed by the SAS Institute for advanced analytics, business intelligence, data management, and predictive analytics. It was developed at North Carolina State University from 1966 until 1976, when the SAS Institute was incorporated. It was then further developed in the 1980s and 1990s with the addition of statistical procedures and components. SAS is a language, a software suite, and a company created by Anthony James Barr and James Goodnight along with two others. For the purposes of this book, we will use SAS to mean the SAS computer language.
While a graduate student in statistics at North Carolina State University, James Goodnight wrote a computer program for analyzing agricultural data. After a few years, James's application had attracted a diverse and loyal following among its users, and the program's data management and reporting capabilities had expanded beyond James's original intentions.
In 1976, he decided to work at developing and marketing his product on a full‐time basis, and the SAS Institute was founded. Since its beginning, a distinguishing feature of the company has been its attentiveness to users of the software. Today, the SAS Institute is the world's largest privately‐held software company, and Dr. James Goodnight is its CEO. He continues to be actively involved as a developer of SAS System software as well as being one of the most widely respected CEOs in the community.
The SAS System has more than 200 components.
The SAS University Edition includes the SAS products Base SAS®, SAS/STAT®, SAS/IML®, SAS/ACCESS® Interface to PC Files, and SAS Studio. SAS charges an annual license fee, and almost 98% of customers return to SAS every year, voting with their chequebooks. All these products are Copyright © SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513, USA. (https://decisionstats.com/2009/08/20/the-top-decisionstats-articles-part-1-analytics/ and https://en.wikipedia.org/wiki/SAS_(software))
SAS software for enterprises is sold under an annual license. Students, researchers, and learners can instead choose between the SAS University Edition (a virtual machine) at https://www.sas.com/en_in/software/university-edition.html and SAS OnDemand for Academics at https://odamid.oda.sas.com/SASLogon/login (software as a service that runs SAS in the browser).
To install the SAS University Edition on your virtual machine, you can follow these steps (I am using VMware Workstation for this):
R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. R was initially written by Robert Gentleman and Ross Ihaka.
According to https://www.r-project.org/about.html, R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes:
There are 14 000+ packages in R (https://www.rdocumentation.org). You can also browse task views at https://cran.r-project.org/web/views; a task view is a bundle or cluster of packages with similar usage, e.g. econometrics. For computationally‐intensive tasks, C, C++ and Fortran code can be linked and called at run time. Advanced users can write C code to manipulate R objects directly.
You can download and install R from https://www.r-project.org (or specifically from https://cloud.r-project.org for your operating system). You can then download and install the IDE RStudio from https://www.rstudio.com/products/rstudio/download/#download. Lastly, you can install any of 12 000+ packages (see https://cran.r-project.org/web/views and https://www.rdocumentation.org) using install.packages("PACKAGENAME") from within R. These packages are downloaded from CRAN (the Comprehensive R Archive Network).
According to https://www.datacamp.com/community/tutorials/r-packages-guide, R packages are collections of functions and datasets developed by the community. They increase the power of R either by improving existing base R functionalities, or by adding new ones. For example, you can use the sqldf package to use SQL with R and the RODBC package to connect to RDBMS databases.
In addition, an excellent resource for R users who want to learn SAS is the course offered by the SAS Institute itself:
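As a minimal sketch of the install step, the snippet below wraps install.packages() so that a package is downloaded only when it is missing. The helper name install_if_missing is hypothetical (it is not part of base R), and the repository URL is simply the cloud mirror mentioned above.

```r
# Hypothetical helper (not part of base R): install a CRAN package only when
# it is not already available, then report whether its namespace can be loaded.
install_if_missing <- function(pkg, repos = "https://cloud.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)
  }
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with every R installation, so this call needs no download.
stats_available <- install_if_missing("stats")

# For a CRAN package you would call, for example:
# install_if_missing("sqldf")
# library(sqldf)
```

Checking with requireNamespace() first avoids re-downloading a package every time a script is run.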
https://support.sas.com/edu/schedules.html?ctry=us&crs=SP4R
The e‐learning course is free as of October 2018. The course teaches the following:
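data ajay;     /* create a new dataset named ajay */
set input;     /* read every observation from the existing dataset input */
run;           /* end the step and execute it */
In R, functions do this kind of work, and packages bundle functions with similar purposes together.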
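For comparison, here is a rough R counterpart of the DATA step above. It is only a sketch: the built-in mtcars data frame stands in for the existing dataset called input, and the new column kpl is an illustrative assumption, not something from the SAS example.

```r
# mtcars is a built-in dataset used here as a stand-in for the existing
# dataset that the SAS step calls "input".
input <- mtcars

# R counterpart of: data ajay; set input; run;
ajay <- input

# The copy is independent: changing ajay leaves input untouched.
ajay$kpl <- ajay$mpg * 0.425144  # miles per US gallon to km per litre

dim(input)  # dimensions of the original data frame
dim(ajay)   # one more column than input
```

In R a single assignment plays the role of the whole DATA step, because data frames are ordinary objects held in memory.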
A PROC‐by‐PROC comparison of the SAS language with R functions is shown below. It will be explained in greater detail in later chapters. Some people find R's more compact syntax helpful when coding, while others find SAS easier to learn and prefer to focus on the analysis instead.
[Table: PROC‐by‐PROC comparison of SAS procedures with R functions]
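To give a feel for this kind of correspondence, the sketch below runs a few base R functions next to the SAS procedures they are commonly compared with. The pairings in the comments are rough analogies of my own choosing for illustration, and the built-in mtcars data frame is used only as sample data.

```r
cars <- mtcars                        # built-in example data frame

str(cars)                             # variable names and types, roughly PROC CONTENTS
first_rows <- head(cars, 5)           # first five rows, roughly PROC PRINT limited to 5 observations
mpg_summary <- summary(cars$mpg)      # descriptive statistics, roughly PROC MEANS
cyl_counts <- table(cars$cyl)         # frequency counts, roughly PROC FREQ
sorted <- cars[order(cars$mpg), ]     # sort by a variable, roughly PROC SORT
fit <- lm(mpg ~ wt, data = cars)      # linear regression, roughly PROC REG

print(first_rows)
print(mpg_summary)
print(cyl_counts)
print(coef(fit))
```

Each R call returns an object that can be stored and reused, whereas a SAS procedure typically writes its results to the output window or to an output dataset.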
In this chapter we have introduced R and SAS languages, and briefly compared their main functions/syntax.