This edition first published 2018
© 2018 John Wiley & Sons, Inc.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Ajay Ohri to be identified as the author of this work has been asserted in accordance with law.
“Python” and the Python Logo are trademarks of the Python Software Foundation.
Registered Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
Editorial Office
111 River Street, Hoboken, NJ 07030, USA
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
The publisher and the authors make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of fitness for a particular purpose. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for every situation. In view of on‐going research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. The fact that an organization or website is referred to in this work as a citation and/or potential source of further information does not mean that the author or the publisher endorses the information the organization or website may provide or recommendations it may make. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. No warranty may be created or extended by any promotional statements for this work. Neither the publisher nor the author shall be liable for any damages arising herefrom.
Library of Congress Cataloguing‐in‐Publication Data
Name: Ohri, A. (Ajay), author.
Title: Python® for R users : a data science approach / Ajay Ohri.
Description: Hoboken, NJ : John Wiley & Sons, 2018. | Includes bibliographical references and index. | Identifiers: LCCN 2017022045 (print) | LCCN 2017036415 (ebook) | ISBN 9781119126775 (pdf) | ISBN 9781119126782 (epub) | ISBN 9781119126768 (pbk.)
Subjects: LCSH: Python (Computer program language) | R (Computer program language)
Classification: LCC QA76.73.P98 (ebook) | LCC QA76.73.P98 O37 2017 (print) | DDC 005.13/3–dc23
LC record available at https://lccn.loc.gov/2017022045
Cover design: Wiley
Cover images: (Background) © Duncan Walker/iStockphoto
Dedicated to my family in Delhi, Mumbai, and the United States
and
Kush Ohri (my son whom I love very much)
and
Jesus Christ (my personal savior)
I started my career selling cars in 2003. That was my first job after 2 years of MBA and 4 years of engineering. In addition, I took 2 years off to enter a military academy as an officer cadet (dropped out within 1 year) and to study as a physicist (dropped out after 1 year). Much later, I dropped out of my PhD track (MS Stats) after 1 year in Knoxville. I did not do very well in statistics theory in my engineering, my MBA, or even my grad school. I was only interested in statistical software, and fortunately I was not very bad at using it. So in 2004, I quit selling cars and began writing statistical software for General Electric’s then India‐based offshore company.
I used a language called SAS in a software package called Base SAS. The help provided by the company, also called SAS, for this software and language was quite nice, so it was nice to play with data and code all day and be paid to have fun. After a few years of job changes, I came across open‐source software when I started building my own start‐up. I really like SAS as a language and a company, but as a start‐up guy I could not afford it, and the SAS University Edition did not exist in 2007. Since I needed money to pay for diapers for my baby Kush, and analysis was the only gift God had given me, I turned to R.
R, Open Office, and Ubuntu Linux were my first introduction to open‐source statistical computing, and I persevered in it. In 2007 I started my own start‐up in business analytics writing and consulting, DecisionStats.com. In 2009 I entered the University of Tennessee on a funded assistantship and interned in Silicon Valley for a few weeks in the winter. I dropped out for medical reasons after taking courses across multiple departments, from graphics design to genetic algorithms in the Computer Science Department, apart from those in the Statistics Department. Cross‐domain training helped me a lot to think in various ways to give simple solutions, and I will always be thankful to the kind folks in the Statistics and Computer Science Departments of the University of Tennessee.
Once I wrapped my brain around the vagaries of troubleshooting in Linux and of object‐oriented programming in R, I was good to go and take on consulting projects for data analysis. In those days we used to call it business analytics, but today of course we call it data science.
Since I often forget things, including where I kept my code, I started blogging on things that I felt were useful and might be useful to others. After a few years I discovered that in the real world it was not what I knew but who I knew that really helped my career. So I began interviewing people in analytics and R, and my blog viewership took off. My blog philosophy continues to be that a blog post should be useful, unique, and interesting. By 2016, I had amassed 1,000,000 views on DecisionStats.com, again a surprising turn of events for me. I am most grateful to the 100‐plus people who agreed to be interviewed by me.
2007 and 2008 were early days for analytics blogging for sure. After a few years I had enough material to put together a book and enough credibility to publish with a publisher. In 2012 I came out with my first book and in 2014 with my second. In 2016, the Chinese translation of my first book was published. Surprisingly for me, a review of my second book appeared in the Journal of Statistical Software.
After publishing two books on R, mentoring many start‐ups by consulting and training, engaging consulting clients in real‐world problems, and making an established name in social media, I still felt I needed to learn more.
Data was getting bigger and bigger. It was not enough to know how to write small data analytics in serialized code on a single machine; perhaps it was time to write parallel code across multiple machines for big data analytics. Then there was the divide between statisticians and computer scientists, which fascinated me since I see data as data, a problem to be solved. As Eric S. Raymond wrote in “The Hacker Attitude,” “The world is full of fascinating problems waiting to be solved.”
Then there was the temptation and intellectual appeal of an alternative to R called Python, which came with batteries included (allegedly).
Once my scientific curiosity was piqued, I started learning Python. I found Python was both very good and very bad compared with R. Its community has different sets of rules and behavior (which are always turbulent in the passionate world of open‐source developer ninjas). But the language itself was very different. I don’t care about the language. I love science. But if a person like me, who at least knows how to code a wee bit in R, found it so tough to redo the same thing in Python, I thought maybe others were facing this transition problem too. For big data and for some specific use cases, Python was better in terms of speed. Speed matters, no matter how much Moore’s law conspires with the ether to make it easier for you to write code. R also seemed to be turning into a language where all I did was import a package and run a function with tweaked parameters. As R became the scientific mainstream, replacing the SAS language, and SAS remained the enterprise statistical language, Python and how to write code in it became the thing for anonymous red hat hackers like me to venture into and explore.
As the Internet of people expands to the Internet of things, I feel that budding data scientists should know at least two analytics languages so that they can be secure in their careers. This also gives enterprises an open choice of which software to use for prototyping models and which to deploy in production environments.
The author is grateful to many people working in both the Python and R community for making this book possible. He would especially like to thank Dr. Eric Siegel of Predictive Analytics Conference and John Sall of JMP. He would like to thank all his students in 2012–2016.
This book would not have been completed without the mentoring and logistical support of Madhur Batra. On the technical side, inputs and hard work from his interns Yashika and Chandan Routray (IIT Kharagpur) and his DecisionStats team helped him. His coresearcher F. Xavier provided invaluable help with case studies.
The scope of the book is to introduce Python as a platform for data science practitioners, including aspiring data scientists. The book is aimed at people who know R coding at various levels of expertise, but even those who know no coding in any language may find some value in it. It is not aimed at members of research communities and research departments. The focus is on simple tutorials and actionable analytics, not theory. I have also tried to incorporate R code to give learners a compare‐and‐contrast approach.
“Introduction” deals with Python and its comparison with R. It lists the functions and packages used in both languages, as well as some managerial models that the author feels data scientists should be aware of. It also introduces the reader to the basics of the Python and R languages.
“Data Input” deals with an approach for getting data of various volume, variety, and velocity into Python. This includes web scraping, databases, NoSQL data, and spreadsheet‐like data.
“Data Inspection and Data Quality” deals with choices in verifying data quality in Python.
“Exploratory Data Analysis” deals with basic data exploration and data summarization, including rolling up data by group‐by criteria.
“Statistical Modeling” deals with creating models based on statistical analysis, including OLS regression, that industry can use to build propensity models.
“Data Visualization” deals with visual methods to inspect raw and rolled‐up data.
“Machine Learning Made Easier” deals with commonly used data mining methods for model building. It covers both supervised and unsupervised methods, with further emphasis on regression and clustering techniques. Time series forecasting covers methods for forecasting time‐ordered data. Text mining deals with text mining methods and natural language processing. Web analytics looks at using Python for analyzing web data. Advanced data science looks at methods and techniques for newer use cases, including cloud computing‐enabled big data analysis, social network analysis, and the Internet of things.
“Conclusion and Summary” lists what we learned and tried to achieve in this book, along with our perspective on the future growth of R, Python, and statistical computing, and on giving data science a credible foothold for the future.
The book has been written from a practical use case perspective to help people navigate multiple open‐source languages in the pursuit of data science excellence. The author believes that no one software or language can solve all kinds of data problems all the time. An optimized approach to learning is better than an ideological approach to learning statistical software. Past habits of thinking must be confronted to speed up future learning.
I will continue to use screenshots as a tutorial device, and I will draw upon my experience in data science consulting to highlight practical data parsing problems. This is because choosing the right tool, technique, and even package is not so time consuming, but the sheer variety of data and business problems can consume the data scientist’s time, which can later affect the quality of his judgment and solution.
This is a book for budding data scientists and existing data scientists married to other languages like SPSS or R or Julia. I am trying to be practical about solving problems in data. Thus there will be very little theory.
I am focused on practical solutions. I will therefore proceed on the assumption that the user wants to do data science or analytics at the lowest cost and with the greatest accuracy, robustness, and ease possible. A true scientist always keeps his mind open to data and options regardless of who made whom. The author finds that information asymmetry and brand clutter have managed to confuse audiences about the true benefits of R versus Python versus other languages. The instructions and tutorials within this book come with no warranty, and you follow them at your own risk.
As a special note on formatting of this manuscript, the author mostly writes on Google Docs, but here he is writing using the GUI LyX for the typesetting software LaTeX, and he confesses he is not very good at it. We do hope the book is read by business users, technical users, CTOs keen to know more about R and Python and when to use open‐source analytics, and students wishing to enter a very nice career as data scientists. R is well known for excellent graphics but is not so suitable for bigger datasets in its native, ready‐to‐use open‐source version. Python is well known for being great with big datasets and for its flexibility but has always played catch‐up with the number of good statistical libraries available in R.
The enterprise CTO can reduce costs incredibly by using open‐source software and hardware via blended cloud and blended open‐source software.
Tim Peters
Source: https://www.python.org/dev/peps/pep-0020/