CONTENTS

Cover

Foreword by Gareth James

Foreword by Ravi Bapna

Preface to the Python Edition

Acknowledgments

Part I Preliminaries

Chapter 1 Introduction
1.1 What Is Business Analytics?
1.2 What Is Data Mining?
1.3 Data Mining and Related Terms
1.4 Big Data
1.5 Data Science
1.6 Why Are There So Many Different Methods?
1.7 Terminology and Notation
1.8 Road Maps to This Book
Chapter 2 Overview of the Data Mining Process
2.1 Introduction
2.2 Core Ideas in Data Mining
2.3 The Steps in Data Mining
2.4 Preliminary Steps
2.5 Predictive Power and Overfitting
2.6 Building a Predictive Model
2.7 Using Python for Data Mining on a Local Machine
2.8 Automating Data Mining Solutions
2.9 Ethical Practice in Data Mining
Problems
Notes

Part II Data Exploration and Dimension Reduction

Chapter 3 Data Visualization
3.1 Introduction
3.2 Data Examples
3.3 Basic Charts: Bar Charts, Line Graphs, and Scatter Plots
3.4 Multidimensional Visualization
3.5 Specialized Visualizations
3.6 Summary: Major Visualizations and Operations, by Data Mining Goal
Problems
Notes
Chapter 4 Dimension Reduction
4.1 Introduction
4.2 Curse of Dimensionality
4.3 Practical Considerations
4.4 Data Summaries
4.5 Correlation Analysis
4.6 Reducing the Number of Categories in Categorical Variables
4.7 Converting a Categorical Variable to a Numerical Variable
4.8 Principal Components Analysis
4.9 Dimension Reduction Using Regression Models
4.10 Dimension Reduction Using Classification and Regression Trees
Problems
Notes

Part III Performance Evaluation

Chapter 5 Evaluating Predictive Performance
5.1 Introduction
5.2 Evaluating Predictive Performance
5.3 Judging Classifier Performance
5.4 Judging Ranking Performance
5.5 Oversampling
Problems
Notes

Part IV Prediction and Classification Methods

Chapter 6 Multiple Linear Regression
6.1 Introduction
6.2 Explanatory vs. Predictive Modeling
6.3 Estimating the Regression Equation and Prediction
6.4 Variable Selection in Linear Regression
Appendix: Using Statsmodels
Problems
Chapter 7 k-Nearest Neighbors (k-NN)
7.1 The k-NN Classifier (Categorical Outcome)
7.2 k-NN for a Numerical Outcome
7.3 Advantages and Shortcomings of k-NN Algorithms
Problems
Notes
Chapter 8 The Naive Bayes Classifier
8.1 Introduction
8.2 Applying the Full (Exact) Bayesian Classifier
8.3 Advantages and Shortcomings of the Naive Bayes Classifier
Problems
Chapter 9 Classification and Regression Trees
9.1 Introduction
9.2 Classification Trees
9.3 Evaluating the Performance of a Classification Tree
9.4 Avoiding Overfitting
9.5 Classification Rules from Trees
9.6 Classification Trees for More Than Two Classes
9.7 Regression Trees
9.8 Improving Prediction: Random Forests and Boosted Trees
9.9 Advantages and Weaknesses of a Tree
Problems
Notes
Chapter 10 Logistic Regression
10.1 Introduction
10.2 The Logistic Regression Model
10.3 Example: Acceptance of Personal Loan
10.4 Evaluating Classification Performance
10.5 Logistic Regression for Multi-class Classification
10.6 Example of Complete Analysis: Predicting Delayed Flights
Appendix: Using Statsmodels
Problems
Notes
Chapter 11 Neural Nets
11.1 Introduction
11.2 Concept and Structure of a Neural Network
11.3 Fitting a Network to Data
11.4 Required User Input
11.5 Exploring the Relationship Between Predictors and Outcome
11.6 Deep Learning
11.7 Advantages and Weaknesses of Neural Networks
Problems
Notes
Chapter 12 Discriminant Analysis
12.1 Introduction
12.2 Distance of a Record from a Class
12.3 Fisher’s Linear Classification Functions
12.4 Classification Performance of Discriminant Analysis
12.5 Prior Probabilities
12.6 Unequal Misclassification Costs
12.7 Classifying More Than Two Classes
12.8 Advantages and Weaknesses
Problems
Notes
Chapter 13 Combining Methods: Ensembles and Uplift Modeling
13.1 Ensembles
13.2 Uplift (Persuasion) Modeling
13.3 Summary
Problems
Notes

Part V Mining Relationships Among Records

Chapter 14 Association Rules and Collaborative Filtering
14.1 Association Rules
14.2 Collaborative Filtering
14.3 Summary
Problems
Notes
Chapter 15 Cluster Analysis
15.1 Introduction
15.2 Measuring Distance Between Two Records
15.3 Measuring Distance Between Two Clusters
15.4 Hierarchical (Agglomerative) Clustering
15.5 Non-Hierarchical Clustering: The k-Means Algorithm
Problems

Part VI Forecasting Time Series

Chapter 16 Handling Time Series
16.1 Introduction
16.2 Descriptive vs. Predictive Modeling
16.3 Popular Forecasting Methods in Business
16.4 Time Series Components
16.5 Data-Partitioning and Performance Evaluation
Problems
Notes
Chapter 17 Regression-Based Forecasting
17.1 A Model with Trend
17.2 A Model with Seasonality
17.3 A Model with Trend and Seasonality
17.4 Autocorrelation and ARIMA Models
Problems
Notes
Chapter 18 Smoothing Methods
18.1 Introduction
18.2 Moving Average
18.3 Simple Exponential Smoothing
18.4 Advanced Exponential Smoothing
Problems
Notes

Part VII Data Analytics

Chapter 19 Social Network Analytics
19.1 Introduction
19.2 Directed vs. Undirected Networks
19.3 Visualizing and Analyzing Networks
19.4 Social Data Metrics and Taxonomy
19.5 Using Network Metrics in Prediction and Classification
19.6 Collecting Social Network Data with Python
19.7 Advantages and Disadvantages
Problems
Notes
Chapter 20 Text Mining
20.1 Introduction
20.2 The Tabular Representation of Text: Term-Document Matrix and “Bag-of-Words”
20.3 Bag-of-Words vs. Meaning Extraction at Document Level
20.4 Preprocessing the Text
20.5 Implementing Data Mining Methods
20.6 Example: Online Discussions on Autos and Electronics
20.7 Summary
Problems
Notes

Part VIII Cases

Chapter 21 Cases
21.1 Charles Book Club
21.2 German Credit
21.3 Tayko Software Cataloger
21.4 Political Persuasion
21.5 Taxi Cancellations
21.6 Segmenting Consumers of Bath Soap
21.7 Direct-Mail Fundraising
21.8 Catalog Cross-Selling
21.9 Time Series Case: Forecasting Public Transportation Demand
Notes

References

Data Files Used in the Book

Python Utilities Functions

Index

End User License Agreement

List of Tables

Chapter 1

Table 1.1

Chapter 2

Table 2.1
Table 2.2
Table 2.3
Table 2.4
Table 2.5
Table 2.6
Table 2.7
Table 2.8
Table 2.9
Table 2.10
Table 2.11
Table 2.12
Table 2.13
Table 2.14
Table 2.15
Table 2.16
Table 2.17
Table 2.18

Chapter 3

Table 3.1
Table 3.2

Chapter 4

Table 4.1
Table 4.2
Table 4.3
Table 4.4
Table 4.5
Table 4.6
Table 4.7
Table 4.8
Table 4.9
Table 4.10
Table 4.11
Table 4.12
Table 4.13
Table 4.14

Chapter 5

Table 5.1
Table 5.2
Table 5.3
Table 5.4
Table 5.5
Table 5.6
Table 5.7

Chapter 6

Table 6.1
Table 6.2
Table 6.3
Table 6.4
Table 6.5
Table 6.6
Table 6.7
Table 6.8
Table 6.9
Table 6.10
Table 6.11
Table 6.12
Table 6.13

Chapter 7

Table 7.1
Table 7.2
Table 7.3
Table 7.4

Chapter 8

Table 8.1
Table 8.2
Table 8.3
Table 8.4
Table 8.5
Table 8.6
Table 8.7

Chapter 9

Table 9.1
Table 9.2
Table 9.3
Table 9.4
Table 9.5
Table 9.6
Table 9.7
Table 9.8
Table 9.9
Table 9.10

Chapter 10

Table 10.1
Table 10.2
Table 10.3
Table 10.4
Table 10.5
Table 10.6
Table 10.7
Table 10.8
Table 10.9
Table 10.10
Table 10.11

Chapter 11

Table 11.1
Table 11.2
Table 11.3
Table 11.4
Table 11.5
Table 11.6
Table 11.7

Chapter 12

Table 12.1
Table 12.2
Table 12.3
Table 12.4
Table 12.5

Chapter 13

Table 13.1
Table 13.2
Table 13.3
Table 13.4
Table 13.5
Table 13.6
Table 13.7
Table 13.8

Chapter 14

Table 14.1
Table 14.2
Table 14.3
Table 14.4
Table 14.5
Table 14.6
Table 14.7
Table 14.8
Table 14.9
Table 14.10
Table 14.11
Table 14.12
Table 14.13
Table 14.14
Table 14.15
Table 14.16
Table 14.17

Chapter 15

Table 15.1
Table 15.2
Table 15.3
Table 15.4
Table 15.5
Table 15.6
Table 15.7
Table 15.8
Table 15.9
Table 15.10
Table 15.11

Chapter 16

Table 16.1
Table 16.2

Chapter 17

Table 17.1
Table 17.2
Table 17.3
Table 17.4
Table 17.5
Table 17.6
Table 17.7
Table 17.8
Table 17.9
Table 17.10
Table 17.11

Chapter 18

Table 18.1
Table 18.2
Table 18.3

Chapter 19

Table 19.1
Table 19.2
Table 19.3
Table 19.4
Table 19.5
Table 19.6
Table 19.7
Table 19.8
Table 19.9
Table 19.10
Table 19.11

Chapter 20

Table 20.1
Table 20.2
Table 20.3
Table 20.4
Table 20.5
Table 20.6
Table 20.7
Table 20.8

Chapter 21

Table 21.1
Table 21.2
Table 21.3
Table 21.4
Table 21.5
Table 21.6
Table 21.7
Table 21.8
Table 21.9

List of Illustrations

Chapter 1

Figure 1.1 Two methods for separating owners from nonowners
Figure 1.2 Data mining from a process perspective. Numbers in parentheses indicate chapter...
Figure 1.3 Jupyter notebook

Chapter 2

Figure 2.1 Schematic of the data modeling process
Figure 2.2 Scatter plot for advertising and sales data
Figure 2.3 Overfitting: This function fits the data with no error
Figure 2.4 Three data partitions and their role in the data mining process

Chapter 3

Figure 3.1 Basic plots: line graph (a), scatter plot (b), bar chart for numerical variable...
Figure 3.2 Distribution charts for numerical variable MEDV. (a) Histogram, (b) Boxplot
Figure 3.3 Side-by-side boxplots for exploring the CAT.MEDV output variable by different n...
Figure 3.4 Heatmap of a correlation table. Darker values denote stronger correlation. Blue...
Figure 3.5 Heatmap of missing values in a dataset on motor vehicle collisions. Grey denote...
Figure 3.6 Adding categorical variables by color-coding and multiple panels. (a) Scatter p...
Figure 3.7 Scatter plot matrix for MEDV and three numerical predictors
Figure 3.8 Rescaling can enhance plots and reveal patterns. (a) original scale, (b) logari...
Figure 3.9 Time series line graphs using different aggregations (b,d), adding curves (a), ...
Figure 3.10 Scatter plot with labeled points
Figure 3.11 Scatter plot of Large Dataset with reduced marker size, jittering, and more tra...
Figure 3.12 Parallel coordinates plot for Boston Housing data. Each of the variables (shown...
Figure 3.13 Multiple inter-linked plots in a single view (using Spotfire). Note the marked ...
Figure 3.14 Network plot of eBay sellers (black circles) and buyers (grey circles) of Swaro...
Figure 3.15 Treemap showing nearly 11,000 eBay auctions, organized by item category, subcat...
Figure 3.16 Treemap showing nearly 11,000 eBay auctions, organized by item category. Rectan...
Figure 3.17 Map chart of students’ and instructors’ locations on a Google Map. Source: data...
Figure 3.18 World maps comparing “well-being” (a) to GDP (b). Shading by average “global we...

Chapter 4

Figure 4.1 Distribution of CAT.MEDV (blue denotes CAT.MEDV = 0) by ZN. Similar bars indica...
Figure 4.2 Quarterly Revenues of Toys “R” Us, 1992–1995
Figure 4.3 Scatter plot of rating vs. calories for 77 breakfast cereals, with the two ...
Figure 4.4 Scatter plot of the second vs. first principal components scores for the normal...

Chapter 5

Figure 5.1 Histograms and boxplots of Toyota price prediction errors, for training and val...
Figure 5.2 Cumulative gains chart (a) and decile lift chart (b) for continuous outcome var...
Figure 5.3 High (a) and low (b) levels of separation between two classes, using two predic...
Figure 5.4 Plotting accuracy and overall error as a function of the cutoff value (riding m...
Figure 5.5 ROC curve for riding mowers example
Figure 5.6 Cumulative gains chart for the mower example
Figure 5.7 Decile lift chart
Figure 5.8 Cumulative gains curve incorporating costs
Figure 5.9 Classification assuming equal costs of misclassification
Figure 5.10 Classification assuming unequal costs of misclassification
Figure 5.11 Classification using oversampling to account for unequal costs
Figure 5.12 Decile lift chart for transaction data
Figure 5.13 Cumulative gains (a) and decile lift charts (b) for software services product s...

Chapter 6

Figure 6.1 Histogram of model errors (based on validation set)

Chapter 7

Figure 7.1 Scatter plot of Lot Size vs. Income for the 14 training set households (sol...

Chapter 8

Figure 8.1 Cumulative gains chart of naive Bayes classifier applied to flight delays data

Chapter 9

Figure 9.1 Example of a tree for classifying bank customers as loan acceptors or nonaccept...
Figure 9.2 Scatter plot of Lot Size vs. Income for 24 owners and nonowners of riding mower...
Figure 9.3 Splitting the 24 records by Income value of 59.7
Figure 9.4 Values of the Gini index for a two-class case as a function of the proportion o...
Figure 9.5 Splitting the 24 records first by Income value of 59.7 and then Lot size value ...
Figure 9.6 Final stage of recursive partitioning; each rectangle consisting of a single cl...
Figure 9.7 Tree representation of first split (corresponds to Figure 9.3)
Figure 9.8 Tree representation of first three splits.
Figure 9.9 Tree representation after all splits (corresponds to Figure 9.6). This is the f...
Figure 9.10 A full tree for the loan acceptance data using the training set (3000 records)
Figure 9.11 Error rate as a function of the number of splits for training vs. validation da...
Figure 9.12 Smaller classification tree for the loan acceptance data using the training set...
Figure 9.13 Fine-tuned classification tree for the loan acceptance data using the training ...
Figure 9.14 Fine-tuned regression tree for Toyota Corolla prices
Figure 9.15 Variable importance plot from Random forest (Personal Loan Example, for code se...
Figure 9.16 A two-predictor case with two classes. The best separation is achieved with a d...

Chapter 10

Figure 10.1 (a) Odds and (b) logit as a function of p
Figure 10.2 Plot of data points (Personal Loan as a function of Income) and the fitted logi...
Figure 10.3 Cumulative gains chart and decile-wise lift chart for the validation data for U...
Figure 10.4 Proportion of delayed flights by each of the six predictors. Time of day is div...
Figure 10.5 Percent of delayed flights (darker = higher %delays) by day of week, origin, an...
Figure 10.6 Confusion matrix and cumulative gains chart for the flight delays validation da...
Figure 10.7 Cumulative gains chart for logistic regression model with fewer predictors on t...

Chapter 11

Figure 11.1 Multilayer feedforward neural network
Figure 11.2 Neural network for the tiny example. Rectangles represent nodes (“neurons”), w...
Figure 11.3 Computing node outputs (values are on right side within each node) using the fi...
Figure 11.4 Neural network for the tiny example with final weights and bias values (values ...
Figure 11.5 Line drawing, from an 1893 Funk and Wagnalls publication
Figure 11.6 Focusing on the line of the man’s chin
Figure 11.7 3 × 3 pixels representation of line on man’s chin using shading (a) and values ...
Figure 11.8 Convolution network process, supervised learning: The repeated filtering in the...
Figure 11.9 Autoencoder network process: The repeated filtering in the network convolution ...

Chapter 12

Figure 12.1 Scatter plot of Lot Size vs. Income for 24 owners and nonowners of riding mower...
Figure 12.2 Personal loan acceptance as a function of income and credit card spending for 5...
Figure 12.3 Class separation obtained from the discriminant model (compared to ad hoc line ...

Chapter 14

Figure 14.1 Recommendations under “Frequently bought together” are based on association rul...

Chapter 15

Figure 15.1 Scatter plot of Fuel Cost vs. Sales for the 22 utilities
Figure 15.2 Two-dimensional representation of several different distance measures between P...
Figure 15.3 Dendrogram: single linkage (a) and average linkage (b) for all 22 utilities, us...
Figure 15.4 Heatmap for the 22 utilities (in rows). Rows are sorted by the six clusters fro...
Figure 15.5 Visual presentation (profile plot) of cluster centroids
Figure 15.6 Comparing different choices of k in terms of overall average within-cluster d...

Chapter 16

Figure 16.1 Monthly ridership on Amtrak trains (in thousands) from January 1991 to March 20...
Figure 16.2 Time plots of the daily number of vehicles passing through the Baregg tunnel, S...
Figure 16.3 Plots that enhance the different components of the time series. (a) Zoom-in to ...
Figure 16.4 Naive and seasonal naive forecasts in a 3-year validation set for Amtrak riders...
Figure 16.5 Average annual weekly hours spent by Canadian manufacturing workers

Chapter 17

Figure 17.1 A linear trend fit to Amtrak ridership
Figure 17.2 A linear trend fit to Amtrak ridership in the training period (a) and forecaste...
Figure 17.3 Exponential (green) and linear (orange) trend used to forecast Amtrak ridership...
Figure 17.4 Quadratic trend model used to forecast Amtrak ridership. Plots of fitted, forec...
Figure 17.5 Regression model with seasonality applied to the Amtrak ridership (a) and its f...
Figure 17.6 Regression model with trend and seasonality applied to Amtrak ridership (a) and...
Figure 17.7 Autocorrelation plot for lags 1–12 (for first 24 months of Amtrak ridership)
Figure 17.8 Autocorrelation plot of forecast errors series from Figure 17.6
Figure 17.9 Fitting an AR(1) model to the residual series from Figure 17.6
Figure 17.10 Autocorrelations of residuals-of-residuals series
Figure 17.11 Seasonally adjusted pre-September-11 AIR series
Figure 17.12 Average annual weekly hours spent by Canadian manufacturing workers
Figure 17.13 Quarterly Revenues of Toys “R” Us, 1992–1995
Figure 17.14 Daily close price of Walmart stock, February 2001–2002
Figure 17.15 Department store quarterly sales series
Figure 17.16 Fit of regression model for department store sales
Figure 17.17 Monthly sales at Australian souvenir shop in Australian dollars (a) and in log-...
Figure 17.18 Quarterly shipments of US household appliances over 5 years
Figure 17.19 Monthly sales of six types of Australian wines between 1980 and 1994

Chapter 18

Figure 18.1 Schematic of centered moving average (a) and trailing moving average (b), both ...
Figure 18.2 Centered moving average (smooth blue line) and trailing moving average (broken ...
Figure 18.3 Trailing moving average forecaster with w = 12 applied to Amtrak ridership se...
Figure 18.4 Output for simple exponential smoothing forecaster with α = 0.2, applied to the...
Figure 18.5 Holt-Winters exponential smoothing applied to Amtrak ridership series
Figure 18.6 Seasonally adjusted pre-September-11 AIR series
Figure 18.7 Department store quarterly sales series
Figure 18.8 Forecasts and actuals (a) and forecast errors (b) using exponential smoothing
Figure 18.9 Quarterly shipments of US household appliances over 5 years
Figure 18.10 Monthly sales of a certain shampoo
Figure 18.11 Quarterly sales of natural gas over 4 years
Figure 18.12 Monthly sales of six types of Australian wines between 1980 and 1994

Chapter 19

Figure 19.1 Tiny hypothetical LinkedIn network; the edges represent connections among the m...
Figure 19.2 Tiny hypothetical Twitter network with directed edges (arrows) showing who foll...
Figure 19.3 Edge weights represented by line thickness, for example, bandwidth capacity bet...
Figure 19.4 Drug laundry network in San Antonio, TX
Figure 19.5 Two different layouts of the tiny LinkedIn network presented in Figure 19.1
Figure 19.6 The degree 1 (a) and degree 2 (b) egocentric networks for Peter, from the Linke...
Figure 19.7 A relatively sparse network
Figure 19.8 A relatively dense network
Figure 19.9 The networks for suspect A (a), suspect AA (b), and suspect AAA (c)
Figure 19.10 Network for link prediction exercise

Chapter 20

Figure 20.1 Decile-wise lift chart for autos-electronics document classification

Foreword by Gareth James

The field of statistics has existed in one form or another for 200 years, and by the second half of the 20th century had evolved into a well-respected and essential academic discipline. However, its prominence expanded rapidly in the 1990s with the explosion of new, and enormous, data sources. For the first part of this century, much of this attention was focused on biological applications, in particular, genetics data generated as a result of the sequencing of the human genome. However, the last decade has seen a dramatic increase in the availability of data in the business disciplines, and a corresponding interest in business-related statistical applications.

The impact has been profound. Ten years ago, when I was able to attract a full class of MBA students to my new statistical learning elective, my colleagues were astonished because our department struggled to fill most electives. Today, we offer a Masters in Business Analytics, which is the largest specialized master’s program in the school and has application volume rivaling that of our MBA programs. Our department’s faculty size and course offerings have increased dramatically, yet the MBA students are still complaining that the classes are all full. Google’s chief economist, Hal Varian, was indeed correct in 2009 when he stated that “the sexy job in the next 10 years will be statisticians.”

This demand is driven by a simple, but undeniable, fact. Business analytics solutions have produced significant and measurable improvements in business performance, on multiple dimensions and in numerous settings, and as a result, there is a tremendous demand for individuals with the requisite skill set. However, training students in these skills is challenging given that, in addition to the obvious required knowledge of statistical methods, they need to understand business-related issues, possess strong communication skills, and be comfortable dealing with multiple computational packages. Most statistics texts concentrate on abstract training in classical methods, without much emphasis on practical, let alone business, applications.

This book has by far the most comprehensive review of business analytics methods that I have ever seen, covering everything from classical approaches such as linear and logistic regression, through modern methods like neural networks, bagging, and boosting, to much more business-specific procedures such as social network analysis and text mining. If not the bible, it is at the least a definitive manual on the subject. However, just as important as the list of topics is the way that they are all presented in an applied fashion using business applications. Indeed, the last chapter is entirely dedicated to 10 separate cases where business analytics approaches can be applied.

In this latest edition, the authors have added support for Python, a programming language that is rapidly gaining popularity among data scientists. The book provides detailed descriptions and code involving applications of Python in numerous business settings, ensuring that the reader will actually be able to apply their knowledge to real-life problems. I’m confident that this book will be an indispensable tool for any business analytics course using Python.

We recently introduced a business analytics course into our required MBA core curriculum, and I intend to make heavy use of this book in developing the syllabus. I’m confident that it will be an indispensable tool for any such course.

GARETH JAMES

Marshall School of Business, University of Southern California, 2019

Foreword by Ravi Bapna

Data is the new gold, and mining this gold to create business value in today’s context of a highly networked and digital society requires a skillset that we haven’t traditionally delivered in business, statistics, or engineering programs on their own. For those businesses and organizations that feel overwhelmed by today’s Big Data, the phrase “you ain’t seen nothing yet” comes to mind. Yesterday’s three major sources of Big Data (the 20+ years of investment in enterprise systems such as ERP, CRM, and SCM; the 3 billion plus people on the online social grid; and the close to 5 billion people carrying increasingly sophisticated mobile devices) are going to be dwarfed by tomorrow’s smarter physical ecosystems fueled by the Internet of Things (IoT) movement.

The idea that we can use sensors to connect physical objects such as homes, automobiles, roads, even garbage bins and streetlights, to digitally optimized systems of governance goes hand in glove with bigger data and the need for deeper analytical capabilities. We are not far away from a smart refrigerator sensing that you are short on, say, eggs, populating your grocery store’s mobile app’s shopping list, and arranging a TaskRabbit to do a grocery run for you. Or the refrigerator negotiating a deal with an Uber driver to deliver an evening meal to you. Nor are we far away from sensors embedded in roads and vehicles that can compute traffic congestion, track roadway wear and tear, record vehicle use, and factor these into dynamic usage-based pricing, insurance rates, and even taxation. This brave new world is going to be fueled by analytics and the ability to harness data for competitive advantage.

Business Analytics is an emerging discipline that is going to help us ride this new wave. This new Business Analytics discipline requires individuals who are grounded in the fundamentals of business such that they know the right questions to ask, who have the ability to harness, store, and optimally process vast datasets from a variety of structured and unstructured sources, and who can then use an array of techniques from machine learning and statistics to uncover new insights for decision-making. Such individuals are a rare commodity today, but their creation has been the focus of this book for a decade now. This book’s forte is that it explains the core set of concepts required for today’s business analytics professionals using real-world, data-rich cases in a hands-on manner, without sacrificing academic rigor. It provides a modern-day foundation for Business Analytics, the notion of linking the x’s to the y’s of interest in a predictive sense. I say this with the confidence of someone who was probably the first adopter of the zeroth edition of this book (Spring 2006 at the Indian School of Business).

After the publication of the R edition in 2018, the new Python edition is an important addition. Python is gaining in popularity among analytics professionals, and the two open source languages constitute the primary statistical modeling and machine learning programming environments in data science.

I look forward to using the book in multiple fora, in executive education, in MBA classrooms, in MS-Business Analytics programs, and in Data Science bootcamps. I trust you will too!

RAVI BAPNA

Carlson School of Management, University of Minnesota, 2019

Preface to the Python Edition

This textbook first appeared in early 2007 and has been used by numerous students and practitioners and in many courses, including our own, which we have taught both online and in person for more than 15 years. The first edition, based on the Excel add-in Analytic Solver Data Mining (previously XLMiner), was followed by two more Analytic Solver editions, a JMP edition, an R edition, and now this Python edition, with its companion website, www.dataminingbook.com.

This new Python edition, which relies on the free and open-source Python programming language, presents output from Python, as well as the code used to produce that output, including specification of the appropriate packages and functions, the dominant one being scikit-learn. Unlike computer science- or statistics-oriented textbooks, the focus in this book is on data mining concepts and how to implement the associated algorithms in Python. We assume a basic familiarity with Python.

For this Python edition, a new co-author, Peter Gedeck, comes on board, bringing extensive data science experience in business. In addition to providing Python code and output, this edition also incorporates updates and new material based on feedback from instructors teaching MBA, MS, undergraduate, diploma, and executive courses, and from their students as well. Importantly, this edition includes for the first time an extended section on Data Ethics (Section 2.9).

A note about the book’s title: The first two editions of the book used the title Data Mining for Business Intelligence. Business Intelligence today refers mainly to reporting and data visualization (“what is happening now”), while Business Analytics has taken over the “advanced analytics,” which include predictive analytics and data mining. In this new edition, we therefore use the updated terms.

This Python edition includes the material that was recently added in the third edition of the original (Analytic Solver based) book:

Social network analysis
Text mining
Ensembles
Uplift modeling
Collaborative filtering

Since the appearance of the (Analytic Solver based) second edition, the landscape of the courses using the textbook has greatly expanded: whereas initially, the book was used mainly in semester-long elective MBA-level courses, it is now used in a variety of courses in Business Analytics degrees and certificate programs, ranging from undergraduate programs, to post-graduate and executive education programs. Courses in such programs also vary in their duration and coverage. In many cases, this textbook is used across multiple courses. The book is designed to continue supporting the general “Predictive Analytics” or “Data Mining” course as well as supporting a set of courses in dedicated business analytics programs.

A general “Business Analytics,” “Predictive Analytics,” or “Data Mining” course, common in MBA and undergraduate programs as a one-semester elective, would cover Parts I–III, and choose a subset of methods from Parts IV and V. Instructors can choose to use cases as team assignments, class discussions, or projects. For a two-semester course, Part VI might be considered, and we recommend introducing the new Part VII (Data Analytics).

For a set of courses in a dedicated business analytics program, here are a few courses that have been using our book:

Predictive Analytics (Supervised Learning): In a dedicated Business Analytics program, the topic of Predictive Analytics is typically taught across a set of courses. The first course would cover Parts I–IV, with instructors typically choosing a subset of methods from Part IV according to the course length. We recommend including Chapter 13 on ensembles in such a course, as well as “Part VII: Data Analytics.”
Predictive Analytics (Unsupervised Learning): This course introduces data exploration and visualization, dimension reduction, mining relationships, and clustering (Parts II and V). If this course follows the Predictive Analytics (Supervised Learning) course, then it is useful to examine examples and approaches that integrate unsupervised and supervised learning, such as the new part on “Data Analytics.”
Forecasting Analytics: A dedicated course on time series forecasting would rely on Part VI.
Advanced Analytics: A course that integrates the learnings from Predictive Analytics (supervised and unsupervised learning). Such a course can focus on Part VII: Data Analytics, where social network analytics and text mining are introduced. Some instructors choose to use the Cases (Chapter 21) in such a course.

In all courses, we strongly recommend including a project component, where data are either collected by students according to their interest or provided by the instructor (e.g., from the many data mining competition datasets available). From our experience and other instructors’ experience, such projects enhance the learning and provide students with an excellent opportunity to understand the strengths of data mining and the challenges that arise in the process.

GALIT SHMUELI, PETER C. BRUCE, PETER GEDECK, AND NITIN R. PATEL

2019

Acknowledgments

We thank the many people who assisted us in improving the book from its inception as Data Mining for Business Intelligence in 2006 (using XLMiner, now Analytic Solver), through the recent editions now called Data Mining for Business Analytics, including two later XLMiner editions, a JMP edition, an R edition, and now for the first time, a Python edition.

Anthony Babinec, who has been using earlier editions of this book for years in his data mining courses at Statistics.com, provided us with detailed and expert corrections. Dan Toy and John Elder IV greeted our project with early enthusiasm and provided detailed and useful comments on initial drafts. Ravi Bapna, who used an early draft in a data mining course at the Indian School of Business and later at the University of Minnesota, has provided invaluable comments and helpful suggestions since the book’s start.

Many of the instructors, teaching assistants, and students using earlier editions of the book have contributed invaluable feedback both directly and indirectly, through fruitful discussions, learning journeys, and interesting data mining projects that have helped shape and improve the book. These include MBA students from the University of Maryland, MIT, the Indian School of Business, National Tsing Hua University, and Statistics.com. Instructors from many universities and teaching programs, too numerous to list, have supported and helped improve the book since its inception. Scott Nestler has been a helpful friend of this book project from the beginning.

Kuber Deokar, instructional operations supervisor at Statistics.com, has been unstinting in his assistance, support, and detailed attention. We also thank Anuja Kulkarni, assistant teacher at Statistics.com. Valerie Troiano has shepherded many instructors and students through the Statistics.com courses that have helped nurture the development of these books.

Colleagues and family members have been providing ongoing feedback and assistance with this book project. Boaz Shmueli and Raquelle Azran gave detailed editorial comments and suggestions on the first two editions; Bruce McCullough and Adam Hughes did the same for the first edition. Noa Shmueli provided careful proofs of the third edition. Ran Shenberger offered design tips. Che Lin and Boaz Shmueli provided feedback on Deep Learning. Ken Strasma, founder of the microtargeting firm HaystaqDNA and director of targeting for the 2004 Kerry campaign and the 2008 Obama campaign, provided the scenario and data for the section on uplift modeling. We also thank Jen Golbeck, director of the Social Intelligence Lab at the University of Maryland and author of Analyzing the Social Web, whose book inspired our presentation in the chapter on social network analytics. Randall Pruim contributed extensively to the chapter on visualization. Inbal Yahav, co-author of the R edition, helped improve the social network analytics and text mining chapters.

Marietta Tretter at Texas A&M shared comments and thoughts on the time series chapters, and Stephen Few and Ben Shneiderman provided feedback and suggestions on the data visualization chapter and overall design tips.

Susan Palocsay and Mia Stephens have provided suggestions and feedback on numerous occasions, as have Margret Bjarnadottir, and, specifically for this Python edition, Mohammad Salehan. We also thank Catherine Plaisant at the University of Maryland’s Human–Computer Interaction Lab, who helped out in a major way by contributing exercises and illustrations to the data visualization chapter. Gregory Piatetsky-Shapiro, founder of KDNuggets.com, has been generous with his time and counsel in the early years of this project.

We thank colleagues at the MIT Sloan School of Management for their support during the formative stage of this book—Dimitris Bertsimas, James Orlin, Robert Freund, Roy Welsch, Gordon Kaufmann, and Gabriel Bitran. As teaching assistants for the data mining course at Sloan, Adam Mersereau gave detailed comments on the notes and cases that were the genesis of this book, Romy Shioda helped with the preparation of several cases and exercises used here, and Mahesh Kumar helped with the material on clustering.

Shrivardhan Lele, Wolfgang Jank, and Paul Zantek, colleagues at the University of Maryland’s Smith School of Business, provided practical advice and comments. We thank Robert Windle and University of Maryland MBA students Timothy Roach, Pablo Macouzet, and Nathan Birckhead for invaluable datasets. We also thank MBA students Rob Whitener and Daniel Curtis for the heatmap and map charts.

Anand Bodapati provided both data and advice. Jake Hofman from Microsoft Research and Sharad Borle assisted with data access. Suresh Ankolekar and Mayank Shah helped develop several cases and provided valuable pedagogical comments. Vinni Bhandari helped write the Charles Book Club case.

We would like to thank Marvin Zelen, L. J. Wei, and Cyrus Mehta at Harvard, as well as Anil Gore at Pune University, for thought-provoking discussions on the relationship between statistics and data mining. Our thanks to Richard Larson of the Engineering Systems Division, MIT, for sparking many stimulating ideas on the role of data mining in modeling complex systems. Over two decades ago, they helped us develop a balanced philosophical perspective on the emerging field of data mining.

Lastly, we thank the folks at Wiley for the decade-long successful journey of this book. Steve Quigley at Wiley showed confidence in this book from the beginning and helped us navigate through the publishing process with great speed. Curt Hinrichs’ vision, tips, and encouragement helped bring this book to the starting gate. Sarah Keegan, Mindy Okura-Marszycki, Jon Gurstelle, Kathleen Santoloci, and Katrina Maceda greatly assisted us in pushing ahead and finalizing this Python edition. We are also especially grateful to Amy Hendrickson, who assisted with typesetting and making this book beautiful.

DATA MINING
FOR BUSINESS ANALYTICS

Concepts, Techniques, and Applications in Python

Part I
Preliminaries