Cover
Introduction
1. WHAT DOES THIS BOOK COVER?
2. READER SUPPORT FOR THIS BOOK
PART I: Getting Started
1. Chapter 1: What Is Machine Learning?
  1. DISCOVERING KNOWLEDGE IN DATA
  2. MACHINE LEARNING TECHNIQUES
  3. MODEL SELECTION
  4. MODEL EVALUATION
  5. EXERCISES
2. Chapter 2: Introduction to R and RStudio
  1. WELCOME TO R
  2. R AND RSTUDIO COMPONENTS
  3. WRITING AND RUNNING AN R SCRIPT
  4. DATA TYPES IN R
  5. EXERCISES
3. Chapter 3: Managing Data
  1. THE TIDYVERSE
  2. DATA COLLECTION
  3. DATA EXPLORATION
  4. DATA PREPARATION
  5. EXERCISES
PART II: Regression
1. Chapter 4: Linear Regression
  1. BICYCLE RENTALS AND REGRESSION
  2. RELATIONSHIPS BETWEEN VARIABLES
  3. SIMPLE LINEAR REGRESSION
  4. MULTIPLE LINEAR REGRESSION
  5. CASE STUDY: PREDICTING BLOOD PRESSURE
  6. EXERCISES
2. Chapter 5: Logistic Regression
  1. PROSPECTING FOR POTENTIAL DONORS
  2. CLASSIFICATION
  3. LOGISTIC REGRESSION
  4. CASE STUDY: INCOME PREDICTION
  5. EXERCISES
PART III: Classification
1. Chapter 6: k-Nearest Neighbors
  1. DETECTING HEART DISEASE
  2. k-NEAREST NEIGHBORS
  3. CASE STUDY: REVISITING THE DONOR DATASET
  4. EXERCISES
2. Chapter 7: Naïve Bayes
  1. CLASSIFYING SPAM EMAIL
  2. NAÏVE BAYES
  3. CASE STUDY: REVISITING THE HEART DISEASE DETECTION PROBLEM
  4. EXERCISES
3. Chapter 8: Decision Trees
  1. PREDICTING BUILD PERMIT DECISIONS
  2. DECISION TREES
  3. CASE STUDY: REVISITING THE INCOME PREDICTION PROBLEM
  4. EXERCISES
PART IV: Evaluating and Improving Performance
1. Chapter 9: Evaluating Performance
  1. ESTIMATING FUTURE PERFORMANCE
  2. BEYOND PREDICTIVE ACCURACY
  3. VISUALIZING MODEL PERFORMANCE
  4. EXERCISES
2. Chapter 10: Improving Performance
  1. PARAMETER TUNING
  2. ENSEMBLE METHODS
  3. EXERCISES
PART V: Unsupervised Learning
1. Chapter 11: Discovering Patterns with Association Rules
  1. MARKET BASKET ANALYSIS
  2. ASSOCIATION RULES
  3. DISCOVERING ASSOCIATION RULES
  4. CASE STUDY: IDENTIFYING GROCERY PURCHASE PATTERNS
  5. EXERCISES
  6. NOTES
2. Chapter 12: Grouping Data with Clustering
  1. CLUSTERING
  2. k-MEANS CLUSTERING
  3. SEGMENTING COLLEGES WITH k-MEANS CLUSTERING
  4. CASE STUDY: SEGMENTING SHOPPING MALL CUSTOMERS
  5. EXERCISES
  6. NOTE
Index
End User License Agreement

List of Illustrations

Chapter 1
1. Figure 1.1 Algorithm for crossing the street
2. Figure 1.2 The relationship between artificial intelligence, machine learnin...
3. Figure 1.3 Generic supervised learning model
4. Figure 1.4 Making predictions with a supervised learning model
5. Figure 1.5 Using machine learning to classify car dealership customers
6. Figure 1.6 Dataset of past customer loan repayment behavior
7. Figure 1.7 Applying the machine learning model
8. Figure 1.8 Strategically placing items in a grocery store based on unsupervi...
9. Figure 1.9 Error types
10. Figure 1.10 Residual error
11. Figure 1.11 The bias/variance trade-off
12. Figure 1.12 Underfitting, overfitting, and optimal fit
13. Figure 1.13 Holdout method
14. Figure 1.14 Cross-validation method
Chapter 2
1. Figure 2.1 Growth of the number of CRAN packages over time
2. Figure 2.2 Comprehensive R Archive Network (CRAN) mirror site
3. Figure 2.3 RStudio Desktop offers an IDE for Windows, Mac, and Linux systems...
4. Figure 2.4 RStudio Server provides a web-based IDE for collaborative use.
5. Figure 2.5 RStudio Desktop without a script open
6. Figure 2.6 RStudio Desktop with the console pane highlighted
7. Figure 2.7 Console pane executing several simple R commands
8. Figure 2.8 Accessing the Mac terminal in RStudio
9. Figure 2.9 Chick weight script inside the RStudio IDE
10. Figure 2.10 Graph produced by the chick weight script
11. Figure 2.11 Chick weight script inside a text editor
12. Figure 2.12 RStudio environment pane populated with data
13. Figure 2.13 RStudio History pane showing previously executed commands
14. Figure 2.14 The Files tab in RStudio allows you to interact with your device...
15. Figure 2.15 The Packages tab in RStudio allows you to view and manage the pa...
16. Figure 2.16 The Help tab in RStudio displaying documentation for the insta...
17. Figure 2.17 Hadley Wickham on the distinction between packages and libraries...
18. Figure 2.18 RStudio displaying the programming vignette from the dplyr p...
19. Figure 2.19 The Run button in RStudio runs the current section of code.
20. Figure 2.20 The Source button in RStudio runs the entire script.
Chapter 3
1. Figure 3.1 Simple spreadsheet containing data in tabular form
2. Figure 3.2 CSV file containing the same data as the spreadsheet in Figure 3....
3. Figure 3.3 TSV file containing the same data as the spreadsheet in Figure 3....
4. Figure 3.4 Pipe-delimited file containing the same data as the spreadsheet i...
5. Figure 3.5 Sample dataset illustrating the instances and features (independe...
6. Figure 3.6 Box plot of CO₂ emissions by vehicle class
7. Figure 3.7 Scatterplot of CO₂ emissions versus city gas mileage
8. Figure 3.8 Histogram of CO₂ emissions
9. Figure 3.9 Stacked bar chart of drive type composition by year
10. Figure 3.10 Illustration of the smoothing by clustering approach, on 14 inst...
11. Figure 3.11 Illustration of the smoothing by regression approach on 14 insta...
Chapter 4
1. Figure 4.1 Scatterplots illustrating the relationship between the dependent ...
2. Figure 4.2 Estimated regression line and actual values for a sample (n=20) o...
3. Figure 4.3 For our regression line, the differences between each actual valu...
4. Figure 4.4 (a) Residual histogram showing normality of residuals, (b) residu...
5. Figure 4.5 Residual versus fitted value plots illustrating heteroscedasticit...
6. Figure 4.6 Cook's Distance chart showing the influential points in the bikes...
7. Figure 4.7 Linear regression fit for each of the predictor variables (humidi...
8. Figure 4.8 The systolic blood pressure data for this population appears to b...
9. Figure 4.9 Distributions of dependent variables in the health dataset
10. Figure 4.10 Histogram of residuals produced using the ols_plot_resid_hist(...
11. Figure 4.11 Scatterplot of residuals produced using the ols_plot_resid_fit()
12. Figure 4.12 Cook's distance chart for the health dataset produced using the
Chapter 5
1. Figure 5.1 Fitted line for probability of respondedMailing using a straight...
2. Figure 5.2 Histogram showing the distribution of values for the mailOrderPur...
3. Figure 5.3 Histogram showing the distribution of values for the mailOrderPur...
4. Figure 5.4 Correlation matrix of the numeric features in the donors dataset...
Chapter 6
1. Figure 6.1 Scatterplot of age versus cholesterol levels for a sampling of 20...
2. Figure 6.2 Scatterplot of age versus cholesterol levels for a sampling of 20...
3. Figure 6.3 The impact of a large value for k (a) and a small value for k (b)...
4. Figure 6.4 The predictive accuracy of our model for values of k-nearest neig...
Chapter 8
1. Figure 8.1 Structure of a decision tree
2. Figure 8.2 Scatterplot of annual income versus loan amount for 30 commercial...
3. Figure 8.3 Bank customers partitioned on loan amount of less than or more th...
4. Figure 8.4 Bank customers partitioned on loan amount of less than or more th...
5. Figure 8.5 Decision tree of bank customers based on the loan amount and annu...
6. Figure 8.6 Candidate features for splitting the partition of customers who b...
7. Figure 8.7 Visualization of a decision tree model using the rpart.plot() fun...
8. Figure 8.8 Classification tree to predict customer income level
Chapter 9
1. Figure 9.1 Model build and evaluation process using all of the observed data...
2. Figure 9.2 Model build and evaluation process using subsets of the observed ...
3. Figure 9.3 Model build and evaluation process using the training and validat...
4. Figure 9.4 The k-fold cross-validation approach with k=5 (5-fold cross valid...
5. Figure 9.5 The leave-one-out cross-validation approach (LOOCV). A set of n e...
6. Figure 9.6 The random cross-validation approach. The training and validation...
7. Figure 9.7 The bootstrap sampling approach. The training set is created by r...
8. Figure 9.8 A sample confusion matrix showing actual versus predicted values...
9. Figure 9.9 Spam filter confusion matrix
10. Figure 9.10 (a) Precision as a measure of model performance based on (b) the...
11. Figure 9.11 (a) Recall as a measure of model performance based on (b) the sp...
12. Figure 9.12 (a) Sensitivity as a measure of model performance based on (b) t...
13. Figure 9.13 (a) Specificity as a measure of model performance based on (b) t...
14. Figure 9.14 The ROC curve for a sample classifier
15. Figure 9.15 The ROC curve for a sample classifier, a perfect classifier, and...
16. Figure 9.16 ROC curve for the spam filter example generated with R
17. Figure 9.17 ROC curve for two classifiers with similar AUC values
18. Figure 9.18 ROC curve for three different classifiers
Chapter 10
1. Figure 10.1 The grid search process showing eight models with different para...
2. Figure 10.2 Tunable parameters supported by the caret package for the rpart ...
3. Figure 10.3 The bagging ensemble features independently trained homogenous m...
4. Figure 10.4 The boosting ensemble features a linear sequence of homogenous m...
5. Figure 10.5 The stacking ensemble features independently trained heterogeneo...
Chapter 11
1. Figure 11.1 Sample market basket dataset showing five different transactions...
2. Figure 11.2 An association rule describing that whenever both beer and milk ...
3. Figure 11.3 All possible itemsets (itemset lattice) derived from items A, B,...
Chapter 12
1. Figure 12.1 Simulated dataset showing previously unlabeled items (a). The sa...
2. Figure 12.2 Hierarchical versus partitional clustering
3. Figure 12.3 Overlapping versus exclusive clustering
4. Figure 12.4 Complete versus partial clustering
5. Figure 12.5 The initial centroids are randomly chosen (a), and every item is...
6. Figure 12.6 New cluster centers are chosen (a); then each item is re-assigne...
7. Figure 12.7 During the next iteration, new cluster centers are chosen again ...
8. Figure 12.8 The change in cluster center (a) did not result in change in clu...
9. Figure 12.9 Visualization of the three clusters created for Colleges in Mary...
10. Figure 12.10 The elbow method
11. Figure 12.11 Determining the appropriate number of clusters using the elbow ...
12. Figure 12.12 Determining the appropriate number of clusters using the averag...
13. Figure 12.13 Determining the appropriate number of clusters using the gap st...
14. Figure 12.14 Visualization of the colleges in Maryland segmented into four c...
15. Figure 12.15 All three statistical methods for determining the optimal numbe...
16. Figure 12.16 Shopping mall customers segmented into six clusters based on th...

Published simultaneously in Canada

ISBN: 978-1-119-59151-1
ISBN: 978-1-119-59153-5 (ebk)
ISBN: 978-1-119-59157-3 (ebk)

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Web site is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or website may provide or recommendations it may make. Further, readers should be aware that Internet websites listed in this work may have changed or disappeared between when this work was written and when it is read.

For general information on our other products and services please contact our Customer Care Department within the United States at (877) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.

Library of Congress Control Number: 2020933607

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

Introduction

Machine learning is changing the world. Every organization, large and small, seeks to extract knowledge from the massive amounts of information that it stores and processes on a daily basis. The tantalizing desire to predict the future drives the work of business analysts and data scientists in fields ranging from marketing to healthcare. Our goal with this book is to make the tools of analytics approachable for a broad audience.

The R programming language is a purpose-specific language designed to facilitate statistical analysis and machine learning. We choose it for this book not only due to its strong popularity in the field but also because of its intuitive nature, particularly for individuals approaching it as their first programming language.

There are many books on the market that cover practical applications of machine learning, designed for businesspeople and onlookers. Likewise, there are many deeply technical resources that dive into the mathematics and computer science of machine learning. In this book, we strive to bridge these two worlds. We attempt to bring the reader an intuitive introduction to machine learning with an eye on its practical applications in today's world. At the same time, we don't shy away from code. As we do in our undergraduate and graduate courses, we seek to make the R programming language accessible to everyone. Our hope is that you will read this book with your laptop open next to you, following along with our examples and trying your hand at the exercises.

Best of luck as you begin your machine learning adventure!

WHAT DOES THIS BOOK COVER?

This book provides an introduction to machine learning using the R programming language.

Chapter 1: What Is Machine Learning? This chapter introduces the world of machine learning and describes how machine learning allows the discovery of knowledge in data. In this chapter, we explain the differences between unsupervised learning, supervised learning, and reinforcement learning. We describe the differences between classification and regression problems and explain how to measure the effectiveness of machine learning algorithms.
Chapter 2: Introduction to R and RStudio. In this chapter, we introduce the R programming language and the toolset that we will be using throughout the rest of the book. We approach R from the beginner's mind-set, explain the use of the RStudio integrated development environment, and walk readers through the creation and execution of their first R scripts. We also explain the use of packages to redistribute R code and the use of different data types in R.
Chapter 3: Managing Data. This chapter introduces readers to the concepts of data management and the use of R to collect and manage data. We introduce the tidyverse, a collection of R packages designed to facilitate the analytics process, and we describe different approaches to describing and visualizing data in R. We also cover how to clean, transform, and reduce data to prepare it for machine learning.
Chapter 4: Linear Regression. In this chapter, we dive into the world of supervised machine learning as we explore linear regression. We explain the underlying statistical principles behind regression and demonstrate how to fit simple and complex regression models in R. We also explain how to evaluate, interpret, and apply the results of regression models.
Chapter 5: Logistic Regression. While linear regression is suitable for problems that require the prediction of numeric values, it is not well-suited to categorical predictions. In this chapter, we describe logistic regression, a categorical prediction technique. We discuss the use of generalized linear models and describe how to build logistic regression models in R. We also explain how to evaluate, interpret, and improve upon the results of a logistic regression model.
Chapter 6: k-Nearest Neighbors. The k-nearest neighbors technique allows us to predict the classification of a data point based on the classifications of other, similar data points. In this chapter, we describe how the k-NN process works and demonstrate how to build a k-NN model in R. We also show how to apply that model, making predictions about the classifications of new data points.
Chapter 7: Naïve Bayes. The naïve Bayes approach to classification uses a table of probabilities to predict the likelihood that an instance belongs to a particular class. In this chapter, we discuss the concepts of joint and conditional probability and describe how the Bayes classification approach functions. We demonstrate building a naïve Bayes classifier in R and use it to make predictions about previously unseen data.
Chapter 8: Decision Trees. Decision trees are a popular modeling technique because they produce intuitive results. In this chapter, we describe the creation and interpretation of decision tree models. We also explain the process of growing a tree in R and using pruning to increase the generalizability of that model.
Chapter 9: Evaluating Performance. No modeling technique is perfect. Each has its own strengths and weaknesses and brings different predictive power to different types of problems. In this chapter, we discuss the process of evaluating model performance. We introduce resampling techniques and explain how they can be used to estimate the future performance of a model. We also demonstrate how to visualize and evaluate model performance in R.
Chapter 10: Improving Performance. Once we have tools to evaluate the performance of a model, we can then apply them to help improve model performance. In this chapter, we look at techniques for tuning machine learning models. We also demonstrate how we can enhance our predictive power by simultaneously harnessing the predictive capability of multiple models.
Chapter 11: Discovering Patterns with Association Rules. Association rules help us discover patterns that exist within a dataset. In this chapter, we introduce the association rules approach and demonstrate how to generate association rules from a dataset in R. We also explain ways to evaluate and quantify the strength of association rules.
Chapter 12: Grouping Data with Clustering. Clustering is an unsupervised learning technique that groups items based on their similarity to each other. In this chapter, we explain the way that the k-means clustering algorithm segments data and demonstrate the use of k-means clustering in R.

READER SUPPORT FOR THIS BOOK

In order to make the most of this book, we encourage you to make use of the student and instructor materials made available on the companion site. We also encourage you to provide us with meaningful feedback on ways in which we could improve the book.

Companion Download Files

As you work through the examples in this book, you may choose either to type in all the code manually or to use the source code files that accompany the book. If you choose to follow along with the examples, you will also want to use the same datasets we use throughout the book. All the source code and datasets used in this book are available for download from www.wiley.com/go/pmlr.

How to Contact the Publisher

If you believe you've found a mistake in this book, please bring it to our attention. At John Wiley & Sons, we understand how important it is to provide our customers with accurate content, but even with our best efforts an error may occur.

To submit possible errata, please email them to our customer service team at wileysupport@wiley.com with the subject line “Possible Book Errata Submission.”

PRACTICAL MACHINE LEARNING IN R

PART I
Getting Started