Cover
Introduction
1. Aims of This Book
2. “Hands-On” Means Hands-On
3. “What About the Math?”
4. What Will You Have Learned by the End?
5. Balancing Theory and Hands-on Learning
6. Source Code for This Book
7. Using Git
CHAPTER 1: What Is Machine Learning?
1. History of Machine Learning
2. Algorithm Types for Machine Learning
3. The Human Touch
4. Uses for Machine Learning
5. Languages for Machine Learning
6. Software Used in This Book
7. Data Repositories
8. Summary
CHAPTER 2: Planning for Machine Learning
1. The Machine Learning Cycle
2. It All Starts with a Question
3. I Don't Have Data!
4. One Solution Fits All?
5. Defining the Process
6. Building a Data Team
7. Data Processing
8. Data Storage
9. Data Privacy
10. Data Quality and Cleaning
11. Thinking About Input Data
12. Thinking About Output Data
13. Don't Be Afraid to Experiment
14. Summary
CHAPTER 3: Data Acquisition Techniques
1. Scraping Data
2. Using an API
3. Migrating Data
4. Summary
CHAPTER 4: Statistics, Linear Regression, and Randomness
1. Working with a Basic Dataset
2. Introducing Basic Statistics
3. Using Simple Linear Regression
4. Embracing Randomness
5. Summary
CHAPTER 5: Working with Decision Trees
1. The Basics of Decision Trees
2. Decision Trees in Weka
3. Summary
CHAPTER 6: Clustering
1. What Is Clustering?
2. Where Is Clustering Used?
3. Clustering Models
4. K-Means Clustering with Weka
5. Summary
CHAPTER 7: Association Rules Learning
1. Where Is Association Rules Learning Used?
2. How Association Rules Learning Works
3. Algorithms
4. Mining the Baskets—A Walk-Through
5. Summary
CHAPTER 8: Support Vector Machines
1. What Is a Support Vector Machine?
2. Where Are Support Vector Machines Used?
3. The Basic Classification Principles
4. How Support Vector Machines Approach Classification
5. Using Support Vector Machines in Weka
6. Summary
CHAPTER 9: Artificial Neural Networks
1. What Is a Neural Network?
2. Artificial Neural Network Uses
3. Trusting the Black Box
4. Breaking Down the Artificial Neural Network
5. Data Preparation for Artificial Neural Networks
6. Artificial Neural Networks with Weka
7. Implementing a Neural Network in Java
8. Developing Neural Networks with DeepLearning4J
9. Summary
CHAPTER 10: Machine Learning with Text Documents
1. Preparing Text for Analysis
2. TF/IDF
3. Word2Vec
4. Basic Sentiment Analysis
5. Summary
CHAPTER 11: Machine Learning with Images
1. What Is an Image?
2. Basic Classification with Neural Networks
3. Convolutional Neural Networks
4. Transfer Learning
5. Summary
CHAPTER 12: Machine Learning Streaming with Kafka
1. What You Will Learn in This Chapter
2. From Machine Learning to Machine Learning Engineer
3. From Batch Processing to Streaming Data Processing
4. What Is Kafka?
5. Installing Kafka
6. Topics Management
7. Kafka Tool UI
8. Writing Your Own Producers and Consumers
9. Building a Streaming Machine Learning System
10. Kafka Topics
11. Kafka Connect
12. The REST API Microservice
13. Processing Commands and Events
14. Making Predictions
15. Running the Project
16. Summary
CHAPTER 13: Apache Spark
1. Spark: A Hadoop Replacement?
2. Java, Scala, or Python?
3. Downloading and Installing Spark
4. A Quick Intro to Spark
5. Comparing Hadoop MapReduce to Spark
6. Writing Stand-Alone Programs with Spark
7. Spark SQL
8. Spark Streaming
9. MLlib: The Machine Learning Library
10. Summary
CHAPTER 14: Machine Learning with R
1. Installing R
2. Your First Run
3. Installing R-Studio
4. The R Basics
5. Simple Statistics
6. Simple Linear Regression
7. Basic Sentiment Analysis
8. Apriori Association Rules
9. Accessing R from Java
10. Summary
APPENDIX A: Kafka Quick Start
1. Installing Kafka
2. Starting Zookeeper
3. Starting Kafka
4. Creating Topics
5. Listing Topics
6. Describing a Topic
7. Deleting Topics
8. Running a Console Producer
9. Running a Console Consumer
APPENDIX B: The Twitter API Developer Application Configuration
APPENDIX C: Useful Unix Commands
1. Using Sample Data
2. Showing the Contents: cat, more, and less
3. Filtering Content: grep
4. Sorting Data: sort
5. Finding Unique Occurrences: uniq
6. Showing the Top of a File: head
7. Counting Words: wc
8. Locating Anything: find
9. Combining Commands and Redirecting Output
10. Picking a Text Editor
APPENDIX D: Further Reading
1. Machine Learning
2. Statistics
3. Big Data and Data Science
4. Visualization
5. Making Decisions
6. Datasets
7. Blogs
8. Useful Websites
9. The Tools of the Trade
Index
End User License Agreement

List of Illustrations

Chapter 2
1. Figure 2.1: The machine learning process
Chapter 3
1. Figure 3.1: Wikipedia list of the busiest airports in United Kingdom
2. Figure 3.2: Text file of 2017–2018 data
3. Figure 3.3: Spreadsheet of the busiest airports in the United Kingdom
Chapter 4
1. Figure 4.1: Excel file showing two judges’ scores
2. Figure 4.2: Scatter plot of the two judges’ scores
3. Figure 4.3: Trendline added to the scatter plot
4. Figure 4.4: R² value and equation
5. Figure 4.5: Initial drawing of a square
6. Figure 4.6: Circle within a square
7. Figure 4.7: Random darts within the circle and the square
Chapter 5
1. Figure 5.1: A decision tree
2. Figure 5.2: The Weka GUI Chooser
3. Figure 5.3: The basic Explorer window
4. Figure 5.4: The preprocess pane with data
5. Figure 5.5: Selecting the classifier
6. Figure 5.6: Classifier with output
7. Figure 5.7: J48 visualization
8. Figure 5.8: Evaluation options pane
Chapter 6
1. Figure 6.1: A graph representation of a cluster
2. Figure 6.2: Nodes and edges as clusters
3. Figure 6.3: Euclidean distances
4. Figure 6.4: The elbow method graph
5. Figure 6.5: Loading CSV data into Weka
6. Figure 6.6: The Preprocess window
7. Figure 6.7: Selecting SimpleKMeans
8. Figure 6.8: Changing the SimpleKMeans options
9. Figure 6.9: Visualize window
10. Figure 6.10: Eclipse New Java Project dialog box
11. Figure 6.11: Adding an external JAR
12. Figure 6.12: Creating a new class file
Chapter 7
1. Figure 7.1: The Weka Explorer
2. Figure 7.2: Weka File Explorer
3. Figure 7.3: The Data Preprocess section
4. Figure 7.4: Weka Associate tab
5. Figure 7.5: The Options pane
6. Figure 7.6: The generated results
Chapter 8
1. Figure 8.1: Two objects to classify
2. Figure 8.2: Three objects to classify
3. Figure 8.3: Linear classification with a hyperplane
4. Figure 8.4: Support vector machines max margin hyperplane
5. Figure 8.5: The support vectors on the hyperplane edges
6. Figure 8.6: Objects rarely go where you want them to go.
7. Figure 8.7: GUI Chooser
8. Figure 8.8: Loading the .csv file
9. Figure 8.9: Choosing the LibSVM classifier
10. Figure 8.10: Changing the percentage split
11. Figure 8.11: Classifier Evaluation Options dialog box
12. Figure 8.12: Changing the kernel type
13. Figure 8.13: Creating the new Java project
14. Figure 8.14: Adding the required JAR files
15. Figure 8.15: Creating a new Java class
Chapter 9
1. Figure 9.1: The neuron structure
2. Figure 9.2: A simple perceptron
3. Figure 9.3: Perceptron with two inputs
4. Figure 9.4: Sigmoid function
5. Figure 9.5: AND gate perceptron
6. Figure 9.6: XOR gate network
7. Figure 9.7: Multilayer perceptron with one hidden layer
8. Figure 9.8: Weka Explorer
9. Figure 9.9: Weka File dialog box
10. Figure 9.10: Changing the classifier
11. Figure 9.11: Options dialog box for MultilayerPerceptron
12. Figure 9.12: Neural network GUI window
13. Figure 9.13: Eclipse New Project dialog box
14. Figure 9.14: Adding external JARs
15. Figure 9.15: Creating a new class file
Chapter 10
1. Figure 10.1: Apache Tika GUI
Chapter 11
1. Figure 11.1: An 8 x 8–pixel image
2. Figure 11.2: Numeric image of an 8 x 8–pixel
3. Figure 11.3: 24-bit image
4. Figure 11.4: 5 x 5–pixel image
5. Figure 11.5: Filter matrix
6. Figure 11.6: CNN output values
7. Figure 11.7: Filter values set at vertical and horizontal
Chapter 12
1. Figure 12.1: Topics written sequentially in Kafka
2. Figure 12.2: Relationship of producers to the Kafka cluster and consumers
3. Figure 12.3: Topics split into partitions
4. Figure 12.4: Multibroker cluster
5. Figure 12.5: Control center
6. Figure 12.6: Kafka Tool
7. Figure 12.7: System plan
8. Figure 12.8: Calculation for hidden nodes
9. Figure 12.9: Swagger interface to test the API
10. Figure 12.10: Flow of a message
Chapter 13
1. Figure 13.1: Spark web console
Chapter 14
1. Figure 14.1: The R shell
2. Figure 14.2: R's help system
3. Figure 14.3: R-Studio
4. Figure 14.4: Horizontal bar chart
5. Figure 14.5: Vertical bar chart
6. Figure 14.6: Simple pie chart
7. Figure 14.7: Simple dot plot
8. Figure 14.8: Simple line chart
9. Figure 14.9: Seconds/dollar plot
10. Figure 14.10: Transaction frequencies
11. Figure 14.11: Adding the JRI.jar file to the project
12. Figure 14.12: Adding the JRI library path
13. Figure 14.13: Adding the environment R_HOME path
Appendix B
1. Figure B.1: Creating a new Twitter application page
2. Figure B.2: Completing the application detail page
3. Figure B.3: OAuth details

Published simultaneously in Canada

ISBN: 978-1-119-64214-5

ISBN: 978-1-119-64225-1 (ebk)

ISBN: 978-1-119-64219-0 (ebk)

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Web site is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or website may provide or recommendations it may make. Further, readers should be aware that Internet websites listed in this work may have changed or disappeared between when this work was written and when it is read.

For general information on our other products and services please contact our Customer Care Department within the United States at (877) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.

Library of Congress Control Number: 2019956691

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

Acknowledgments

“Never again!” I think those were my final words after completing the first edition of this book. Five years later, and here we are again. When the call comes, you immediately think, “Well, it can't be hard, can it?”

To the Team

Jim Minatel, Devon Lewis, Janet Wehner, Pete Gaughan, and the rest of the team at Wiley, thank you for giving your blessing to this second edition and putting your faith in me to revise an awful lot of content. Apologies for the spelling mistakes and those colour/color occurrences. Many thanks to Jacob Andresen for giving a technical overview on the content of the book. His enthusiasm for the project was wonderful.

Most Excellent Friends and Collaborators

Dearest friends and acquaintances, thank you: Jennifer Michael, Marie Bentall, Tim Brundle, Stephen Houston, Garrett Murphy, Clare Conway, Tom Spinks, Matt Johnston, Alan Edwards, Colin Mitchell, Simon Hewitt, Mary McKenna, Alan Thorburn, Colin McHale, Dan Lyons, Victoria McCallum, Andrew Bolster, Eoin McFadden, Catherine Muldoon, Amanda Paver, Ben Lorica, Alastair Croll, Mark Madsen, Ellen Friedman, Ted Dunning, Sophia DeMartini, Bruce Durling, Francine Bennett, Michelle Varron, Elise Huard, Antony Woods, John Stephenson, McCraigMcCraig of the Clan McCraig, everyone on the Clojurians Slack Channel, the Strata Data community, Carla Gaggini, Kiki Schirr, Wendy Devolder, Brian O'Neill, Anthony O'Connor, Tom Gray, Deepa Mann-Kler, Alan Hook, Michelle Douglas, Pete Harwood, Jen Samuel, and Colin Masters. There are loads I've forgotten, I know. I'm sorry.

And Finally

To my wife, Wendy, and my daughter, Clarissa, for absolutely everything and encouraging me to do these projects to the best of my nerdy ability. I couldn't have done it without you both.

To the rest of my family, Maggie, Fern, Andrew, Kerry, Ian and Margaret, William and Sylvia, thank you for all the support and kind words. William, if I need any more help, I'll call you.

The Bios That Never Made It…

“He has the boots and jacket that were the envy of many men.”

“A dab hand at late-night YouTube videos of 80s pop stars.”

“Jason Bell learned to play bass guitar on Saturday afternoons while pretending to work in a music shop.”

Thanks to everyone who reads this book. I hope it's helpful in your journey. It's an honor and privilege that you chose to read it. Now I believe it's time for a cup of tea.

Introduction

Well, times have changed since I wrote the first edition of this book. Since 2014, there has been more emphasis on data and what it can do for us, but also on how that power can be used against us. Hardware has gotten better, processing has gotten much faster, and the ability to classify, predict, and decide based on our data is extraordinary. At the same time, we've become much more aware of the risks in how data is used, the biases that can creep in, and the fact that many black-box models don't always get things right.

Still, it's an exciting time to be involved. We still create more data than we can sensibly process. New ideas involving machine learning are being presented daily. The appetite for learning has grown rapidly, too.

Data mining and machine learning have been around for a number of years already. When you look closely, the machine learning algorithms that are being applied aren't any different from what they were years ago; what is new is how they are applied at scale. When you look at the number of organizations that are creating the data, it's really, in my opinion, a minority. Google, Facebook, Twitter, Netflix, and a small handful of others are the ones getting the majority of mentions in the headlines with a mixture of algorithmic learning and tools that enable them to scale. So, the real question you should ask is, “How does all this apply to the rest of us?”

Large-scale data with near-instant processing has come to the fore. The emphasis has moved from batch systems like Hadoop to more streaming-based systems like Kafka. I admit there will be times in this book when I look at the Big Data side of machine learning (it's a subject I can't ignore), but it's only a small factor in the overall picture of how to get insight from the available data. It is important to remember that I am talking about tools, and the key is figuring out which tools are right for the job you are trying to complete.

Aims of This Book

This book is about machine learning and not about Big Data. It's about the various techniques used to gain insight from your data. By the end of the book, you will have seen how various methods of machine learning work, and you will also have had some practical explanations on how the code is put together, leaving you with a good idea of how you could apply the right machine learning techniques to your own problems.

There's no right or wrong way to use this book. You can start at the beginning and work your way through, or you can just dip in and out of the parts you need to know at the time you need to know them.

“Hands-On” Means Hands-On

Many books on the subject of machine learning that I've read in the past have been very heavy on theory. That's not a bad thing. If you're looking for in-depth theory with really complex-looking equations, I applaud your rigor. Me? I'm more hands-on with my approach to learning and to projects. My philosophy is quite simple.

Start with a question in mind.
Find the theory I need to learn.
Find lots of examples I can learn from.
Put them to work in my own projects.

As a software developer, I like to see lots of examples. As a teacher, I like to get as much hands-on development time as possible but also get the message across to students as simply as possible. There's something about fingers on keys, coding away on your IDE, and getting things to work that's rather appealing, and it's something that I want to convey in the book.

Everyone has their own learning style. I believe this book covers the most common methods, so everybody will benefit.

“What About the Math?”

Like arguing that your favorite football team is better than another or trying to figure out whether Jimmy Page is a better guitarist than Jeff Beck (I prefer Beck), there are some things that will be debated forever and a day. One such debate is how much math you need to know before you can start doing machine learning.

Doing machine learning and learning the theory of machine learning are two very different subjects. To learn the theory, a good grounding in math is required. This book discusses a hands-on approach to machine learning. With the number of machine learning tools available for developers now, the emphasis is not so much on how these tools work but on how you can make these tools work for you. The hard work has been done, and those who did it deserve credit and applause.

“But You Need a PhD!”

No, you don't!

The long-running debate rages on about the level of knowledge you need before you can start doing analysis on data or claim that you are a data scientist. I believe that if you'd like to take a few years completing a degree and then pursuing the likes of a master's degree and then a PhD, you should feel free to go that route. I'm a little more pragmatic about things and like to get reading and start doing.

Academia is great; and with the large number of online courses, papers, websites, and books on the subject of math, statistics, and data mining, there's enough to keep the most eager of minds occupied. I dip in and out of these resources a lot, and it's definitely a good way to keep up-to-date and investigate what's emerging.

For me, though, there's nothing like getting my hands dirty, grabbing some data, trying out some methods, and looking at the results. If you need to brush up on linear regression theory, then let me reassure you now, there's plenty out there to read, and I'll also cover that in this book.

Lastly, can one person ever be a data scientist? I think it's more likely for a team of people to bring the various skills needed for machine learning into an organization. I talk about this more in Chapter 2.

So, while others in the office are arguing whether to bring some PhD brains in on a project, you can be coding up a decision tree to see whether it's viable.

Over the last few years the job title data scientist has been joined by other titles like data engineer and machine learning engineer. All are valid and all focus on aspects of the data science pipeline. They all have their place.

What Will You Have Learned by the End?

Assuming that you're reading the book from start to finish, you'll learn the common uses for machine learning, different methods of machine learning, and how to apply real-time and batch processing.

There's also nothing wrong with jumping straight to a specific section that you want to learn. The chapters and examples were created in such a way that no chapter depends on another.

The aim is to cover the common machine learning concepts in a practical manner. Using the existing free tools and libraries that are available to you, there's little stopping you from starting to gain insight from the existing data that you have.

Balancing Theory and Hands-on Learning

There are many books on machine learning and data mining available, and finding the balance of theory and practical examples is hard. When planning this book, I stressed the importance of practical and easy-to-use examples, providing step-by-step instructions, so you can see how things are put together.

I'm not saying that the theory is light, because it's not. Understanding what you want to learn or, more importantly, how you want to learn will determine how you read this book.

You can think of the book as split into three distinct sections. The first section covers the question, “What is machine learning?” and concentrates on planning for projects, data acquisition, and cleaning. For those wanting a refresher on the math and stats side of things, I've included a new chapter; it also covers linear regression and standard deviation.

The next section takes a closer look at some of the building-block algorithms used in machine learning projects. The chapters on clustering, decision trees, support vector machines, association rules learning, and neural networks provide both a background to how they work and code examples for you to work with. It's important to get the hands-on nature early on.

Lastly, I focus on the real-world tools used in enterprise; these are tools like Spark, Kafka, and R. Knowing how these frameworks and tools are put together will give you a grounding to know what to use when.

Source Code for This Book

All the code that is explained in the chapters of the book has been saved in a GitHub repository for you to download and try. For this edition, I've also included the Maven dependency file so you can easily build the project you are working on.

The address for the repository is https://github.com/jasebell/mlbook2ndedition. You can also find it on the Wiley website at www.wiley.com/go/machinelearning2e.

The examples are in either Java, Clojure, or R. If you want to extend your knowledge into other languages, then a search around the GitHub site might lead you to some interesting examples.

Code has been separated by chapter; there's a folder in the repository for each of the chapters, and each has its own build file. The data is also within the repository in the data directory and has been split by chapter.

Using Git

Git is a version control system that is widely used in business and the open source software community. If you are working in teams, it becomes useful because you can create branches of the codebase to work on and then merge the changes afterward.
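As a rough illustration of that branch-and-merge workflow, here is a minimal sketch that runs entirely in a throwaway directory. The repository name (demo), the file name (notes.txt), the branch name (experiment), and the user details are all placeholders invented for this example; none of them come from the book's repository.

```shell
#!/bin/sh
set -e

# Work in a temporary directory so nothing touches a real project.
workdir=$(mktemp -d)
cd "$workdir"

# Create a new, empty repository (the name "demo" is a placeholder).
git init -q demo
cd demo

# Set an identity for this repository only (placeholder values).
git config user.name "Example Reader"
git config user.email "reader@example.com"

# Make a first commit on the default branch.
echo "first line" > notes.txt
git add notes.txt
git commit -q -m "Initial commit"

# Remember the default branch name; it may be master or main.
base=$(git rev-parse --abbrev-ref HEAD)

# Create a branch, switch to it, and commit a change there.
git checkout -q -b experiment
echo "change made on the branch" >> notes.txt
git commit -q -am "Try an idea on a branch"

# Switch back to the default branch and merge the work in.
git checkout -q "$base"
git merge -q experiment

# Show the resulting history, one commit per line.
git log --oneline
```

Because the default branch had no new commits of its own, the merge here is a simple fast-forward. In a team setting the default branch will usually have moved on, and Git will create a merge commit or ask you to resolve conflicts.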

The uses for Git in this book are limited, but you need it for “cloning” the repository of examples if you want to use them.

To clone the examples for this book, use the following commands:

$ mkdir mlbookexamples
$ cd mlbookexamples
$ git clone https://github.com/jasebell/mlbook2ndedition.git

You'll see the progress of the cloning, and when it's finished, you'll be able to change directories to the newly downloaded folder and look at the code samples.

Machine Learning

Hands-On for Developers and Technical Professionals

About the Author

About the Technical Editor