Machine Learning

Hands-On for Developers and Technical Professionals

Second Edition

Jason Bell

To all the developers who just wanted to get the code working without reading all the math stuff first.

About the Author

Jason Bell has worked in software development for more than 30 years. Currently he focuses on large-volume data solutions and helping retail and finance customers gain insight from data with machine learning. He is also an active committee member for several international technology conferences.

About the Technical Editor

Jacob Andresen works as a senior software developer based in Copenhagen, Denmark. He has been working as a software developer and consultant in information retrieval systems and web applications since 2002.

Acknowledgments

“Never again!” I think those were my final words after completing the first edition of this book. Five years later, and here we are again. When the call comes, you immediately think, “Well, it can't be hard, can it?”

To the Team

Jim Minatel, Devon Lewis, Janet Wehner, Pete Gaughan, and the rest of the team at Wiley, thank you for giving your blessing to this second edition and putting your faith in me to revise an awful lot of content. Apologies for the spelling mistakes and those colour/color occurrences. Many thanks to Jacob Andresen for giving a technical overview on the content of the book. His enthusiasm for the project was wonderful.

Most Excellent Friends and Collaborators

Dearest friends and acquaintances, thank you: Jennifer Michael, Marie Bentall, Tim Brundle, Stephen Houston, Garrett Murphy, Clare Conway, Tom Spinks, Matt Johnston, Alan Edwards, Colin Mitchell, Simon Hewitt, Mary McKenna, Alan Thorburn, Colin McHale, Dan Lyons, Victoria McCallum, Andrew Bolster, Eoin McFadden, Catherine Muldoon, Amanda Paver, Ben Lorica, Alastair Croll, Mark Madsen, Ellen Friedman, Ted Dunning, Sophia DeMartini, Bruce Durling, Francine Bennett, Michelle Varron, Elise Huard, Antony Woods, John Stephenson, McCraigMcCraig of the Clan McCraig, everyone on the Clojurians Slack Channel, the Strata Data community, Carla Gaggini, Kiki Schirr, Wendy Devolder, Brian O'Neill, Anthony O'Connor, Tom Gray, Deepa Mann-Kler, Alan Hook, Michelle Douglas, Pete Harwood, Jen Samuel, and Colin Masters. There are loads I've forgotten, I know. I'm sorry.

And Finally

To my wife, Wendy, and my daughter, Clarissa, for absolutely everything and encouraging me to do these projects to the best of my nerdy ability. I couldn't have done it without you both.

To the rest of my family, Maggie, Fern, Andrew, Kerry, Ian and Margaret, William and Sylvia, thank you for all the support and kind words. William, if I need any more help, I'll call you.

The Bios That Never Made It…

“He has the boots and jacket that were the envy of many men.”

“A dab hand at late-night YouTube videos of 80s pop stars.”

“Jason Bell learned to play bass guitar on Saturday afternoons while pretending to work in a music shop.”

Thanks to everyone who reads this book. I hope it's helpful in your journey. It's an honor and privilege that you chose to read it. Now I believe it's time for a cup of tea.

Introduction

Well, times have changed since I wrote the first edition of this book. Since 2014, there has been more emphasis on data and what it can do for us, but also on how that power can be used against us. Hardware has gotten better, processing has gotten much faster, and the ability to classify, predict, and decide based on our data is extraordinary. At the same time, we've become much more aware of the risks in how data is used, the biases that can creep in, and the fact that a lot of black-box models don't always get things right.

Still, it's an exciting time to be involved. We still create more data than we can sensibly process. New ideas involving machine learning are being presented daily. The appetite for learning has grown rapidly, too.

Data mining and machine learning have been around for a number of years already. When you look closely, the machine learning algorithms that are being applied aren't any different from what they were years ago; what is new is how they are applied at scale. When you look at the number of organizations that are creating the data, it's really, in my opinion, a minority. Google, Facebook, Twitter, Netflix, and a small handful of others get the majority of mentions in the headlines, with a mixture of algorithmic learning and tools that enable them to scale. So, the real question you should ask is, “How does all this apply to the rest of us?”

Large-scale data with near-instant processing has come to the fore. The emphasis has moved from batch systems like Hadoop to more streaming-based systems like Kafka. I admit there will be times in this book when I look at the Big Data side of machine learning (it's a subject I can't ignore), but it's only a small factor in the overall picture of how to get insight from the available data. It is important to remember that I am talking about tools, and the key is figuring out which tools are right for the job you are trying to complete.

Aims of This Book

This book is about machine learning and not about Big Data. It's about the various techniques used to gain insight from your data. By the end of the book, you will have seen how various methods of machine learning work, and you will also have had some practical explanations on how the code is put together, leaving you with a good idea of how you could apply the right machine learning techniques to your own problems.

There's no right or wrong way to use this book. You can start at the beginning and work your way through, or you can just dip in and out of the parts you need to know at the time you need to know them.

“Hands-On” Means Hands-On

Many books on the subject of machine learning that I've read in the past have been very heavy on theory. That's not a bad thing. If you're looking for in-depth theory with really complex-looking equations, I applaud your rigor. Me? I'm more hands-on with my approach to learning and to projects. My philosophy is quite simple.

  • Start with a question in mind.
  • Find the theory I need to learn.
  • Find lots of examples I can learn from.
  • Put them to work in my own projects.

As a software developer, I like to see lots of examples. As a teacher, I like to get as much hands-on development time as possible but also get the message across to students as simply as possible. There's something about fingers on keys, coding away on your IDE, and getting things to work that's rather appealing, and it's something that I want to convey in the book.

Everyone has their own learning style. I believe this book covers the most common methods, so everybody will benefit.

“What About the Math?”

Like arguing that your favorite football team is better than another or trying to figure out whether Jimmy Page is a better guitarist than Jeff Beck (I prefer Beck), there are some things that will be debated forever and a day. One such debate is how much math you need to know before you can start doing machine learning.

Doing machine learning and learning the theory of machine learning are two very different subjects. To learn the theory, a good grounding in math is required. This book discusses a hands-on approach to machine learning. With the number of machine learning tools available for developers now, the emphasis is not so much on how these tools work but on how you can make these tools work for you. The hard work has been done, and those who did it deserve credit and applause.

“But You Need a PhD!”

No, you don't!

The long-running debate rages on about the level of knowledge you need before you can start doing analysis on data or claim that you are a data scientist. I believe that if you'd like to take a few years completing a degree and then pursuing the likes of a master's degree and then a PhD, you should feel free to go that route. I'm a little more pragmatic about things and like to get reading and start doing.

Academia is great, and with the large number of online courses, papers, websites, and books on the subjects of math, statistics, and data mining, there's enough to keep the most eager of minds occupied. I dip in and out of these resources a lot, and it's definitely a good way to keep up-to-date and investigate what's emerging.

For me, though, there's nothing like getting my hands dirty, grabbing some data, trying out some methods, and looking at the results. If you need to brush up on linear regression theory, then let me reassure you now, there's plenty out there to read, and I'll also cover that in this book.

Lastly, can one person ever be a data scientist? I think it's more likely for a team of people to bring the various skills needed for machine learning into an organization. I talk about this more in Chapter 2.

So, while others in the office are arguing whether to bring some PhD brains in on a project, you can be coding up a decision tree to see whether it's viable.

Over the last few years the job title data scientist has been joined by other titles like data engineer and machine learning engineer. All are valid and all focus on aspects of the data science pipeline. They all have their place.

What Will You Have Learned by the End?

Assuming that you're reading the book from start to finish, you'll learn the common uses for machine learning, different methods of machine learning, and how to apply real-time and batch processing.

There's also nothing wrong with jumping straight to a specific section that you want to learn. The chapters and examples were created in such a way that no chapter depends on another.

The aim is to cover the common machine learning concepts in a practical manner. Using the existing free tools and libraries that are available to you, there's little stopping you from starting to gain insight from the existing data that you have.

Balancing Theory and Hands-on Learning

There are many books on machine learning and data mining available, and finding the balance of theory and practical examples is hard. When planning this book, I stressed the importance of practical and easy-to-use examples, providing step-by-step instructions, so you can see how things are put together.

I'm not saying that the theory is light, because it's not. Understanding what you want to learn or, more importantly, how you want to learn will determine how you read this book.

You can think of the book as split into three distinct sections. The first section covers the question, “What is machine learning?” and concentrates on planning for projects, data acquisition, and cleaning. For those wanting a refresher on the math and stats side of things, I've included a new chapter; it also covers linear regression and standard deviation.

The next section takes a closer look at some of the building-block algorithms used in machine learning projects. The chapters on clustering, decision trees, support vector machines, association rules learning, and neural networks provide both a background to how the algorithms work and code examples for you to work with. It's important to get hands-on early.

Lastly, I focus on the real-world tools used in the enterprise; these are tools like Spark, Kafka, and R. Knowing how these frameworks and tools are put together will give you a grounding to know what to use when.

Source Code for This Book

All the code that is explained in the chapters of the book has been saved on a GitHub repository for you to download and try. For this edition, I've also included the Maven dependency file so you can easily build the project you are working on.

The address for the repository is https://github.com/jasebell/mlbook2ndedition. You can also find it on the Wiley website at www.wiley.com/go/machinelearning2e.

The examples are in Java, Clojure, or R. If you want to extend your knowledge into other languages, then a search around the GitHub site might lead you to some interesting examples.

Code has been separated by chapter; there's a folder in the repository for each of the chapters, and each has its own build file. The data is also within the repository in the data directory and has been split by each chapter.

Using Git

Git is a version control system that is widely used in business and the open source software community. If you are working in teams, it becomes useful because you can create branches of the codebase to work on and then merge the changes afterward.
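If you have never branched and merged before, the following is a minimal sketch of that workflow in a throwaway repository. It is not part of the book's repository: the directory name demo, the file notes.txt, the branch name my-experiment, and the user name and email are all placeholders chosen for illustration.

```shell
#!/bin/sh
set -e

# Work in a temporary directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"
git init -q demo
cd demo

# Identity used only inside this demo repository (placeholder values).
git config user.name "Example Dev"
git config user.email "dev@example.com"

# Make a first commit on the default branch and remember its name,
# which may be "master" or "main" depending on your Git setup.
echo "first line" > notes.txt
git add notes.txt
git commit -q -m "Add notes"
main_branch=$(git symbolic-ref --short HEAD)

# Create a branch, change the file there, and commit the change.
git checkout -q -b my-experiment
echo "second line" >> notes.txt
git commit -q -am "Extend notes on a branch"

# Switch back to the default branch and merge the branch's work in.
git checkout -q "$main_branch"
git merge -q my-experiment

# The file on the default branch now contains both lines.
cat notes.txt
```

In a team setting the merge would normally happen after a review, but the commands are the same: branch, commit, switch back, merge.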

The uses for Git in this book are limited, but you need it for “cloning” the repository of examples if you want to use them.

To clone the examples for this book, use the following commands:

$ mkdir mlbookexamples
$ cd mlbookexamples
$ git clone https://github.com/jasebell/mlbook2ndedition.git

You'll see the progress of the cloning, and when it's finished, you'll be able to change directories to the newly downloaded folder and look at the code samples.