Smarter Data Science by Neal Fishman, Cole Stryker, Grady Booch

Praise For This Book

The authors have obviously explored the paths toward an efficient information architecture. There is value in learning from their experience. If you have responsibility for or influence over how your organization uses artificial intelligence you will find Smarter Data Science an invaluable read. It is noteworthy that the book is written with a sense of scope that lends to its credibility. So much written about AI technologies today seems to assume a technical vacuum. We are not all working in startups! We have legacy technology that needs to be considered. The authors have created an excellent resource that acknowledges that enterprise context is a nuanced and important problem. The ideas are presented in a logical and clear format that is suitable to the technologist as well as the businessperson.

Christopher Smith, Chief Knowledge Management and Innovation Officer, Sullivan & Cromwell, LLC

It has always been a pleasure to learn from Neal. The stories and examples that urge every business to stay "relevant" served to provide my own source of motivation. The concepts presented in this book helped to resolve issues that I have been having to address. This book teaches almost all aspects of the data industry. The experiences, patterns, and anti-patterns are thoroughly explained. This work provides benefit to a variety of roles, including architects, developers, product owners, and business executives. For organizations exploring AI, this book is the cornerstone to becoming successful.

Harry Xuegang Huang Ph.D., External Consultant, A.P. Moller – Maersk (Denmark)

This is by far one of the best and most refreshing books on AI and data science that I have come across. The authors seek and speak the truth and they penetrate into the core of the challenge most organizations face in finding value in their data: moving focus away from a tendency to connect the winning dots by ‘magical’ technologies and overly simplified methods. The book is laid out in a well-considered and mature approach that is grounded in deliberation, pragmatism, and respect for information. By following the authors' advice, you will unlock true and long-term value and avoid the many pitfalls that fashionistas and false prophets have come to dominate the narrative in AI.

Jan Gravesen, M.Sc., IBM Distinguished Engineer, Director and Chief Technology Officer, IBM

Most of the books on data analytics and data science focus on tools and techniques of the discipline and do not provide the reader with a complete framework to plan and implement projects that solve business problems and foster competitive advantage. Just because machine learning and new methodologies learn from data and do not require a preconceived model for analysis does not eliminate the need for a robust information management program and required processes. In Smarter Data Science, the authors present a holistic model that emphasizes how critical data and data management are in implementing successful value-driven data analytics and AI solutions. The book presents an elegant and novel approach to data management and explores its various layers and dimensions (from data creation/ownership and governance to quality and trust) as a key component of a well-integrated methodology for value-adding data sciences and AI. The book covers the components of an agile approach to data management and information architecture that fosters business innovation and can adapt to ever changing requirements and priorities. The many examples of recent data challenges facing diverse businesses make the book extremely readable and relevant for practical applications. This is an excellent book for both data officers and data scientists to gain deep insights into the fundamental relationship between data management, analytics, machine learning, and AI.

Ali Farahani, Ph.D., Former Chief Data Officer, County of Los Angeles; Adjunct Associate Professor, USC

There are many different approaches to gaining insights with data given the new advances in technology today. This book encompasses more than the technology that makes AI and machine learning possible, but truly depicts the process and foundation needed to prepare that data to make AI consumable and actionable. I thoroughly enjoyed the section on data governance and the importance of accessible, accurate, curated, and organized data for any sort of analytics consumption. The significance and differences in zones and preparation of data also has some fantastic points that should be highly considered in any sort of analytics project. The authors' ability to describe best practices from a true journey of data within an organization in relation to business needs and information outcomes is spot on. I would highly recommend this book to anyone learning, playing, or working in the wonderful space of Data & AI.

Phil Black, VP of Client Services for Data and AI, TechD

The authors have pieced together data governance, data architecture, data topologies, and data science in a perfect way. Their observations and approach have paved the way towards achieving a flexible and sustainable environment for advanced analytics. I am adopting these techniques in building my own analytics platform for our company.

Svetlana Grigoryeva, Manager Data Services and AI, Shearman and Sterling

This book is a delight to read and provides many thought-provoking ideas. This book is a great resource for data scientists, and everyone who is involved with large scale, enterprise-wide AI initiatives.

Simon Seow, Managing Director, Info Spec Sdn Bhd (Malaysia)

Having worked in IT as a Vice President at MasterCard and as a Global Director at GM, I learned long ago about the importance of finding and listening to the best people. Here, the authors have brought a unique and novel voice that resonates with verve about how to be successful with data science at an enterprise scale. With the explosive growth of big data, computer power, cheap sensor technology, and the awe-inspiring breakthroughs with AI, Smarter Data Science also instills in us that without a solid information architecture, we may fall short in our work with AI.

Glen Birrell, Executive IT Consultant

In the 21st century the ability to use metadata to empower cross-industry ecosystems and exploit a hierarchy of AI algorithms will be essential to maximize stakeholder value. Today's data science processes and systems simply don't offer enough speed, flexibility, quality or context to enable that. Smarter Data Science is a very useful book as it provides concrete steps towards wisdom within those intelligent enterprises.

Richard Hopkins, President, Academy of Technology, IBM (UK)

A must read for everyone who curates, manages, or makes decisions on data. Lifts a lot of the mystery and magical thinking out of “Data Science” to explain why we're underachieving on the promise of AI. Full of practical ideas for improving the practice of information architecture for modern analytical environments using AI or ML. Highly recommended.

Linda Nadeau, Information Architect, Metaphor Consulting LLC

In this book, the authors “unpack” the meaning of data as a natural resource for the modern corporation. Following on Neal's previous book that explored the role of data in enterprise transformation, the authors construct and lead the reader through a holistic approach to drive business value with data science. This book examines data, analytics, and the AI value chain across several industries, describing specific use and business cases. This book is a must read for Chief Data Officers as well as accomplished or aspiring data scientists in any industry.

Boris Vishnevsky, Principal, Complex Solutions and Cyber Security, Slalom; Adjunct Professor, TJU

As an architect working with clients on highly complex projects, all of my new projects involve vast amounts of data, distributed sources of data, cloud-based technologies, and data science. This book is invaluable for my real-world enterprise scale practice. The anticipated risks, complexities, and the rewards of infusing AI are laid out in a well-organized manner that is easy to comprehend, taking the reader out of the scholastic endeavor of fact-based learning and into the real world of data science. I would highly recommend this book to anyone wanting to be meaningfully involved with data science.

John Aviles, Federal CTO Technical Lead, IBM

I hold over 150 patents and work as a data scientist on creating some of the most complex AI business projects, and this book has been of immense value to me as a field guide. The authors have established the need as to why IA must be part of a systematic maturing approach to AI. I regard this book as a “next generation AI guidebook” that your organization can't afford to be without.

Gandhi Sivakumar, Chief Architect and Master Inventor, IBM (Australia)

A seminal treatment for how enterprises must leverage AI. The authors provide a clear and understandable path forward for using AI across cloud, fog, and mist computing. A must read for any serious data scientist and data manager.

Raul Shneir, Director, Israel National Cyber Directorate (Israel)

As a professor at Wharton who teaches data science, I often talk to my students about emerging analytical tools such as AI that can provide valuable information to business decision makers. I also encourage them to keep abreast of such tools. Smarter Data Science will definitely make my recommended readings list. It articulates clearly how an organization can build a successful information architecture, capitalizing on the benefits of AI technologies. The authors have captured many intricate themes that are relevant for my students to carry with them into the business world. Many of the ideas presented in this book will benefit those working directly in the field of data science or those that will be impacted by data science. The book also includes many critical thinking tools to ready the worker of tomorrow … and realistically, today.

Dr. Josh Eliashberg, Sebastian S. Kresge Professor of Marketing, Professor of Operations, Information, and Decisions, The Wharton School

This is an excellent guide for the data-driven organization that must build a robust information architecture to continuously deliver greater value through data science or be relegated to the past. The book will enable organizations to complete their transformative journey to sustainably leverage AI technologies that incorporate cloud-based AI tools and dueling neural networks. The guiding principles that are laid out in the book should result in the democratization of data, a data literate workforce, and a transparent AI revolution.

Taarini Gupta, Behavioral Scientist/Data Scientist, Mind Genomics Advisors

Smarter Data Science

Succeeding with Enterprise-Grade Data and AI Projects

Neal Fishman with Cole Stryker


About the Authors

Neal Fishman is an IBM Distinguished Engineer and is the CTO for Data-Based Pathology within IBM's Global Business Services organization. Neal is also an Open Group Certified Distinguished IT Architect. Neal has extensive experience working with IBM's clients across six continents on complex data and AI initiatives.

Neal has previously served as a board member for several different industry communities and was the technology editor for the BRCommunity webzine. Neal has been a distance learning instructor with the University of Washington and has recorded some of his other insights in Viral Data in SOA: An Enterprise Pandemic and Enterprise Architecture Using the Zachman Framework. Neal also holds several data-related patents.

You can connect with Neal on LinkedIn at linkedin.com/in/neal-fishman-.

Cole Stryker is an author and journalist based in Los Angeles. He is the author of Epic Win for Anonymous, the story of a global gang of hackers and trolls who took on big corporations and governments, and Hacking the Future, which charts the history of anonymity and makes a case for its future as a form of cultural and political expression. His writing has appeared in Newsweek, The Nation, NBC News, Salon, Vice, Boing Boing, The NY Observer, The Huffington Post, and elsewhere.

You can connect with Cole on LinkedIn at linkedin.com/in/colestryker.

Acknowledgments

I want to express my sincere gratitude to Jim Minatel at John Wiley & Sons for giving me this opportunity. I would also like to sincerely thank my editor, Tom Dinse, for his attention to detail and for his excellent suggestions in helping to improve this book. I am very appreciative of the input provided by Tarik El-Masri, Alex Baryudin, and Elis Gitin. I would also like to thank Matt Holt, Devon Lewis, Pete Gaughan, Kenyon Brown, Kathleen Wisor, Barath Kumar Rajasekaran, Steven Stansel, Josephine Schweiloch, and Betsy Schaefer.

During my career, there have been several notable giants with whom I have worked and upon whose shoulders I clearly stand. Without these people, my career would not have taken the right turns: John Zachman, Warren Selkow, Ronald Ross, David Hay, and the late John Hall. I would like to recognize the renowned Grady Booch for his graciousness and kindness in contributing the Foreword. Finally, I would like to acknowledge the efforts of Cole Stryker for helping take this book to the next level.

Neal Fishman

Thanks to Jim Minatel, Tom Dinse, and the rest of the team at Wiley for recognizing the need for this book and for enhancing its value with their editorial guidance. I'd also like to thank Elizabeth Schaefer for introducing me to Neal and giving me the opportunity to work with him. Thanks also to Jason Oberholtzer and the folks at Gather for enabling my work at IBM. Lastly, I'm grateful to Neal Fishman for sharing his vision and inviting me to contribute to this important book.

Cole Stryker

Foreword for Smarter Data Science

There have been remarkable advances in artificial intelligence over the past decade, owing to a perfect storm at the confluence of three important forces: the rise of big data, the exponential growth of computational power, and the discovery of key algorithms for deep learning. IBM's Deep Blue beat the world's best chess player, Watson bested every human on Jeopardy!, and DeepMind's AlphaGo and AlphaZero have dominated the field of Go and videogames. On the one hand, these advances have proven useful in commerce and in science: AI has found an important role in manufacturing, banking, and medicine, to name a few domains. On the other hand, these advances raise some difficult questions, especially with regard to privacy and the conduct of war.

While discoveries in the science of artificial intelligence continue, the fruits of that science are now being put to work in the enterprise in very tangible ways, ways that are not only economically interesting but that also contribute to the human condition. As such, enterprises that want to leverage AI must turn their focus to engineering pragmatic systems of value that contain cognitive components.

That's where Smarter Data Science comes in.

As the authors explain, data is not an afterthought in building such systems; it is a forethought. To leverage AI for predicting, automating, and optimizing enterprise outcomes, the science of data must be made an intentional, measurable, repeatable, and agile part of the development pipeline. Here, you'll learn about best practices for collecting, organizing, analyzing, and infusing data in ways that make AI real for the enterprise. What I celebrate most about this book is that not only are the authors able to explain these best practices from a foundation of deep experience, they do so in a manner that is actionable. Their emphasis on results-driven methodology that is agile yet enables a strong architectural framework is refreshing.

I'm not a data scientist; I'm a systems engineer, and increasingly I find myself working with data scientists. Believe me, this is a book that has taught me many things. I think you'll find it quite informative as well.

Grady Booch
ACM, IEEE, and IBM Fellow

Epigraph

“There is no AI without IA.”

Seth Earley

IT Professional, vol. 18, no. 03, 2016.

(info.earley.com/hubfs/EIS_Assets/ITPro-Reprint-No-AI-without-IA.pdf)

In 2016, IT consultant and CEO Seth Earley wrote an article titled “There is no AI without IA” in an IEEE magazine called IT Professional. Earley put forth an argument that enterprises seeking to fully capitalize on the capabilities of artificial intelligence must first build out a supporting information architecture. Smarter Data Science provides a comprehensive response: an IA for AI.

Preamble

“What I'm trying to do is deliver results.”

Lou Gerstner

Business Week

Why You Need This Book

“No one would have believed in the last years of the nineteenth century that this world was being watched keenly and closely…”

So begins H. G. Wells' The War of the Worlds, 1898, Harper & Brothers. In the last years of the 20th century, such disbelief also prevailed. But unlike the fictional watchers from the 19th century, the late-20th century watchers were real, pioneering digitally enabled corporations. In The War of the Worlds, simple bacteria proved to be a defining weapon for both offense and defense. Today, the ultimate weapon is data. When data is misused, a corporate entity can implode. When data is used appropriately, a corporate entity can thrive.

Ever since the establishment of hieroglyphs and alphabets, data has been useful. The term business intelligence (BI) can be traced as far back as 1865 (ia601409.us.archive.org/25/items/cyclopaediacomm00devegoog). However, it wasn't until Herman Hollerith, whose company would eventually become known as International Business Machines, developed the punched card that data could be harvested at scale. Hollerith initially developed his punched card–processing technology for the 1890 U.S. government census. Later, in 1937, the U.S. government contracted IBM to use its punched card–reading machines for a new, massive bookkeeping project that involved 26 million Social Security numbers.

In 1965, the U.S. government built its first data center to store 742 million tax returns and 175 million sets of fingerprints on magnetic computer tape. With the advent of the Internet, and later mobile devices and IoT, it became possible for private companies to truly use data at scale, building massive stores of consumer data based on the growing number of touchpoints they now shared with their customers. Taken as an average, data is created at a rate of more than 1.7MB every second for every person (www.domo.com/solution/data-never-sleeps-6). That equates to approximately 154,000,000,000,000 punched cards. By coupling the volume of data with the capacity to meaningfully process that data, data can be used at scale for much more than simple record keeping.

Clearly, our world is firmly in the age of big data. Enterprises are scrambling to integrate capabilities that can address advanced analytics such as artificial intelligence and machine learning in order to best leverage their data. The need to draw out insights to improve business performance in the marketplace is nothing less than mandatory. Recent data management concepts such as the data lake have emerged to help guide enterprises in storing and managing data. In many ways, the data lake was a stark contrast to its forerunner, the enterprise data warehouse (EDW). Typically, the EDW accepted data that had already been deemed useful, and its content was organized in a highly systematic way.

When misused, a data lake serves as nothing more than a hoarding ground for terabytes and petabytes of unstructured and unprocessed data, much of it never to be used. However, a data lake can be meaningfully leveraged for the benefit of advanced analytics and machine learning models.

But are data warehouses and data lakes serving their intended purpose? More succinctly, are enterprises realizing the business-side benefit of having a place to hoard data?

The global research and advisory firm Gartner has provided sobering analysis. It has estimated that more than half of the enterprise data warehouses that were attempted have been failures and that the new data lake has fared even worse. At one time, Gartner analysts projected that the failure rate of data lakes might reach as high as 60 percent (blogs.gartner.com/nick-heudecker/big-data-challenges-move-from-tech-to-the-organization). However, Gartner has now dismissed that number as being too conservative. Actual failure rates are thought to be much closer to 85 percent (www.infoworld.com/article/3393467/4-reasons-big-data-projects-failand-4-ways-to-succeed.html).

Why have initiatives such as the EDW and the data lake failed so spectacularly? The short answer is that developing a proper information architecture isn't simple.

For much the same reason that the EDW failed, many of the approaches taken by data scientists have failed to recognize the following considerations:

  • The nature of the enterprise
  • The business of the organization
  • The stochastic and potentially gargantuan nature of change
  • The importance of data quality
  • How different techniques applied to schema design and information architecture can affect the organization's readiness for change

Analysis attributes the higher failure rate for data lakes and big data initiatives not to the technology itself but, rather, to how technologists have applied the technology (datazuum.com/5-data-actions-2018/).

These facets become quickly self-evident in conversations with our enterprise clients. In discussing data warehousing and data lakes, the conversation often involves answers such as, “Which one? We have many of each.” It often happens that a department within an organization needs a repository for its data, but its requirements are not satisfied by previous data storage efforts. So instead of attempting to reform or update older data warehouses or lakes, the department creates a new data store. The result is a hodgepodge of data storage solutions that don't always play well together, which leads to lost opportunities for data analysis.

Obviously, new technologies can provide many tangible benefits, but those benefits cannot be realized unless the technologies are deployed and managed with care. Unlike designing a building as in traditional architecture, information architecture is not a set-it-and-forget-it prospect.

While your organization can control how data is ingested, it can't always control how the data it needs changes over time. Organizations tend to be fragile in that they can break when circumstances change. Only flexible, adaptive information architectures can adjust to new environmental conditions. Designing and deploying solutions against a moving target is difficult, but the challenge is not insurmountable.

The glib assertion that garbage in will equal garbage out is treated as being passé by many IT professionals. While in truth garbage data has plagued analytics and decision-making for decades, mismanaged data and inconsistent representations will remain a red flag for each AI project you undertake.

The level of data quality demanded by machine learning and deep learning can be significant. Like a coin with two sides, low data quality can have two separate and equally devastating impacts. On the one hand, low-quality data associated with historical data can distort the training of a predictive model. On the other, new data can distort the model and negatively impact decision-making.

As a sharable resource, data is exposed across your organization through layers of services that can behave like a virus when the level of data quality is poor—unilaterally affecting all those who touch the data. Therefore, an information architecture for artificial intelligence must be able to mitigate traditional issues associated with data quality, foster the movement of data, and, when necessary, provide isolation.

The purpose of this book is to provide you with an understanding of how the enterprise must approach the work of building an information architecture in order to make way for successful, sustainable, and scalable AI deployments. The book includes a structured framework and advice that is both practical and actionable toward the goal of implementing an information architecture that's equipped to capitalize on the benefits of AI technologies.

What You'll Learn

We'll begin in Chapter 1, “Climbing the AI Ladder,” with a discussion of the AI Ladder, an illustrative device developed by IBM to demonstrate the steps, or rungs, an organization must climb to realize sustainable benefits with the use of AI. From there, Chapter 2, “Framing Part I: Considerations for Organizations Using AI,” and Chapter 3, “Framing Part II: Considerations for Working with Data and AI,” cover an array of considerations data scientists and IT leaders must be aware of as they make their way up the ladder.

In Chapter 4, “A Look Back on Analytics: More Than One Hammer” and Chapter 5, “A Look Forward on Analytics: Not Everything Can Be a Nail,” we'll explore some recent history: data warehouses and how they've given way to data lakes. We'll discuss how data lakes must be designed in terms of topography and topology. This will flow into a deeper dive into data ingestion, governance, storage, processing, access, management, and monitoring.

In Chapter 6, “Addressing Operational Disciplines on the AI Ladder,” we'll discuss how DevOps, DataOps, and MLOps can enable an organization to better use its data in real time. In Chapter 7, “Maximizing the Use of Your Data: Being Value Driven,” we'll delve into the elements of data governance and integrated data management. We'll cover the data value chain and the need for data to be accessible and discoverable in order for the data scientist to determine the data's value.

Chapter 8, “Valuing Data with Statistical Analysis and Enabling Meaningful Access,” introduces different approaches for data access, as different roles within the organization will need to interact with data in different ways. The chapter also furthers the discussion of data valuation, with an explanation of how statistics can assist in ranking the value of data.

In Chapter 9, “Constructing for the Long-Term,” we'll discuss some of the things that can go wrong in an information architecture and the importance of data literacy across the organization to prevent such issues.

Finally, Chapter 10, “A Journey's End: An IA for AI,” will bring everything together with a detailed overview of developing an information architecture for artificial intelligence (IA for AI). This chapter provides practical, actionable steps that will bring the preceding theoretical backdrop to bear on real-world information architecture development.