Cover Page

Information Quality

The Potential of Data and Analytics to Generate Knowledge

 

 

Ron S. Kenett

KPA, Israel and University of Turin, Italy

Galit Shmueli

National Tsing Hua University, Taiwan

 

 

 

 

 

 

 

 

 

To Sima; our children Dolav, Ariel, Dror, and Yoed; and their families and especially their children, Yonatan, Alma, Tomer, Yadin, Aviv, Gili, Matan, and Eden, who are my source of pride and motivation.

And to the memory of my dear friend, Roberto Corradetti, who dedicated his career to applied statistics.

RSK

To my family, mentors, colleagues, and students who’ve sparked and nurtured the creation of new knowledge and innovative thinking

GS

Foreword

I am often invited to assess research proposals. Included amongst the questions I have to ask myself in such assessments are: Are the goals stated sufficiently clearly? Does the study have a good chance of achieving the stated goals? Will the researchers be able to obtain sufficient quality data for the project? Are the analysis methods adequate to answer the questions? And so on. These questions are fundamental, not merely for research proposals, but for any empirical study – for any study aimed at extracting useful information from evidence or data. And yet they are rarely overtly stated. They tend to lurk in the background, with the capability of springing into the foreground to bite those who failed to think them through.

These questions are precisely the sorts of questions addressed by the InfoQ – Information Quality – framework. Answering such questions allows funding bodies, corporations, national statistical institutes, and other organisations to rank proposals, balance costs against success probability, and also to identify the weaknesses and hence improve proposals and their chance of yielding useful and valuable information. In a context of increasing constraints on financial resources, it is critical that money is well spent, so that maximising the chance that studies will obtain useful information is becoming more and more important. The InfoQ framework provides a structure for maximising these chances.

A glance at the statistics shelves of any technical library will reveal that most books focus narrowly on the details of data analytic methods. The same is true of almost all statistics teaching. This is all very well – it is certainly vital that such material be covered. After all, without an understanding of the basic tools, no analysis, no knowledge extraction would be possible. But such a narrow focus typically fails to place such work in the broader context, without which its chances of success are damaged. This volume will help to rectify that oversight. It will provide readers with insight into and understanding of other key parts of empirical analysis, parts which are vital if studies are to yield valid, accurate, and useful conclusions.

But the book goes beyond merely providing a framework. It also delves into the details of these overlooked aspects of data analysis. It discusses the fact that the same data may be high quality for one purpose and low for another, and that the adequacy of an analysis depends on the data and the goal, as well as depending on other less obvious aspects, such as the accessibility, completeness, and confidentiality of the data. And it illustrates the ideas with a series of illuminating applications.

With computers increasingly taking on the mechanical burden of data analytics, the opportunities are becoming greater for us to shift our attention to the higher order aspects of analysis: to precise formulation of the questions, to consideration of data quality to answer those questions, to choice of the best method for the aims, taking account of the entire context of the analysis. In doing so we improve the quality of the conclusions we reach. And this, in turn, leads to improved decisions for researchers, policy makers, managers, and others. This book will provide an important tool in this process.

David J. Hand

Imperial College London

About the authors

Ron S. Kenett is chairman of the KPA Group; research professor, University of Turin, Italy; visiting professor at the Hebrew University Institute for Drug Research, Jerusalem, Israel and at the Faculty of Economics, Ljubljana University, Slovenia. He is past president of the Israel Statistical Association (ISA) and of the European Network for Business and Industrial Statistics (ENBIS). Ron has authored and coauthored over 200 papers and 12 books on topics ranging from industrial statistics, customer surveys, multivariate quality control, risk management, biostatistics, and statistical methods in healthcare to performance appraisal systems and integrated management models. The KPA Group he formed in 1990 is a leading Israeli firm focused on generating insights through analytics with international customers such as HP, 3M, Teva, Perrigo, Roche, Intel, Amdocs, Stratasys, Israel Aircraft Industries, the Israel Electricity Corporation, ICL, start‐ups, banks, and healthcare providers. He was awarded the 2013 Greenfield Medal by the Royal Statistical Society in recognition of excellence in contributions to the applications of statistics. Among his many activities he is a member of the National Public Advisory Council for Statistics Israel; member of the Executive Academic Council, Wingate Academic College; and board member of several pharmaceutical and Internet product companies.

Galit Shmueli is distinguished professor at National Tsing Hua University’s Institute of Service Science. She is known for her research and teaching in business analytics, with a focus on statistical and data mining methods in information systems and healthcare. She has authored and coauthored over 70 journal articles, book chapters, books, and textbooks, including Data Mining for Business Analytics, Modeling Online Auctions, and Getting Started with Business Analytics. Her research is published in top journals in statistics, management, marketing, information systems, and more. Professor Shmueli has designed and instructed business analytics courses and programs since 2004 at the University of Maryland, the Indian School of Business, Statistics.com, and National Tsing Hua University, Taiwan. She has also taught engineering statistics courses at the Israel Institute of Technology and at Carnegie Mellon University.

Preface

This book is about a strategic and tactical approach to data analysis where providing added value by turning numbers into insights is the main goal of an empirical study. In our long‐time experience as applied statisticians and data mining researchers (“data scientists”), we have focused on developing methods for data analysis and applying them to real problems. Our experience has been, however, that data analysis is part of a bigger process that begins with problem elicitation, which consists of defining unstructured problems, and ends with decisions on action items and interventions that reflect the true impact of a study.

In 2006, the first author published a paper on the statistical education bias where, typically, in courses on statistics and data analytics, only statistical methods are taught, without reference to the statistical analysis process (Kenett and Thyregod, 2006).

In 2010, the second author published a paper showing the differences between statistical modeling aimed at prediction goals versus modeling designed to explain causal effects (Shmueli, 2010), the implication being that the goal of a study should affect the way a study is performed, from data collection to data pre‐processing, exploration, modeling, validation, and deployment. A related paper (Shmueli and Koppius, 2011) focused on the role of predictive analytics in theory building and scientific development in the explanatory‐dominated social sciences and management research fields.

In 2014, we published “On Information Quality” (Kenett and Shmueli, 2014), a paper designed to lay out the foundation for a holistic approach to data analysis (using statistical modeling, data mining approaches, or any other data analysis methods) by structuring the main ingredients of what turns numbers into information. We called the approach information quality (InfoQ) and identified four InfoQ components and eight InfoQ dimensions.

Our main thesis is that data analysis, and especially the fields of statistics and data science, need to adapt to modern challenges and technologies by developing structured methods that provide a broad life cycle view, that is, from numbers to insights. This life cycle view needs to be focused on generating InfoQ as a key objective (for more on this see Kenett, 2015).

This book, Information Quality: The Potential of Data and Analytics to Generate Knowledge, offers an extensive treatment of InfoQ and the InfoQ framework. It is aimed at motivating researchers to further develop InfoQ elements and at students in programs that teach them how to make sure their analytic or statistical work is generating information of high quality.

Addressing this mixed community has been a challenge. On the one hand, we wanted to provide academic considerations, and on the other hand, we wanted to present examples and cases that motivate students and practitioners and give them guidance in their own specific projects.

We try to achieve this mix of objectives by combining Part I, which is mostly methodological, with Part II, which is based on examples and case studies.

In Part III, we treat additional topics relevant to InfoQ such as reproducible research, the review of scientific and applied research publications, the incorporation of InfoQ in academic and professional development programs, and how three leading software platforms, R, MINITAB, and JMP, support InfoQ implementations.

Researchers interested in applied statistics methods and strategies will most likely start in Part I and then move to Part II to see illustrations of the InfoQ framework applied in different domains. Practitioners and students learning how to turn numbers into information can start in a relevant chapter of Part II and move back to Part I.

A teacher or designer of a course on data analytics, applied statistics, or data science can build on examples in Part II and consolidate the approach by covering Chapter 13 and the chapters in Part I. Chapter 13 on “Integrating InfoQ into data science analytics programs, research methods courses and more” was specially prepared for this audience. We also developed five case studies that can be used by teachers as a rating‐based InfoQ assessment exercise (available at http://infoq.galitshmueli.com/class-assignment).

In developing InfoQ, we received generous inputs from many people. In particular, we would like to acknowledge insightful comments by Sir David Cox, Shelley Zacks, Benny Kedem, Shirley Coleman, David Banks, Bill Woodall, Ron Snee, Peter Bruce, Shawndra Hill, Christine Anderson‐Cook, Ray Chambers, Fritz Scheuren, Ernest Foreman, Philip Stark, and David Steinberg. The motivation to apply InfoQ to the review of papers (Chapter 12) came from a comment by Ross Sparks who wrote to us: “I really like your framework for evaluating information quality and I have started to use it to assess papers that I am asked to review. Particularly applied papers.” In preparing the material, we benefited from comprehensive editorial inputs by Raquelle Azran and Noa Shmueli, who generously provided us with their invaluable expertise. We would like to thank them and recognize their help in improving the language and style of the text.

The last three chapters were contributed by colleagues. They create a bridge between theory and practice showing how InfoQ is supported by R, MINITAB, and JMP. We thank the authors of these chapters, Silvia Salini, Federica Cugnata, Elena Siletti, Ian Cox, Pere Grima, Lluis Marco‐Almagro, and Xavier Tort‐Martorell, for their effort, which helped make this work both theoretical and practical.

We are especially thankful to Professor David J. Hand for preparing the foreword of the book. David has been a source of inspiration to us for many years and his contribution highlights the key parts of our work.

In the course of writing this book and developing the InfoQ framework, the first author benefited from numerous discussions with colleagues at the University of Turin, in particular with a great visionary of the role of applied statistics in modern business and industry, the late Professor Roberto Corradetti. Roberto was a close friend and greatly influenced this work by continuously emphasizing the need for statistical work to be appreciated by its customers in business and industry. In addition, the financial support of the Diego de Castro Foundation that he managed provided the time to work in a stimulating academic environment at both the Faculty of Economics and the “Giuseppe Peano” Department of Mathematics of UNITO, the University of Turin. The contributions of Roberto Corradetti cannot be overestimated and are humbly acknowledged. Roberto passed away in June 2015 and left behind a great void. The second author thanks participants of the 2015 Statistical Challenges in eCommerce Research Symposium, where she presented the keynote address on InfoQ, for their feedback and enthusiasm regarding the importance of the InfoQ framework to current social science and management research.

Finally, we acknowledge with pleasure the professional help of the Wiley personnel, including Heather Kay, Alison Oliver, and Adalfin Jayasingh, and thank them for their encouragement, comments, and input, which were instrumental in improving the form and content of the book.

Ron S. Kenett and Galit Shmueli

References

  1. Kenett, R.S. (2015) Statistics: a life cycle view (with discussion). Quality Engineering, 27(1), pp. 111–129.
  2. Kenett, R.S. and Shmueli, G. (2014) On information quality (with discussion). Journal of the Royal Statistical Society, Series A, 177(1), pp. 3–38.
  3. Kenett, R.S. and Thyregod, P. (2006) Aspects of statistical consulting not taught by academia. Statistica Neerlandica, 60(3), pp. 396–412.
  4. Shmueli, G. (2010) To explain or to predict? Statistical Science, 25(3), pp. 289–310.
  5. Shmueli, G. and Koppius, O.R. (2011) Predictive analytics in information systems research. MIS Quarterly, 35(3), pp. 553–572.

Quotes about the book

What experts say about Information Quality: The Potential of Data and Analytics to Generate Knowledge:

A glance at the statistics shelves of any technical library will reveal that most books focus narrowly on the details of data analytic methods. The same is true of almost all statistics teaching. This volume will help to rectify that oversight. It will provide readers with insight into and understanding of other key parts of empirical analysis, parts which are vital if studies are to yield valid, accurate, and useful conclusions.

David Hand

Imperial College, London, UK

There is an important distinction between data and information. Data become information only when they serve to inform, but what is the potential of data to inform? With the work Kenett and Shmueli have done, we now have a general framework to answer that question. This framework is relevant to the whole analysis process, showing the potential to achieve higher‐quality information at each step.

John Sall

SAS Institute, Cary, NC, USA

The authors have a rare quality: being able to present deep thoughts and sound approaches in a way practitioners can feel comfortable and understand when reading their work and, at the same time, researchers are compelled to think about how they do their work.

Fabrizio Ruggeri

Consiglio Nazionale delle Ricerche
Istituto di Matematica Applicata e Tecnologie Informatiche, Milan, Italy

No amount of technique can make irrelevant data fit for purpose, eliminate unknown biases, or compensate for data paucity. Useful, reliable inferences require balancing real‐world and theoretical considerations and recognizing that goals, data, analysis, and costs are necessarily connected. Too often, books on statistics and data analysis put formulae in the limelight at the expense of more important questions about the relevance and limitations of data and the purpose of the analysis. This book elevates these crucial issues to their proper place and provides a systematic structure (and examples) to help practitioners see the larger context of statistical questions and, thus, to do more valuable work.

Philip Stark

University of California, Berkeley, USA

…the “Q” issue is front and centre for anyone (or any agency) hoping to benefit from the data tsunami that is said to be driving things now … And so the book will be very timely.

Ray Chambers

University of Wollongong, Australia

Kenett and Shmueli shed light on the biggest contributor to erroneous conclusions in research: poor information quality coming out of a study. This issue, made worse by the advent of Big Data, has received too little attention in the literature and the classroom. Information quality issues can completely undermine the utility and credibility of a study, yet researchers typically deal with them in an ad‐hoc, offhand fashion, often when it is too late. Information Quality offers a sensible framework for ensuring that the data going into a study can effectively answer the questions being asked.

Peter Bruce

The Institute for Statistics Education

Policy makers rely on high quality and relevant data to make decisions and it is important that, as more and different types of data become available, we are mindful of all aspects of the quality of the information provided. This includes not only statistical quality, but other dimensions as outlined in this book including, very importantly, whether the data and analyses answer the relevant questions.

John Pullinger

National Statistician, UK Statistics Authority, London, UK

This impressive book fills a gap in the teaching of statistical methodology. It deals with a neglected topic in statistical textbooks: the quality of the information provided by the producers of statistical projects and used by the customers of statistical data from surveys, administrative data, etc. The book’s emphasis on defining, discussing, and analyzing the goal of the project at a preliminary stage and, no less important, at the analysis stage and in the use of the results obtained is of major importance.

Moshe Sikron

Former Government Statistician of Israel, Jerusalem, Israel

Ron Kenett and Galit Shmueli belong to a class of practitioners who go beyond methodological prowess into questioning what purpose should be served by a data based analysis, and what could be done to gauge the fitness of the analysis to meet its purpose. This kind of insight is all the more urgent given the present climate of controversy surrounding science’s own quality control mechanism. In fact, science used in support of economic or policy decisions, be it natural or social science, has an evident sore point precisely in the sort of statistical and mathematical modelling where the approach they advocate, Information Quality or InfoQ, is most needed. A full chapter is specifically devoted to the contribution InfoQ can make to clarify aspects of reproducibility, repeatability, and replicability of scientific research and publications. InfoQ is an empirical and flexible construct with practically infinite application in data analysis. In a context of policy, one can deploy InfoQ to compare different evidential bases for or against a policy, or different options in an impact assessment case. InfoQ is a holistic construct encompassing the data, the method, and the goal of the analysis. It goes beyond the dimensions of data quality met in official statistics and resembles more holistic concepts of performance such as analysis pedigrees (NUSAP) and sensitivity auditing. Thus InfoQ includes consideration of analysis’ Generalizability and Action Operationalization. The latter includes both action operationalization (to what extent concrete actions can be derived from the information provided by a study) and construct operationalization (to what extent a construct under analysis is effectively captured by the selected variables for a given goal). A desirable feature of InfoQ is that it demands multidisciplinary skills, which may force statisticians to move out of their comfort zone into the real world.

The book illustrates the eight dimensions of InfoQ with a wealth of examples. A recommended read for applied statisticians and econometricians who care about the implications of their work.

Andrea Saltelli

European Centre for Governance in Complexity

Kenett and Shmueli have made a significant contribution to the profession by drawing attention to what is frequently the most important but overlooked aspect of analytics: information quality. For example, statistics textbooks too often assume that data consist of random samples and are measured without error, and data science competitions implicitly assume that massive data sets contain high‐quality data and are exactly the data needed for the problem at hand. In reality, of course, random samples are the exception rather than the rule, and many data sets, even very large ones, are not worth the effort required to analyze them. Analytics is akin to mining, not to alchemy; the methods can only extract what is there to begin with. Kenett and Shmueli make clear the point that obtaining good data typically requires significant effort. Fortunately, they present metrics to help analysts understand the limitations of the information in hand, and how to improve it going forward. Kudos to the authors for this important contribution.

Roger Hoerl

Union College, Schenectady, NY, USA

About the companion website

Don’t forget to visit the companion website for this book:

www.wiley.com/go/information_quality

Here you will find valuable material designed to enhance your learning, including:

  1. The JMP add‐in presented in Chapter 16
  2. Five case studies that can be used as exercises of InfoQ assessment
  3. A set of presentations on InfoQ


Part I
THE INFORMATION QUALITY FRAMEWORK