Cover
Title Page
Foreword
About the authors
Preface
1. References
Quotes about the book
About the companion website
Part I: THE INFORMATION QUALITY FRAMEWORK
1. 1 Introduction to information quality
  1. 1.1 Introduction
  2. 1.2 Components of InfoQ
  3. 1.3 Definition of information quality
  4. 1.4 Examples from online auction studies
  5. 1.5 InfoQ and study quality
  6. 1.6 Summary
  7. References
2. 2 Quality of goal, data quality, and analysis quality
  1. 2.1 Introduction
  2. 2.2 Data quality
  3. 2.3 Analysis quality
  4. 2.4 Quality of utility
  5. 2.5 Summary
  6. References
3. 3 Dimensions of information quality and InfoQ assessment
  1. 3.1 Introduction
  2. 3.2 The eight dimensions of InfoQ
  3. 3.3 Assessing InfoQ
  4. 3.4 Example: InfoQ assessment of online auction experimental data
  5. 3.5 Summary
  6. References
4. 4 InfoQ at the study design stage
  1. 4.1 Introduction
  2. 4.2 Primary versus secondary data and experiments versus observational data
  3. 4.3 Statistical design of experiments
  4. 4.4 Clinical trials and experiments with human subjects
  5. 4.5 Design of observational studies: Survey sampling
  6. 4.6 Computer experiments (simulations)
  7. 4.7 Multiobjective studies
  8. 4.8 Summary
  9. References
5. 5 InfoQ at the postdata collection stage
  1. 5.1 Introduction
  2. 5.2 Postdata collection data
  3. 5.3 Data cleaning and preprocessing
  4. 5.4 Reweighting and bias adjustment
  5. 5.5 Meta‐analysis
  6. 5.6 Retrospective experimental design analysis
  7. 5.7 Models that account for data “loss”: Censoring and truncation
  8. 5.8 Summary
  9. References
Part II: APPLICATIONS OF InfoQ
1. 6 Education
  1. 6.1 Introduction
  2. 6.2 Test scores in schools
  3. 6.3 Value‐added models for educational assessment
  4. 6.4 Assessing understanding of concepts
  5. 6.5 Summary
  6. Appendix: MERLO implementation for an introduction to statistics course
  7. References
2. 7 Customer surveys
  1. 7.1 Introduction
  2. 7.2 Design of customer surveys
  3. 7.3 InfoQ components
  4. 7.4 Models for customer survey data analysis
  5. 7.5 InfoQ evaluation
  6. 7.6 Summary
  7. Appendix: A posteriori InfoQ improvement for survey nonresponse selection bias
  8. References
3. 8 Healthcare
  1. 8.1 Introduction
  2. 8.2 Institute of medicine reports
  3. 8.3 Sant’Anna di Pisa report on the Tuscany healthcare system
  4. 8.4 The haemodialysis case study
  5. 8.5 The Geriatric Medical Center case study
  6. 8.6 Report of cancer incidence cluster
  7. 8.7 Summary
  8. References
4. 9 Risk management
  1. 9.1 Introduction
  2. 9.2 Financial engineering, risk management, and Taleb’s quadrant
  3. 9.3 Risk management of OSS
  4. 9.4 Risk management of a telecommunication system supplier
  5. 9.5 Risk management in enterprise system implementation
  6. 9.6 Summary
  7. References
5. 10 Official statistics
  1. 10.1 Introduction
  2. 10.2 Information quality and official statistics
  3. 10.3 Quality standards for official statistics
  4. 10.4 Standards for customer surveys
  5. 10.5 Integrating official statistics with administrative data for enhanced InfoQ
  6. 10.6 Summary
  7. References
Part III: IMPLEMENTING InfoQ
1. 11 InfoQ and reproducible research
  1. 11.1 Introduction
  2. 11.2 Definitions of reproducibility, repeatability, and replicability
  3. 11.3 Reproducibility and repeatability in GR&R
  4. 11.4 Reproducibility and repeatability in animal behavior studies
  5. 11.5 Replicability in genome‐wide association studies
  6. 11.6 Reproducibility, repeatability, and replicability: the InfoQ lens
  7. 11.7 Summary
  8. Appendix: Gauge repeatability and reproducibility study design and analysis
  9. References
2. 12 InfoQ in review processes of scientific publications
  1. 12.1 Introduction
  2. 12.2 Current guidelines in applied journals
  3. 12.3 InfoQ guidelines for reviewers
  4. 12.4 Summary
  5. References
3. 13 Integrating InfoQ into data science analytics programs, research methods courses, and more
  1. 13.1 Introduction
  2. 13.2 Experience from InfoQ integrations in existing courses
  3. 13.3 InfoQ as an integrating theme in analytics programs
  4. 13.4 Designing a new analytics course (or redesigning an existing course)
  5. 13.5 A one‐day InfoQ workshop
  6. 13.6 Summary
  7. Acknowledgements
  8. References
4. 14 InfoQ support with R
  1. 14.1 Introduction
  2. 14.2 Examples of information quality with R
  3. 14.3 Components and dimensions of InfoQ and R
  4. 14.4 Summary
  5. References
5. 15 InfoQ support with Minitab
  1. 15.1 Introduction
  2. 15.2 Components and dimensions of InfoQ and Minitab
  3. 15.3 Examples of InfoQ with Minitab
  4. 15.4 Summary
  5. References
6. 16 InfoQ support with JMP
  1. 16.1 Introduction
  2. 16.2 Example 1: Controlling a film deposition process
  3. 16.3 Example 2: Predicting water quality in the Savannah River Basin
  4. 16.4 A JMP application to score the InfoQ dimensions
  5. 16.5 JMP capabilities and InfoQ
  6. 16.6 Summary
  7. References
Index
End User License Agreement

List of Tables

Chapter 04
1. Table 4.1 Statistical strategies for increasing InfoQ given a priori causes at the design stage.
Chapter 05
1. Table 5.1 Statistical strategies for increasing InfoQ given a posteriori causes at the postdata collection stage and approaches for increasing InfoQ.
Chapter 06
1. Table 6.1 InfoQ assessment for MAP report.
2. Table 6.2 InfoQ assessment for student’s lifelong earning study.
3. Table 6.3 InfoQ assessment for VAM (based on ASA statement).
4. Table 6.4 MERLO recognition scores for ten concepts taught in an Italian middle school.
5. Table 6.5 Grouping of MERLO recognition scores using the Tukey method and 95% confidence.
6. Table 6.6 InfoQ assessment for MERLO.
7. Table 6.7 Scoring of InfoQ dimensions of examples from education.
Chapter 07
1. Table 7.1 Main deliverables in an Internet‐based ACSS project.
2. Table 7.2 Service level agreements for Internet‐based customer satisfaction surveys.
3. Table 7.3 A typical ACSS activity plan.
4. Table 7.4 InfoQ score of various models used in the analysis of customer surveys.
5. Table A Postdata collection correction for nonresponse bias in a customer satisfaction survey using adjusted residuals.
Chapter 08
1. Table 8.1 InfoQ components for IOM‐related studies.
2. Table 8.2 InfoQ dimensions and ratings for Stelfox et al. (2006) data and for the IOM reports.
3. Table 8.3 InfoQ components for Sant’Anna di Pisa study.
4. Table 8.4 InfoQ dimensions and ratings on 5‐point scale for Sant’Anna di Pisa study.
5. Table 8.5 InfoQ components for the haemodialysis decision support system.
6. Table 8.6 Marginal posterior distributions for the j‐th patient’s risk profile (True = risk has materialized).
7. Table 8.7 Posterior distributions of outcome measures for two patients.
8. Table 8.8 InfoQ dimensions and ratings on 5‐point scale for haemodialysis study.
9. Table 8.9 InfoQ components for the two NataGer projects data.
10. Table 8.10 InfoQ dimensions and ratings on 5‐point scale for the two NataGer projects.
11. Table 8.11 InfoQ components of cancer incidence report.
12. Table 8.12 InfoQ dimensions and ratings of cancer incidence study by Rottenberg et al. (2013).
13. Table 8.13 Scoring of InfoQ dimensions for each of the four healthcare case studies.
Chapter 09
1. Table 9.1 Log of technicians’ on‐site interventions (techdb).
2. Table 9.2 Balance sheet indicators for a given customer of the VNO (balance).
3. Table 9.3 Classification of 264 CEAO chains by aspect and division (output from MINITAB version 12.1).
4. Table 9.4 Scoring of InfoQ dimensions of the five risk management case studies.
Chapter 10
1. Table 10.1 Relationship between NCSES standards and InfoQ dimensions. Shaded cells indicate an existing relationship.
2. Table 10.2 Relationship between ISO 10004 guidelines and InfoQ dimensions. Shaded cells indicate an existing relationship.
3. Table 10.3 Scores for InfoQ dimensions for Stella education case study.
4. Table 10.4 Scores for InfoQ dimensions for the NHTSA safety case study.
Chapter 11
1. Table 11.1 Terminology in GR&R studies.
2. Table 11.2 Terminology in animal experiments.
3. Table 11.3 Terminology in genome‐wide association studies.
4. Table A ANOVA table of GR&R experiments.
Chapter 12
1. Table 12.1 List of journals published by the American Statistical Association (ASA). Referee guidelines web pages were not found for any of these journals.
2. Table 12.2 Partial list of journals published by the American Society for Quality (ASQ). Referee guidelines web pages were not found for any of these journals. The same lack of guidelines applies to all other ASQ journals (http://asq.org/pub/).
3. Table 12.3 List of journals published by the Institute of Mathematical Statistics (IMS) and URLs for referee guidelines (accessed July 7, 2014).
4. Table 12.4 List of journals published by the Royal Statistical Society (RSS) and URLs for referee guidelines (accessed July 7, 2014).
5. Table 12.5 List of journals in machine learning and URLs for referee guidelines (accessed July 7, 2014).
6. Table 12.6 Reviewing guidelines for major data mining conference (accessed July 7, 2014).
7. Table 12.7 List of top scientific journals and URLs for referee guidelines (accessed July 7, 2014).
8. Table 12.8 Questionnaire for reviewers of applied research submission.
Chapter 15
1. Table 15.1 InfoQ assessment for Example 1.
2. Table 15.2 Results of the factorial experimental design of the steering wheels.
3. Table 15.3 InfoQ assessment for Example 2.
Chapter 16
1. Table 16.1 Synopsis of Example 1.
2. Table 16.2 InfoQ assessment for Example 1.
3. Table 16.3 Synopsis of Example 2.
4. Table 16.4 Ys for the PLS model.
5. Table 16.5 InfoQ assessment for Example 2.

List of Illustrations

Chapter 01
1. Figure 1.1 The four InfoQ components.
2. Figure 1.2 Price curves for the last day of four seven‐day auctions (x‐axis denotes day of auction). Current auction price (line with circles), functional price curve (smooth line) and forecasted price curve (broken line).
Chapter 03
1. Figure 3.1 Timeline of study, from data collection to study deployment.
Chapter 04
1. Figure 4.1 JMP screenshot of a 2⁷⁻³ fractional factorial experiment with the piston simulator described in Kenett and Zacks (2014).
2. Figure 4.2 JMP screenshot of a definitive screening design experiment with the piston simulator described in Kenett and Zacks (2014).
3. Figure 4.3 JMP screenshot of fraction of design space plots and design diagnostics of fractional (left) and definitive screening designs (right).
Chapter 05
1. Figure 5.1 Illustration of right, left, and interval censoring. Each line denotes the lifetime of the observation.
Chapter 06
1. Figure 6.1 The Missouri Assessment Program test report for fictional student Sara Armstrong.
2. Figure 6.2 SAT Critical Reading skills.
3. Figure 6.3 Earning per teacher value‐added score.
4. Figure 6.4 Test scores by school by high value‐added teacher score.
5. Figure 6.5 Template for constructing an item family in MERLO.
6. Figure 6.6 Example of MERLO item (mathematics/functions).
7. Figure 6.7 Box plots of MERLO recognition scores in ten mathematical topics taught in an Italian middle school. Asterisks represent outliers beyond three standard deviations of the mean.
8. Figure 6.8 Confidence intervals for difference in MERLO recognition scores between topics.
Chapter 07
1. Figure 7.1 SERVQUAL gap model.
2. Figure 7.2 Bayesian network of responses to satisfaction questions from various topics, overall satisfaction, repurchasing intentions, recommendation level, and country of respondent.
Chapter 08
1. Figure 8.1 Bayesian network of patient haemodialysis treatment.
2. Figure 8.2 Visual board display designed to help reduce patients’ falls.
3. Figure 8.3 Prioritization tool for potential causes for bedsores occurrence.
Chapter 09
1. Figure 9.1 Bayesian network linking risk drivers with the activeness of risk indicators.
2. Figure 9.2 Social network based on email communication between OSS contributors and committers.
3. Figure 9.3 Simplex representation of association rules of event categories in telecom case study.
4. Figure 9.4 A sample CEAO chain.
5. Figure 9.5 Correspondence analysis of CEAO chains in five divisions by aspect. K&S = knowledge and skills; Mgmt = management; P = process; S = structure; S&G = strategy and goals; SD = social dynamics.
Chapter 10
1. Figure 10.1 BN for the Stella dataset.
2. Figure 10.2 BN is conditioned on a value of lastsal which is similar to the salary value of the Graduates dataset.
3. Figure 10.3 BN is conditioned on a low value of begsal and emp and for a high value of yPhD.
4. Figure 10.4 BN for the Graduates dataset.
5. Figure 10.5 BN is conditioned on a high value of msalary.
6. Figure 10.6 BN is conditioned on a high value of mdipl and nemp and for a low value of ystjob.
7. Figure 10.7 BN for the Vehicle Safety dataset.
8. Figure 10.8 BN for the Crash Test dataset.
9. Figure 10.9 BN for the Crash Test dataset is conditioned on a high value of Wt and Year.
10. Figure 10.10 BN for the Crash Test dataset is conditioned on a low value of Wt and Year.
Chapter 13
1. Figure 13.1 Google Trends data on “data science course.”
2. Figure 13.2 InfoQ evaluation form for an empirical study on air quality. The complete information and additional studies for evaluation are available at goo.gl/erNPF.
Chapter 14
1. Figure 14.1 An example of RStudio window.
2. Figure 14.2 An example of R Commander window.
3. Figure 14.3 Wordclouds for the two datasets.
4. Figure 14.4 Comparison (Expo 2015 = dark, Expo 2020 = light) and commonality clouds.
5. Figure 14.5 ExpoBarometro results.
6. Figure 14.6 SensoMineR menu in Excel.
7. Figure 14.7 Assessment of the performance of the panel with the panelperf() and coltable() functions.
8. Figure 14.8 Representation of the perfumes and the sensory attributes on the first two dimensions resulting from PCA() on adjusted means of ANOVA models.
9. Figure 14.9 Representation of the perfumes on the first two dimensions resulting from PCA() in which each product is associated with a confidence ellipse.
10. Figure 14.10 Representation of the perfumes and the sensory attributes on the first two dimensions resulting from MFA() of both experts and consumers data.
11. Figure 14.11 Visualization of the hedonic scores given by the panelists.
12. Figure 14.12 Nights spent in tourist accommodation establishments by NUTS level 2 region, 2013 (million nights spent by residents and nonresidents).
13. Figure 14.13 Bayesian network.
14. Figure 14.14 Distribution of the overall satisfaction for each level of each variable.
Chapter 15
1. Figure 15.1 Minitab user interface, with session and worksheet windows.
2. Figure 15.2 Some menu options for basic statistical analysis and quality tools.
3. Figure 15.3 A screenshot of Minitab help.
4. Figure 15.4 A histogram (left) and its corresponding stem‐and‐leaf graph (right), of heartbeats per minute of students in a class.
5. Figure 15.5 An example of a DDE connection between Excel and Minitab.
6. Figure 15.6 An example of the number of defects during a month, shown in a time series plot.
7. Figure 15.7 An example of a Pareto chart with all the data together (top) and stratifying by month (bottom).
8. Figure 15.8 A screenshot showing different types of control charts in Minitab.
9. Figure 15.9 A screenshot with different modeling possibilities in Minitab.
10. Figure 15.10 Output from the power and sample size procedure for the comparison of means test.
11. Figure 15.11 The menu option for a Gage R&R study to validate the measurement system in Minitab.
12. Figure 15.12 Representation of results (using histograms) in the case study of the bakery.
13. Figure 15.13 Schematic representation of the data collection procedure for the glass bottles case study.
14. Figure 15.14 Representation of results (using a multivari chart) in the case study of the glass bottles.
15. Figure 15.15 Matrix plot of all variables in the power plant case study.
16. Figure 15.16 Scatterplot of yield versus power (with outlier) in the power plant case study.
17. Figure 15.17 Scatterplot of yield versus power (without outlier) in the power plant case study.
18. Figure 15.18 Dotplot of factor form in the power plant case study.
19. Figure 15.19 Dotplot of the logarithm of factor form in the power plant case study.
20. Figure 15.20 Interaction plot for pressure and temperature in the steering wheels case study.
21. Figure 15.21 Normal probability plot of the effects in the steering wheels case study.
22. Figure 15.22 Interaction plot for ratio and weather in the steering wheels case study.
Chapter 16
1. Figure 16.1 Statistical discovery.
2. Figure 16.2 The LPCVD data (partial view).
3. Figure 16.3 Pattern of missing thickness data.
4. Figure 16.4 Map of all the thickness values.
5. Figure 16.5 XBar‐R chart of film thickness.
6. Figure 16.6 Three‐way chart of film thickness.
7. Figure 16.7 The water quality data.
8. Figure 16.8 Field stations in the Savannah River Basin.
9. Figure 16.9 Bivariate correlation of Ys.
10. Figure 16.10 The PLS personality of fit model.
11. Figure 16.11 Fitting and comparing multiple PLS models.
12. Figure 16.12 The dual role of terms in a PLS model.
13. Figure 16.13 Interactively profiling four Ys in the space of 12 Xs.
14. Figure 16.14 Prediction accuracy of the final PLS model for test data.
15. Figure 16.15 InfoQ assessment of Example 2 with uncertainty.

Foreword

I am often invited to assess research proposals. Included amongst the questions I have to ask myself in such assessments are: Are the goals stated sufficiently clearly? Does the study have a good chance of achieving the stated goals? Will the researchers be able to obtain sufficient quality data for the project? Are the analysis methods adequate to answer the questions? And so on. These questions are fundamental, not merely for research proposals, but for any empirical study – for any study aimed at extracting useful information from evidence or data. And yet they are rarely overtly stated. They tend to lurk in the background, with the capability of springing into the foreground to bite those who failed to think them through.

These questions are precisely the sorts of questions addressed by the InfoQ – Information Quality – framework. Answering such questions allows funding bodies, corporations, national statistical institutes, and other organisations to rank proposals, balance costs against success probability, and also to identify the weaknesses and hence improve proposals and their chance of yielding useful and valuable information. In a context of increasing constraints on financial resources, it is critical that money is well spent, so that maximising the chance that studies will obtain useful information is becoming more and more important. The InfoQ framework provides a structure for maximising these chances.

A glance at the statistics shelves of any technical library will reveal that most books focus narrowly on the details of data analytic methods. The same is true of almost all statistics teaching. This is all very well – it is certainly vital that such material be covered. After all, without an understanding of the basic tools, no analysis, no knowledge extraction would be possible. But such a narrow focus typically fails to place such work in the broader context, without which its chances of success are damaged. This volume will help to rectify that oversight. It will provide readers with insight into and understanding of other key parts of empirical analysis, parts which are vital if studies are to yield valid, accurate, and useful conclusions.

But the book goes beyond merely providing a framework. It also delves into the details of these overlooked aspects of data analysis. It discusses the fact that the same data may be high quality for one purpose and low for another, and that the adequacy of an analysis depends on the data and the goal, as well as depending on other less obvious aspects, such as the accessibility, completeness, and confidentiality of the data. And it illustrates the ideas with a series of illuminating applications.

With computers increasingly taking on the mechanical burden of data analytics, the opportunities are becoming greater for us to shift our attention to the higher order aspects of analysis: to precise formulation of the questions, to consideration of data quality to answer those questions, to choice of the best method for the aims, taking account of the entire context of the analysis. In doing so, we improve the quality of the conclusions we reach. And this, in turn, leads to improved decisions for researchers, policy makers, managers, and others. This book will provide an important tool in this process.

David J. Hand

Imperial College London

Preface

This book is about a strategic and tactical approach to data analysis in which providing added value by turning numbers into insights is the main goal of an empirical study. In our long‐time experience as applied statisticians and data mining researchers (“data scientists”), we focused on developing methods for data analysis and applying them to real problems. Our experience has been, however, that data analysis is part of a bigger process that begins with problem elicitation, which consists of defining unstructured problems, and ends with decisions on action items and interventions that reflect the true impact of a study.

In 2006, the first author published a paper on the statistical education bias where, typically, in courses on statistics and data analytics, only statistical methods are taught, without reference to the statistical analysis process (Kenett and Thyregod, 2006).

In 2010, the second author published a paper showing the differences between statistical modeling aimed at prediction goals versus modeling designed to explain causal effects (Shmueli, 2010), the implication being that the goal of a study should affect the way a study is performed, from data collection to data pre‐processing, exploration, modeling, validation, and deployment. A related paper (Shmueli and Koppius, 2011) focused on the role of predictive analytics in theory building and scientific development in the explanatory‐dominated social sciences and management research fields.

In 2014, we published “On Information Quality” (Kenett and Shmueli, 2014), a paper designed to lay out the foundation for a holistic approach to data analysis (using statistical modeling, data mining approaches, or any other data analysis methods) by structuring the main ingredients of what turns numbers into information. We called the approach information quality (InfoQ) and identified four InfoQ components and eight InfoQ dimensions.

Our main thesis is that data analysis, and especially the fields of statistics and data science, need to adapt to modern challenges and technologies by developing structured methods that provide a broad life cycle view, that is, from numbers to insights. This life cycle view needs to be focused on generating InfoQ as a key objective (for more on this see Kenett, 2015).

This book, Information Quality: The Potential of Data and Analytics to Generate Knowledge, offers an extensive treatment of InfoQ and the InfoQ framework. It is aimed at motivating researchers to further develop InfoQ elements and at students in programs that teach them how to make sure their analytic or statistical work is generating information of high quality.

Addressing this mixed community has been a challenge. On the one hand, we wanted to provide academic considerations, and on the other hand, we wanted to present examples and cases that motivate students and practitioners and give them guidance in their own specific projects.

We try to achieve this mix of objectives by combining Part I, which is mostly methodological, with Part II, which is based on examples and case studies.

In Part III, we treat additional topics relevant to InfoQ such as reproducible research, the review of scientific and applied research publications, the incorporation of InfoQ in academic and professional development programs, and how three leading software platforms, R, MINITAB, and JMP, support InfoQ implementations.

Researchers interested in applied statistics methods and strategies will most likely start in Part I and then move to Part II to see illustrations of the InfoQ framework applied in different domains. Practitioners and students learning how to turn numbers into information can start in a relevant chapter of Part II and move back to Part I.

A teacher or designer of a course on data analytics, applied statistics, or data science can build on examples in Part II and consolidate the approach by covering Chapter 13 and the chapters in Part I. Chapter 13 on “Integrating InfoQ into data science analytics programs, research methods courses, and more” was specially prepared for this audience. We also developed five case studies that can be used by teachers as a rating‐based InfoQ assessment exercise (available at http://infoq.galitshmueli.com/class-assignment).

In developing InfoQ, we received generous inputs from many people. In particular, we would like to acknowledge insightful comments by Sir David Cox, Shelley Zacks, Benny Kedem, Shirley Coleman, David Banks, Bill Woodall, Ron Snee, Peter Bruce, Shawndra Hill, Christine Anderson Cook, Ray Chambers, Fritz Sheuren, Ernest Foreman, Philip Stark, and David Steinberg. The motivation to apply InfoQ to the review of papers (Chapter 12) came from a comment by Ross Sparks who wrote to us: “I really like your framework for evaluating information quality and I have started to use it to assess papers that I am asked to review. Particularly applied papers.” In preparing the material, we benefited from comprehensive editorial inputs by Raquelle Azran and Noa Shmueli who generously provided us their invaluable expertise—we would like to thank them and recognize their help in improving the text language and style.

The last three chapters were contributed by colleagues. They create a bridge between theory and practice showing how InfoQ is supported by R, MINITAB, and JMP. We thank the authors of these chapters, Silvia Salini, Federica Cugnata, Elena Siletti, Ian Cox, Pere Grima, Lluis Marco‐Almagro, and Xavier Tort‐Martorell, for their effort, which helped make this work both theoretical and practical.

We are especially thankful to Professor David J. Hand for preparing the foreword of the book. David has been a source of inspiration to us for many years and his contribution highlights the key parts of our work.

In the course of writing this book and developing the InfoQ framework, the first author benefited from numerous discussions with colleagues at the University of Turin, in particular with a great visionary of the role of applied statistics in modern business and industry, the late Professor Roberto Corradetti. Roberto has been a close friend and has greatly influenced this work by continuously emphasizing the need for statistical work to be appreciated by its customers in business and industry. In addition, the financial support of the Diego de Castro Foundation that he managed has provided the time to work in a stimulating academic environment at both the Faculty of Economics and the “Giuseppe Peano” Department of Mathematics of UNITO, the University of Turin. The contributions of Roberto Corradetti cannot be overstated and are humbly acknowledged. Roberto passed away in June 2015 and left behind a great void. The second author thanks participants of the 2015 Statistical Challenges in eCommerce Research Symposium, where she presented the keynote address on InfoQ, for their feedback and enthusiasm regarding the importance of the InfoQ framework to current social science and management research.

Finally, we acknowledge with pleasure the professional help of the Wiley personnel, including Heather Kay, Alison Oliver, and Adalfin Jayasingh, and thank them for their encouragement, comments, and input, which were instrumental in improving the form and content of the book.

Ron S. Kenett and Galit Shmueli

References

Kenett, R.S. (2015) Statistics: a life cycle view (with discussion). Quality Engineering, 27(1), pp. 111–129.
Kenett, R.S. and Shmueli, G. (2014) On information quality (with discussion). Journal of the Royal Statistical Society, Series A, 177(1), pp. 3–38.
Kenett, R.S. and Thyregod, P. (2006) Aspects of statistical consulting not taught by academia. Statistica Neerlandica, 60(3), pp. 396–412.
Shmueli, G. (2010) To explain or to predict? Statistical Science, 25, pp. 289–310.
Shmueli, G. and Koppius, O.R. (2011) Predictive analytics in information systems research. MIS Quarterly, 35(3), pp. 553–572.

Quotes about the book

What experts say about Information Quality: The Potential of Data and Analytics to Generate Knowledge:

A glance at the statistics shelves of any technical library will reveal that most books focus narrowly on the details of data analytic methods. The same is true of almost all statistics teaching. This volume will help to rectify that oversight. It will provide readers with insight into and understanding of other key parts of empirical analysis, parts which are vital if studies are to yield valid, accurate, and useful conclusions.

David Hand

Imperial College, London, UK

There is an important distinction between data and information. Data become information only when they serve to inform, but what is the potential of data to inform? With the work Kenett and Shmueli have done, we now have a general framework to answer that question. This framework is relevant to the whole analysis process, showing the potential to achieve higher‐quality information at each step.

John Sall

SAS Institute, Cary, NC, USA

The authors have a rare quality: being able to present deep thoughts and sound approaches in a way practitioners can feel comfortable and understand when reading their work and, at the same time, researchers are compelled to think about how they do their work.

Fabrizio Ruggeri

Consiglio Nazionale delle Ricerche
Istituto di Matematica Applicata e Tecnologie Informatiche, Milan, Italy

No amount of technique can make irrelevant data fit for purpose, eliminate unknown biases, or compensate for data paucity. Useful, reliable inferences require balancing real‐world and theoretical considerations and recognizing that goals, data, analysis, and costs are necessarily connected. Too often, books on statistics and data analysis put formulae in the limelight at the expense of more important questions about the relevance and limitations of data and the purpose of the analysis. This book elevates these crucial issues to their proper place and provides a systematic structure (and examples) to help practitioners see the larger context of statistical questions and, thus, to do more valuable work.

Philip Stark

University of California, Berkeley, USA

…the “Q” issue is front and centre for anyone (or any agency) hoping to benefit from the data tsunami that is said to be driving things now … And so the book will be very timely.

Ray Chambers

University of Wollongong, Australia

Kenett and Shmueli shed light on the biggest contributor to erroneous conclusions in research ‐ poor information quality coming out of a study. This issue ‐ made worse by the advent of Big Data ‐ has received too little attention in the literature and the classroom. Information quality issues can completely undermine the utility and credibility of a study, yet researchers typically deal with it in an ad‐hoc, offhand fashion, often when it is too late. Information Quality offers a sensible framework for ensuring that the data going into a study can effectively answer the questions being asked.

Peter Bruce

The Institute for Statistics Education

Policy makers rely on high quality and relevant data to make decisions and it is important that, as more and different types of data become available, we are mindful of all aspects of the quality of the information provided. This includes not only statistical quality, but other dimensions as outlined in this book including, very importantly, whether the data and analyses answer the relevant questions.

John Pullinger

National Statistician, UK Statistics Authority, London, UK

This impressive book fills a gap in the teaching of statistical methodology. It deals with a neglected topic in statistical textbooks: the quality of the information provided by the producers of statistical projects and used by the customers of statistical data from surveys, administrative data, etc. The book's emphasis on defining, discussing, and analyzing the goal of the project at a preliminary stage and, no less important, at the analysis stage and in the use of the results obtained is of major importance.

Moshe Sikron

Former Government Statistician of Israel, Jerusalem, Israel

Ron Kenett and Galit Shmueli belong to a class of practitioners who go beyond methodological prowess into questioning what purpose should be served by a data based analysis, and what could be done to gauge the fitness of the analysis to meet its purpose. This kind of insight is all the more urgent given the present climate of controversy surrounding science’s own quality control mechanism. In fact, science used in support of economic or policy decisions, be it natural or social science, has an evident sore point precisely in the sort of statistical and mathematical modelling where the approach they advocate, Information Quality or InfoQ, is most needed. A full chapter is specifically devoted to the contribution InfoQ can make to clarifying aspects of reproducibility, repeatability, and replicability of scientific research and publications. InfoQ is an empirical and flexible construct with practically infinite application in data analysis. In a context of policy, one can deploy InfoQ to compare different evidential bases for or against a policy, or different options in an impact assessment case. InfoQ is a holistic construct encompassing the data, the method and the goal of the analysis. It goes beyond the dimensions of data quality met in official statistics and resembles more holistic concepts of performance such as analysis pedigrees (NUSAP) and sensitivity auditing. Thus InfoQ includes consideration of an analysis’s Generalizability and Operationalization. The latter includes both action operationalization (to what extent concrete actions can be derived from the information provided by a study) and construct operationalization (to what extent a construct under analysis is effectively captured by the selected variables for a given goal). A desirable feature of InfoQ is that it demands multidisciplinary skills, which may force statisticians to move out of their comfort zone into the real world.
The book illustrates the eight dimensions of InfoQ with a wealth of examples. A recommended read for applied statisticians and econometricians who care about the implications of their work.

Andrea Saltelli

European Centre for Governance in Complexity

Kenett and Shmueli have made a significant contribution to the profession by drawing attention to what is frequently the most important but overlooked aspect of analytics: information quality. For example, statistics textbooks too often assume that data consist of random samples and are measured without error, and data science competitions implicitly assume that massive data sets contain high‐quality data and are exactly the data needed for the problem at hand. In reality, of course, random samples are the exception rather than the rule, and many data sets, even very large ones, are not worth the effort required to analyze them. Analytics is akin to mining, not to alchemy; the methods can only extract what is there to begin with. Kenett and Shmueli make clear the point that obtaining good data typically requires significant effort. Fortunately, they present metrics to help analysts understand the limitations of the information in hand, and how to improve it going forward. Kudos to the authors for this important contribution.

Roger Hoerl

Union College, Schenectady, NY, USA