Contents

Preface

Chapter 1: What is Data Analysis?

1.1 Tukey’s 1962 Paper

1.2 The Path of Statistics

Chapter 2: Strategy Issues in Data Analysis

2.1 Strategy in Data Analysis

2.2 Philosophical Issues

2.3 Issues of Size

2.4 Strategic Planning

2.5 The Stages of Data Analysis

2.6 Tools Required for Strategy Reasons

Chapter 3: Massive Data Sets

3.1 Introduction

3.2 Disclosure: Personal Experiences

3.3 What Is Massive? A Classification of Size

3.4 Obstacles to Scaling

3.5 On the Structure of Large Data Sets

3.6 Data Base Management and Related Issues

3.7 The Stages of a Data Analysis

3.8 Examples and Some Thoughts on Strategy

3.9 Volume Reduction

3.10 Supercomputers and Software Challenges

3.11 Summary of Conclusions

Chapter 4: Languages for Data Analysis

4.1 Goals and Purposes

4.2 Natural Languages and Computing Languages

4.3 Interface Issues

4.4 Miscellaneous Issues

4.5 Requirements for a General Purpose Immediate Language

Chapter 5: Approximate Models

5.1 Models

5.2 Bayesian Modeling

5.3 Mathematical Statistics and Approximate Models

5.4 Statistical Significance and Physical Relevance

5.5 Judicious Use of a Wrong Model

5.6 Composite Models

5.7 Modeling the Length of Day

5.8 The Role of Simulation

5.9 Summary of Conclusions

Chapter 6: Pitfalls

6.1 Simpson’s Paradox

6.2 Missing Data

6.3 Regression of Y on X or of X on Y?

Chapter 7: Create Order in Data

7.1 General Considerations

7.2 Principal Component Methods

7.3 Multidimensional Scaling

7.4 Correspondence Analysis

7.5 Multidimensional Scaling vs. Correspondence Analysis

Chapter 8: More Case Studies

8.1 A Nutshell Example

8.2 Shape Invariant Modeling

8.3 Comparison of Point Configurations

8.4 Notes on Numerical Optimization

References

Index

Copyright © 2011 by John Wiley & Sons, Inc. All rights reserved

Published by John Wiley & Sons, Inc., Hoboken, New Jersey

Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.


Library of Congress Cataloging-in-Publication Data:

Huber, Peter J.

Data analysis: what can be learned from the past 50 years / Peter J. Huber.

p. cm. — (Wiley series in probability and statistics; 874)

Includes bibliographical references and index.

ISBN 978-1-118-01064-8 (hardback)

1. Mathematical statistics—History. 2. Mathematical statistics—Philosophy. 3. Numerical analysis—Methodology. I. Title.

QA276.15.H83 2011

519.509—dc22 2010043284

PREFACE

“These prolegomena are not for the use of apprentices, but of future teachers, and indeed are not to help them to organize the presentation of an already existing science, but to discover the science itself for the first time.” (Immanuel Kant, transl. Gary Hatfield.)

“Diese Prolegomena sind nicht zum Gebrauch vor Lehrlinge, sondern vor künftige Lehrer, und sollen auch diesen nicht etwa dienen, um den Vortrag einer schon vorhandnen Wissenschaft anzuordnen, sondern um diese Wissenschaft selbst allererst zu erfinden.” (Immanuel Kant, 1783)

How do you learn data analysis?

First, how do you teach data analysis? This is a task going beyond “organizing the presentation of an already existing science”. At several academic institutions, we tried to teach it in class, as part of an Applied Statistics requirement. Sometimes I was actively involved, but mostly I played the part of an interested observer. In all cases we ultimately failed. Why?

I believe the main reason was: it is easy to teach data analytic techniques, but it is difficult to teach their use in actual applied situations. We could not force the students to immerse themselves into the underlying subject matter. For the students, acquiring the necessary background information simply was too demanding an effort, in particular since not every subject matter appeals to every person. What worked, at least sort of, was the brutal approach we used in the applied part of the Ph.D. qualifying exam at Harvard. We handed the students a choice of some non-trivial data analysis problems. We gave them some general advice. They were asked to explain the purpose of the study furnishing the data, what questions they would want to answer and what questions they could answer with the data at hand. They should explore the data, decide on the appropriate techniques and check the validity of their results. But we refrained from giving problem-specific hints. They could prepare themselves for the exam by looking at problems and exemplary solutions from previous years. Occasionally, the data had some non-obvious defects – such as the lack of proper matching between carriers and controls, mentioned in Section 2.3. The hidden agenda behind giving those problems was not the surface issue of exam grades, but an attempt to force our students to think about non-trivial data analysis issues.

Clearly, the catalog of tools a data analyst can draw upon is immense – I recall John Tukey’s favorite data analytic aphorism: All things are lawful for me, but not all things are helpful (1 Cor 6.12). Any attempt to cover in class more than a tiny fraction of the potentially useful tools would be counterproductive. But there are systematic defects and omissions – some of which had been criticized already back in 1940 by Deming. The typical curricula taught in statistics departments are too sanitized to form a basis for an education in data analysis. For example, they hardly ever draw attention to the pitfalls of Simpson’s paradox. As a student, you do not learn to use general nonlinear fitting methods, such as nonlinear least squares. Methods that do not produce quantitative outcomes (like P-values), but merely pictorial ones, such as correspondence analysis or multidimensional scaling, as a rule are neglected.

As a data analyst, you often will have to improvise your own approaches. For the student, it therefore is indispensable to become fluent in a computer language suitable for data analysis, that is, a programmable command language (sometimes, but not entirely accurately, called a script language).

I worry about an increasing tendency, which nowadays seems to affect university instruction quite generally. Namely, instead of fostering creative improvisation and innovation by teaching the free use of the given means (I am paraphrasing Clausewitz, quoted at the opening of Section 2.2.1), one merely teaches the use of canned program packages, which has the opposite effect and is stifling.

But to come back to the initial question: how do you learn data analysis? In view of our negative experiences with the course work approach, I believe it has to be learned on the job, through apprenticeship and anecdotes rather than through systematic exposition. This may very well be the best way to teach it; George Box once pointedly remarked that you do not learn to swim from books and lectures on the theory of buoyancy (Box 1990). The higher aspects of the art you have to learn on the job.

But there are deeper questions concerning the fundamentals of data analysis. By 1990 I felt that I knew enough about the opportunities and technicalities of interactive data analysis and graphics, and I drifted into what you might call the philosophy of data analysis. Mere concern with tools and techniques is not sufficient; it is necessary to concern oneself with their proper use – when to use which technique – that is, with questions of strategy.

The advent of ever larger data sets with ever more complicated structures – some illustrative examples are discussed in Section 3.8 – forces one to re-think basic issues. I became upset about the tunnel vision of many statisticians who only see homogeneous data sets and their exclusive emphasis on tools suited for dealing with homogeneous, unstructured data. Then there is the question of how to provide suitable computing support, to permit the data analyst the free and creative use of the available means. I already mentioned that script languages are indispensable, but how should they be structured? Finally, I had again to come back to the question of how one can model unsanitized data, and how to deal with such models.

The central four chapters of this book are based on four workshop contributions of mine concerned with these issues: strategy, massive data sets, computing languages, and models. The four papers have stood the test of time remarkably well in this fast-living computer world. They are sprinkled with examples and case studies. In order that the chapters can be read independently of each other, I decided to let some overlaps stand.

The remaining chapters offer an idiosyncratic choice of issues, putting the emphasis on notoriously neglected topics. They are discussed with the help of more examples and case studies, mostly drawn from personal experiences collected over the past 50 years. Most of the examples were chosen to be small, so that the data could be presented on a few printed pages, but I believe their relevance extends far beyond small sets. On purpose, I have presented one large case study in unsanitized, gory detail (Section 5.7). Chapter 6 is concerned with what I consider some of the most pernicious pitfalls of data mining, namely Simpson’s paradox (or the neglect of inhomogeneity), invisible missing values, and conceptual misinterpretation of regression. Chapter 7 is concerned with techniques to create order in the data, that is, with techniques needed as first steps when one is confronted with inhomogeneous data, and Chapter 8 is concerned with some mixed issues, among them dimension reduction through nonlinear local modeling, including a brief outline of numerical optimization. I believe that the examples and case studies presented in this book cover in reasonable detail all stages of data analysis listed in Section 2.5, with the exception of the last: “Presentation of Conclusions”. The issues of the latter are briefly discussed in Section 2.5.9, but they are not very conducive to presentation in the form of a case study. The case studies also happen to exemplify applications of some of the most helpful tools about whose neglect in the education of statistics students I have complained above: the singular value decomposition, nonlinear weighted least squares, simulation of stochastic models, scatter- and curve plots.

Finally, since most of my work as a mathematical statistician has been on robustness, it is appropriate that I state my credo with regard to that topic. In data analysis, robustness has pervasive importance, but it forms part of a general diligence requirement and therefore stays mostly beneath the surface. In my opinion the crucial attribute of robust methods is stability under small perturbations of the model, see in particular the discussion at the beginning of Section 5.1. Robustness is more than a bag of procedures. It should rather be regarded as a state of mind: a statistician should keep in mind that all aspects of a data analytic setup (experimental design, data collection, models, procedures) must be handled in such a way that minor deviations from the assumptions cannot have large effects on the results (a robustness problem), and that major deviations can be discovered (a diagnostic problem). Only a small part of this can be captured by theorems and proofs, or by canned computer procedures.

PETER J. HUBER

Klosters December 2010

CHAPTER 1

WHAT IS DATA ANALYSIS?

Data analysis is concerned with the analysis of data – of any kind, and by any means. If statistics is the art of collecting and interpreting data, as some have claimed, ranging from planning the collection to presenting the conclusions, then it covers all of data analysis (and some more). On the other hand, while much of data analysis is not statistical in the traditional sense of the word, it sooner or later will put to good use every conceivable statistical method, so the two terms are practically coextensive. But with Tukey (1962) I generally prefer “data analysis” over “statistics” because the latter term is used by many in an overly narrow sense, covering only those aspects of the field that can be captured through mathematics and probability.

I was fortunate to get an early head start. My involvement with data analysis (together with that of my wife, who then was working on her thesis in X-ray crystallography) goes back to the late 1950s, a couple of years before I thought of switching from topology into mathematical statistics. At that time we both began to program computers to assist us in the analysis of data – I got involved through my curiosity about the novel tool. In 1970, we were fortunate to participate in the arrival of non-trivial 3-d computer graphics, and at that time we even programmed a fledgling expert system for molecular model fitting. From the late 1970s onward, we got involved in the development and use of immediate languages for the purposes of data analysis.

Clearly, my thinking has been influenced by the philosophical sections of Tukey’s paper on “The Future of Data Analysis” (1962). While I should emphasize that in my view data analysis is concerned with data sets of any size, I shall pay particular attention to the requirements posed by large sets – data sets large enough to require computer assistance, and possibly massive enough to create problems through sheer size – and concentrate on ideas that have the potential to extend beyond small sets. For this reason there will be little overlap with books such as Tukey’s Exploratory Data Analysis (EDA) (1977), which was geared toward the analysis of small sets by pencil-and-paper methods.

Data analysis is rife with unreconciled contradictions, and it suffices to mention a few. Most of its tools are statistical in nature. But then, why is most data analysis done by non-statisticians? And why are most statisticians data shy and reluctant even to touch large data bases? Major data analyses must be planned carefully and well in advance. Yet, data analysis is full of surprises, and the best plans will constantly be thrown off track. Any consultant concerned with more than one application is aware that there is a common unity of data analysis, hidden behind a diversity of language, and stretching across most diverse fields of application. Yet, it does not seem to be feasible to learn it and its general principles that span across applications in the abstract, from a textbook: you learn it on the job, by apprenticeship, and by trial and error. And if you try to teach it through examples, using actual data, you have to walk a narrow line between getting bogged down in details of the domain-specific background, or presenting unrealistic, sanitized versions of the data and of the associated problems.

Very recently, the challenge posed by these contradictions has been addressed in a stimulating workshop discussion by Sedransk et al. (2010, p. 49), as Challenge #5 – To use available data to advance education in statistics. The discussants point out that a certain geological data base “has created an unforeseen enthusiasm among geology students for data analysis with the relatively simple internal statistical methodology.” That is, to use my terminology, the appetite of those geology students for exploring their data had been whetted by a simple-minded decision support system, see Section 2.5.9. The discussants wonder whether “this taste of statistics has also created a hunger for [...] more advanced statistical methods.” They hope that “utilizing these large scientific databases in statistics classes allows primary investigation of interdisciplinary questions and application of exploratory, high-dimensional and/or other advanced statistical methods by going beyond textbook data sets.” I agree in principle, but my own expectations are much less sanguine. No doubt, the appetite grows with the eating (cf. Section 2.5.9), but you can spoil it by offering too much sophisticated and exotic food. It is important to leave some residual hunger! Instead of fostering creativity, you may stifle ingenuity and free improvisation by overwhelming the user with advanced methods. The geologists were in a privileged position: the geophysicists have a long-standing, ongoing strong relation with probability and statistics – just think of Sir Harold Jeffreys! – and the students were motivated by the data. The (unsurmountable?) problem with statistics courses is that it is difficult to motivate statistics students to immerse themselves in the subject matter underlying those large scientific data bases.

But still, the best way to convey the principles, rather than the mere techniques of data analysis, and to prepare the general mental framework, appears to be through anecdotes and case studies, and I shall try to walk this way. There are more than enough textbooks and articles explaining specific statistical techniques. There are not enough texts concerned with issues of overall strategy and tactics, with pitfalls, and with statistical methods (mostly graphical) geared toward providing insight rather than quantifiable results. So I shall concentrate on those, to the detriment of the coverage of specific techniques. My principal aim is to distill the most important lessons I have learned from half a century of involvement with data analysis, in the hope of laying the groundwork for a future theory. Indeed, originally I had been tempted to give this book the ambitious programmatic title: Prolegomena to the Theory and Practice of Data Analysis.

Some comments on Tukey’s paper and some speculations on the path of statistics may be appropriate.

1.1 TUKEY’S 1962 PAPER

Half a century ago, Tukey in an ultimately enormously influential paper (Tukey 1962) redefined our subject; see Mallows (2006) for a retrospective review. It introduced the term “data analysis” as a name for what applied statisticians do, differentiating this from formal statistical inference. But actually, as Tukey admitted, he “stretched the term beyond its philology” to such an extent that it comprised all of statistics. The influence of Tukey’s paper was not immediately recognized. Even for me, who had been exposed to data analysis early on, it took several years until I assimilated its import and recognized that a separation of “statistics” and “data analysis” was harmful to both.

Tukey opened his paper with the words:

For a long time I have thought that I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. And when I have pondered about why such techniques as the spectrum analysis of time series have proved so useful, it has become clear that their “dealing with fluctuations” aspects are, in many circumstances, of lesser importance than the aspects that would already have been required to deal effectively with the simpler case of very extensive data, where fluctuations would no longer be a problem. All in all, I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.

Large parts of data analysis are inferential in the sample-to-population sense, but these are only parts, not the whole. Large parts of data analysis are incisive, laying bare indications which we could not perceive by simple and direct examination of the raw data, but these too are parts, not the whole. Some parts of data analysis [...] are allocation, in the sense that they guide us in the distribution of effort [...]. Data analysis is a larger and more varied field than inference, or incisive procedures, or allocation.

A little later, Tukey emphasized:

Data analysis, and the parts of statistics which adhere to it, must then take on the characteristics of a science rather than those of mathematics, specifically:

(1) Data analysis must seek for scope and usefulness rather than security.

(2) Data analysis must be willing to err moderately often in order that inadequate evidence shall more often suggest the right answer.

(3) Data analysis must use mathematical argument and mathematical results as bases for judgment rather than as bases for proofs or stamps of validity.

A few pages later he is even more explicit: “In data analysis we must look to a very heavy emphasis on judgment.” He elaborates that at least three different sorts or sources of judgment are likely to be involved in almost every instance: judgment based on subject matter experience, judgment based on broad experience of how particular techniques have worked out in a variety of fields of application, and judgment based on abstract results, whether obtained by mathematical proofs or empirical sampling.

In my opinion the main, revolutionary influence of Tukey’s paper indeed was that he shifted the primacy of statistical thought from mathematical rigor and optimality proofs to judgment. This was an astounding shift of emphasis, not only for the time (the early 1960s), but also for the journal in which his paper was published, and last but not least, in view of Tukey’s background – he had written a Ph.D. thesis in pure mathematics, and one variant of the axiom of choice had been named after him.

Another remark of Tukey also deserves to be emphasized: “Large parts of data analysis are inferential in the sample-to-population sense, but these are only parts, not the whole.” As of today, too many statisticians still seem to cling to the traditional view that statistics is inference from samples to populations (or: virtual populations). Such a view may serve to separate mathematical statistics from probability theory, but is much too exclusive otherwise.

1.2 THE PATH OF STATISTICS

This section describes my impression of how the state of our subject has developed in the five decades since Tukey’s paper. I begin with quotes lifted from conferences on future directions for statistics. As a rule, the speakers expressed concern about the sterility of academic statistics and recommended seeking renewed input from applications. I am quoting two of the more colorful contributions. G. A. Barnard said at the Madison conference on the Future of Statistics (Watts, ed. (1968)):

Most theories of inference tend to stifle thinking about ingenuity and may indeed tend to stifle ingenuity itself. Recognition of this is one expression of the attitude conveyed by some of our brethren who are more empirical than thou and are always saying, ‘Look at the data.’ That is, their message seems to be, in part, ‘Break away from stereotyped theories that tend to stifle ingenious insights and do something else.’

And H. Robbins said at the Edmonton conference on Directions for Mathematical Statistics (Ghurye, ed. (1975)):

An intense preoccupation with the latest technical minutiae, and indifference to the social and intellectual forces of tradition and revolutionary change, combine to produce the Mandarinism that some would now say already characterizes academic statistical theory and is most likely to describe its immediate future. [... T]he statisticians of the past came into the subject from other fields – astronomy, pure mathematics, genetics, agronomy, economics etc. – and created their statistical methodology with a background of training in a specific scientific discipline and a feeling for its current needs. [...]

So for the future I recommend that we work on interesting problems [and] avoid dogmatism.

At the Edmonton conference, my own diagnosis of the situation had been that too many of the activities in mathematical statistics belonged to the later stages of what I called ‘Phase Three’:

In statistics as well as in any other field of applied mathematics (taken in the wide sense), one can usually distinguish (at least) three phases in the development of a problem. In Phase One, there is a vague awareness of an area of open problems, one develops ad hoc solutions to poorly posed questions, and one gropes for the proper concepts. In Phase Two, the ‘right’ concepts are found, and a viable and convincing theoretical (and therefore mathematical) treatment is put together.

In Phase Three, the theory begins to have a life of its own, its consequences are developed further and further, and its boundaries of validity are explored by leading it ad absurdum; in short, it is squeezed dry.

A few years later, in a paper entitled “Data Analysis: in Search of an Identity” (Huber 1985a), I tried to identify the then current state of our subject. I speculated that statistics is evolving, in the literal sense of that word, along a widening spiral. After a while the focus of concern returns, although in a different track, to an earlier stage of the development and takes a fresh look at business left unfinished during the last turn (see Exhibit 1.1).

During much of the 19th century, from Playfair to Karl Pearson, descriptive statistics, statistical graphics and population statistics had flourished. The Student-Fisher-Neyman-Egon Pearson-Wald phase of statistics (roughly 1920–1960) can be considered a reaction to that period. It stressed those features in which its predecessor had been deficient and paid special attention to small sample statistics, to mathematical rigor, to efficiency and other optimality properties, and coincidentally, to asymptotics (because few finite sample problems allow closed form solutions).

I expressed the view that we had entered a new developmental phase. People would usually associate this phase with the computer, which without doubt was an important driving force, but there was more to it, namely another go-around at the features that had been in fashion a century earlier but had been neglected by 20th century mathematical statistics, this time under the banner of data analysis. Quite naturally, because this was a strong reaction to a great period, one sometimes would go overboard and evoke the false impression that probability models were banned from exploratory data analysis.

There were two hostile camps, the mathematical statisticians and the exploratory data analysts (I felt at home in both). I still remember an occasion in the late 1980s, when I lectured on high-interaction graphics and exploratory data analysis and a prominent person rose and asked in shocked tones whether I was aware that what I was doing amounted to descriptive statistics, and whether I really meant it!

Still another few years later I elaborated my speculations on the path of statistics (Huber 1997a). In the meantime I had come to the conclusion that my analysis would have to be amended in two respects: first, the focus of attention only in part moves along the widening spiral. A large part of the population of statisticians remains caught in holding patterns corresponding to an earlier paradigm, presumably one imprinted on their minds at a time when they had been doing their thesis work, and too many members of the respective groups are unable to see beyond the edge of the eddy they are caught in, thereby losing sight of the whole, of a dynamically evolving discipline. Unwittingly, a symptomatic example had been furnished in 1988 by the editors of Statistical Science. Then they re-published Harold Hotelling’s 1940 Presidential Address on “The Teaching of Statistics”, but forgot all about Deming’s brief (one and a half pages), polite but poignant discussion. This discussion is astonishing. It puts the finger on deficiencies of Hotelling’s otherwise excellent and balanced outline, it presages Deming’s future role in quality control, and it also anticipates several of the sentiments voiced by Tukey more than twenty years later. Deming endorses Hotelling’s recommendations but says that he takes it “that they are not supposed to embody all that there is in the teaching of statistics, because there are many other neglected phases that ought to be stressed.” In particular, he points out Hotelling’s neglect of simple graphical tools, and that he ignores problems arising from inhomogeneity. 
It suffices to quote three specific recommendations from Deming’s discussion: “The modern student, and too often his teacher, overlook the fact that such a simple thing as a scatter diagram is a more important tool of prediction than the correlation coefficient, especially if the points are labeled so as to distinguish the different sources of the data.” “Students are not usually admonished against grouping data from heterogeneous sources.” “Above all, a statistician must be a scientist.” Clearly, by 1988 a majority of the academic statistical community still was captivated by the Neyman-Pearson and de Finetti paradigms, and the situation had not changed much by 1997.
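Deming’s warning against grouping data from heterogeneous sources can be made concrete with a small numerical sketch. The data below are invented purely for illustration (they are not taken from Deming’s discussion or from any study mentioned in this book), and the helper name pearson_r is an assumption of the sketch. Within each of the two hypothetical sources the relation between x and y is perfectly decreasing, yet the pooled correlation coefficient comes out positive, which a scatter diagram with labeled points would reveal at a glance.

```python
import math


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data from two sources, made up for this illustration only.
sources = {
    "A": ([1, 2, 3, 4], [4, 3, 2, 1]),
    "B": ([6, 7, 8, 9], [9, 8, 7, 6]),
}

# Correlation computed separately within each labeled source.
r_within = {name: pearson_r(x, y) for name, (x, y) in sources.items()}

# Correlation computed after pooling the sources and dropping the labels.
pooled_x = [x for xs, _ in sources.values() for x in xs]
pooled_y = [y for _, ys in sources.values() for y in ys]
r_pooled = pearson_r(pooled_x, pooled_y)

for name, r in r_within.items():
    print(f"source {name}: r = {r:+.2f}")
print(f"pooled:   r = {r_pooled:+.2f}")
```

The sign reversal arises only because the two sources differ in level; the single pooled coefficient hides exactly the inhomogeneity that the labels would expose.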

Second, most data analysis is done by non-statisticians, and the commonality is hidden behind a diversity of languages. As a consequence, many applied fields have developed their own versions of statistics, together with their own specialized journals. The spiral has split into many roughly parallel paths and eddies. This leads to a Balkanization of statistics, and to the rise of sects and local gurus propagating “better” techniques which really are worse. For example, the crystallographers, typically confronted with somewhat long-tailed data distributions, abandoned mean absolute deviations in favor of root mean square deviations some years after the lack of robustness of the latter had become generally known among professional statisticians.
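The difference in sensitivity between the two scale measures can be sketched with a toy computation. The numbers below are invented (they are not crystallographic data), the function names mean_abs_dev and rms_dev are assumptions of the sketch, and replacing a single value by a gross error is only one simple way of mimicking a long tail. Under these assumptions, the root mean square deviation is inflated by a larger factor than the mean absolute deviation.

```python
import math


def mean_abs_dev(xs):
    """Mean absolute deviation about the sample mean."""
    m = sum(xs) / len(xs)
    return sum(abs(x - m) for x in xs) / len(xs)


def rms_dev(xs):
    """Root mean square deviation about the sample mean."""
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


# Invented well-behaved sample, and the same sample with one gross error.
clean = [-2.0, -1.0, -1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 2.0]
contaminated = clean[:-1] + [20.0]  # last value 2.0 replaced by 20.0

# Factor by which each scale measure grows because of the single bad value.
mad_ratio = mean_abs_dev(contaminated) / mean_abs_dev(clean)
rms_ratio = rms_dev(contaminated) / rms_dev(clean)

print(f"mean absolute deviation inflated by a factor of {mad_ratio:.2f}")
print(f"root mean square deviation inflated by a factor of {rms_ratio:.2f}")
```

Squaring gives the one aberrant value a disproportionate weight, which is why the root mean square deviation reacts more strongly. A fuller comparison would examine the sampling variability of both measures under a long-tailed model by simulation.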

Where are we now, another 13 years later? I am pleased to note that an increasing number of first-rate statisticians of the younger generation are committing “interdisciplinary adultery” (an expression jokingly used by Ronald Pyke in his “Ten Commandments” at the Edmonton conference) by getting deeply involved with interesting projects in the applied sciences. Also, it seems to me that some of the captivating eddies finally begin to fade. But new ones emerge in the data analysis area. I am referring for instance to the uncritical creation and propagation of poorly tested, hyped-up novel algorithms – this may have been acceptable in an early ‘groping’ phase, but by now, one would have expected some methodological consolidation and more self-control.

Prominent eddies of the classical variety are still there, though. I regretfully agree with the late Leo Breiman’s (2004) sharp criticism of the 2002 NSF Report on the Future of Statistics (“the report is a step into the past and not into the future”). See also Colin Mallows (2006). In my opinion the NSF report is to be faulted because it looked inward instead of outward, and that prevented it from looking beyond the theoretical “core” of statistics. Thereby, it was harking back to the time before Tukey’s seminal 1962 paper, rather than providing pointers to the time after Tukey’s “Future”. It gave a fine description of ivory tower theoretical statistics, but it pointedly excluded analysis of actual data. This is like giving a definition of physics that excludes experimental physics. Each of the two has no life without the other.

Despite such throwbacks I optimistically believe that in the 25 years since I first drew the spiral of Exhibit 1.1 we have moved somewhat further along as a profession and now are reaching a point alongside of Karl Pearson, a century later. The focus of concern is again on models, but this time on dealing with highly complex ones in the life sciences and elsewhere, and on assessing goodness-of-fit with the help of simulation. While many applied statisticians still seem to live exclusively within the classical framework of tests and confidence levels, it becomes increasingly obvious that one has to go beyond mere tests of goodness-of-fit. One must take seriously the admonition by McCullagh and Nelder (1983, p. 6) “that all models are wrong; some, though, are better than others and we can search for the better ones.” Apart from searching for better models, we must also learn when to stop the search, that is: we must address questions of model adequacy. This may be foremost among the unfinished business. Incidentally, models are one of the two issues where Tukey’s foresight had failed (in his 1962 paper he had eschewed modeling, and he had underestimated the impact of the computer).

But where are we going from here? I do not know. If my spiral gives any indication, we should return to business left unfinished during the last turnaround in the Student-Fisher-Neyman-Egon Pearson-Wald phase. The focus will not return to small samples and the concomitant asymptotics – this is finished business. But we might be headed towards another period of renewed interest in the theoretical (but this time not necessarily mathematical) penetration of our subject, with emphasis on the data analytic part.