Cover Page

IEEE Press

445 Hoes Lane

Piscataway, NJ 08854


IEEE Press Editorial Board

Tariq Samad, Editor in Chief


George W. Arnold
Dmitry Goldgof
Ekram Hossain
Mary Lanzerotti
Pui-In Mak
Ray Perez
Linda Shafer
MengChu Zhou
George Zobrist


Kenneth Moore, Director of IEEE Book and Information Services (BIS)


Technical Reviewer


Frank Alexander, Los Alamos National Laboratory

ERROR ESTIMATION FOR PATTERN RECOGNITION

ULISSES M. BRAGA-NETO
EDWARD R. DOUGHERTY










Wiley Logo




To our parents (in memoriam)
Jacinto and Consuêlo
and
Russell and Ann


And to our wives and children
Flávia, Maria Clara, and Ulisses and
Terry, Russell, John, and Sean

PREFACE

Error estimation determines the accuracy of the predictions that can be made by an inferred classifier, and therefore plays a fundamental role in all scientific applications involving classification. And yet the subject is neglected in most textbooks on pattern recognition and machine learning (with some admirable exceptions, such as the books by Devroye et al. (40) and McLachlan (110)). Most texts in the field describe a host of classification rules for training classifiers from data but treat the question of how to determine the accuracy of the inferred model only perfunctorily. Nevertheless, it is our belief, and the motivation behind this book, that the study of error estimation is at least as important as the study of classification rules themselves.

The error rate of interest could be the overall error rate on the population; however, it could also be the error rate on the individual classes. Were we to know the population distribution, we could in principle compute the error of a classifier exactly. Since we generally do not know the population distribution, we need to estimate the error from sample data. If the sample size is large enough relative to the number of features, an accurate error estimate is obtained by dividing the sample into training and testing sets, designing a classifier on the training set, and testing it on the test set. However, this is not possible if the sample size is small, in which case training and testing has to be performed on the same data. The focus of this book is on the latter case, which is common in situations where data are expensive, time-consuming, or difficult to obtain. In addition, in the case of “big data” applications, one is often dealing with small sample sizes, since the number of measurements far exceeds the number of sample points.
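The large-sample holdout procedure just described can be sketched in a few lines. The following is a minimal illustration on synthetic Gaussian data using a simple nearest-mean classifier; the data model, the classifier choice, and all names here are our own illustrative assumptions, not taken from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_mean_classifier(X_train, y_train):
    """Train a nearest-mean classifier and return a predictor function."""
    m0 = X_train[y_train == 0].mean(axis=0)
    m1 = X_train[y_train == 1].mean(axis=0)
    def predict(X):
        # assign each point to the class with the closer mean
        d0 = np.linalg.norm(X - m0, axis=1)
        d1 = np.linalg.norm(X - m1, axis=1)
        return (d1 < d0).astype(int)
    return predict

# Synthetic two-class Gaussian data: identity covariance, class-1 mean
# shifted by 1 in every coordinate (equal priors).
n, d = 200, 5
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d)) + y[:, None] * 1.0

# Holdout (test-set) error estimation: design the classifier on one half
# of the sample and count errors on the other half.
half = n // 2
predict = nearest_mean_classifier(X[:half], y[:half])
holdout_error = np.mean(predict(X[half:]) != y[half:])
print(f"holdout error estimate: {holdout_error:.3f}")
```

With n = 200 and d = 5 this works well; the small-sample difficulty arises precisely when n is too small to sacrifice half the data for testing.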

A classifier defines a relation between the features and the label (target random variable) relative to the joint feature–label distribution. There are two possibilities regarding the definition of a classifier: either it is derived from the feature–label distribution and therefore is intrinsic to it, as in the case of an optimal classifier, or it is not derived directly from the feature–label distribution, in which case it is extrinsic to the distribution. In both cases, classification error rates are intrinsic because they are derived from the distribution, given the classifier. A classifier together with a stated error rate constitutes a pattern recognition model. If the classifier is intrinsic to the distribution, then there is no issue of error estimation because the feature–label distribution is known and the error rate can be derived. On the other hand, if the classifier is extrinsic to the distribution, then it is generally the case that the distribution is not known, the classifier is constructed from data via a classification rule, and the error rate of interest is estimated from the data via an error estimation rule. These rules together constitute a pattern recognition rule. (Feature selection, if present, is considered to be part of the classification rule, and thus of the pattern recognition rule.)

From a scientific perspective, that is, when a classifier is applied to phenomena, the validity of a pattern recognition model is characterized, as with any scientific model, by the degree of concordance between predictions based on the model and corresponding observations.1 The only prediction based on a pattern recognition model is the percentage of errors the classifier will make when it is applied. Hence, validity rests on the concordance between the model's error rate and the empirical error rate observed when the classifier is applied. Thus, when the distributions are not known and the model error must be estimated, model validity depends entirely on the accuracy of the error estimation rule, so that the salient epistemological problem of classification is error estimation.2

Good classification rules produce classifiers with small average error rates. But is a classification rule good if its error rate cannot be estimated accurately? This is not just an abstract exercise in epistemology; it has immediate practical consequences for applied science. For instance, the problem is so serious in genomic classification that the Director of the Center for Drug Evaluation and Research at the FDA has estimated that as much as 75% of published biomarker associations are not replicable. Absent an accurate error estimation rule, we lack knowledge of the error rate, and the classifier is useless, no matter how sophisticated the classification rule that produced it. Therefore, one should speak of the goodness of a pattern recognition rule, as defined above, rather than that of a classification rule in isolation.

A common scenario in the literature these days proceeds in the following manner: propose an algorithm for classifier design; apply the algorithm to a few small-sample data sets, with no information pertaining to the distributions from which the data sets have been sampled; and apply an error estimation scheme based on resampling the data, typically cross-validation. With regard to the third step, we are given no characterization of the accuracy of the error estimator and no reason why it should provide a reasonably good estimate. Most strikingly, as we show in this book, we can expect it to be inaccurate in small-sample cases. Nevertheless, the claim is made that the proposed algorithm has been “validated.” Very little is said about the accuracy of the error estimation step, except perhaps that cross-validation is close to being unbiased if not too many points are held out. But this kind of comment is misleading, since a small bias is of little benefit if the variance is large, which it usually is for small samples and large feature sets. In addition, the classical cross-validation unbiasedness theorem holds only if sampling is random over the mixture of the populations. In situations where this is not the case, for example, when the populations are sampled separately, bias is introduced, as shown in Chapter 5. These kinds of problems are especially detrimental in the current era of high-throughput measurement devices, for which it is now commonplace to be confronted with tens of thousands of features and very small sample sizes.
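The point that variance, not bias, dominates small-sample cross-validation is easy to observe empirically. The following sketch, entirely our own illustration (the nearest-mean classifier and Gaussian data model are assumptions, not the book's), draws repeated small samples from a known model, compares the leave-one-out estimate to the true error (approximated on a large independent sample) on each, and reports the bias and the spread of the deviation; in such settings the spread is typically much larger than the bias.

```python
import numpy as np

rng = np.random.default_rng(1)

def nearest_mean_predict(X_train, y_train, X):
    # nearest-mean classifier: label each point by the closer class mean
    m0 = X_train[y_train == 0].mean(axis=0)
    m1 = X_train[y_train == 1].mean(axis=0)
    return (np.linalg.norm(X - m1, axis=1) < np.linalg.norm(X - m0, axis=1)).astype(int)

def sample(n, d=2, shift=1.0):
    # two Gaussian classes, identity covariance, class-1 mean shifted
    while True:
        y = rng.integers(0, 2, size=n)
        if 2 <= y.sum() <= n - 2:  # keep >= 2 points per class, so LOO is well-defined
            return rng.normal(size=(n, d)) + y[:, None] * shift, y

X_big, y_big = sample(20000)  # large sample used to approximate the true error

deviations = []
for _ in range(200):
    X, y = sample(20)  # a small training sample
    true_err = np.mean(nearest_mean_predict(X, y, X_big) != y_big)
    loo = np.mean([  # leave-one-out: hold out each point in turn
        nearest_mean_predict(X[np.arange(20) != i], y[np.arange(20) != i],
                             X[i:i + 1])[0] != y[i]
        for i in range(20)
    ])
    deviations.append(loo - true_err)

bias = float(np.mean(deviations))
dev_std = float(np.std(deviations))
print(f"leave-one-out bias: {bias:+.3f}, deviation std: {dev_std:.3f}")
```

The near-zero bias is genuine, but a single leave-one-out estimate can easily miss the true error by a tenth or more, which is the issue the quoted "validation" claims ignore.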

The subject of error estimation has in fact a long history and has produced a large body of literature; four main review papers summarize the major advances in the field up to 2000 (Hand, 73; McLachlan, 109; Schiavo and Hand, 131; Toussaint, 148); recent advances in error estimation since 2000 include work on model selection (Bartlett et al., 10), bolstering (Braga-Neto and Dougherty, 15; Sima et al., 135), feature selection (Hanczar et al., 72; Sima et al., 134; Xiao et al., 161; Zhou and Mao, 166), confidence intervals (Kaariainen, 91; Kaariainen and Langford, 92; Xu et al., 162), model-based second-order properties (Zollanvari et al., 170, 171), and Bayesian error estimators (Dalton and Dougherty, 30,c). This book covers the classical studies as well as the recent developments. It discusses in detail nonparametric approaches, but gives special consideration, especially in the latter part of the book, to parametric, model-based approaches.

Pattern recognition plays a key role in many disciplines, including engineering, physics, statistics, computer science, social science, manufacturing, materials, medicine, biology, and more, so this book will be useful for researchers and practitioners in all these areas. It can serve as a text at the graduate level, as a supplement for general courses on pattern recognition and machine learning, or as a reference for researchers across all technical disciplines where classification plays a major role, which may in fact be all technical disciplines.

The book is organized into eight chapters. Chapters 1 and 2 provide the foundation for the rest of the book and must be read first. Chapters 3, 4, and 8 stand on their own and can be studied separately. Chapter 5 provides the foundation for Chapters 6 and 7, so these chapters should be read in this sequence. For example, chapter sequences 1-2-3-4, 1-2-5-6-7, and 1-2-8 are all possible ways of reading the book. Naturally, the book is best read from beginning to end. Short descriptions of each chapter are provided next.

Chapter 1. Classification. To make the book self-contained, the first chapter covers basic topics in classification required for the remainder of the text: classifiers, population-based and sample-based discriminants, and classification rules. It defines a few basic classification rules: LDA, QDA, discrete classification, nearest-neighbors, SVMs, neural networks, and classification trees, closing with a section on feature selection.

Chapter 2. Error Estimation. This chapter covers the basics of error estimation: definitions, performance metrics for estimation rules, test-set error estimation, and training-set error estimation. It also includes a discussion of pattern recognition models. The test-set error estimator is straightforward and well understood, with efficient distribution-free bounds on its performance. However, it requires large sample sizes, which are often unavailable in practice. The thrust of the book is therefore on training-set error estimation, which is necessary for small-sample classifier design. We cover in this chapter the following error estimation rules: resubstitution, cross-validation, bootstrap, optimal convex estimation, smoothed error estimation, and bolstered error estimation.
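To make three of the estimators listed above concrete, the following sketch computes resubstitution, leave-one-out cross-validation, and the .632 bootstrap for a nearest-mean classifier on a single small synthetic sample. All modeling choices here (the classifier, the data, the number of bootstrap resamples) are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def nearest_mean_predict(X_train, y_train, X):
    # nearest-mean classifier: label each point by the closer class mean
    m0 = X_train[y_train == 0].mean(axis=0)
    m1 = X_train[y_train == 1].mean(axis=0)
    return (np.linalg.norm(X - m1, axis=1) < np.linalg.norm(X - m0, axis=1)).astype(int)

# a single small synthetic sample from two shifted Gaussian classes
n, d = 30, 2
y = np.array([0, 1] * (n // 2))           # balanced labels
X = rng.normal(size=(n, d)) + y[:, None]  # class-1 mean shifted by (1, 1)

# resubstitution: error of the classifier on its own training data
resub = np.mean(nearest_mean_predict(X, y, X) != y)

# leave-one-out cross-validation: hold out each point in turn
loo = np.mean([nearest_mean_predict(np.delete(X, i, 0), np.delete(y, i),
                                    X[i:i + 1])[0] != y[i]
               for i in range(n)])

# zero bootstrap: average error on out-of-bag points over B resamples
B, oob_errors = 200, []
for _ in range(B):
    idx = rng.integers(0, n, size=n)       # bootstrap resample (with replacement)
    oob = np.setdiff1d(np.arange(n), idx)  # out-of-bag points
    if len(oob) == 0 or len(np.unique(y[idx])) < 2:
        continue                           # skip degenerate resamples
    oob_errors.append(np.mean(nearest_mean_predict(X[idx], y[idx], X[oob]) != y[oob]))
zero = np.mean(oob_errors)
b632 = 0.368 * resub + 0.632 * zero        # the .632 bootstrap estimator

print(f"resubstitution={resub:.3f}  leave-one-out={loo:.3f}  .632 bootstrap={b632:.3f}")
```

Resubstitution tends to be optimistic and leave-one-out nearly unbiased but highly variable; the .632 weighting is a convex combination designed to trade off the two. The book's Chapters 2 and 3 make these statements precise.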

Chapter 3. Performance Analysis. The main focus of the book is on performance characterization for error estimators, a topic whose coverage is inadequate in most pattern recognition texts. The fundamental entity in error analysis is the joint distribution between the true error and the error estimate. Of special importance is the regression between them. This chapter discusses the deviation distribution between the true and estimated error, with particular attention to the root-mean-square (RMS) error as the key metric of error estimation performance. The chapter covers various other issues: the impact of error estimation on feature selection, bias that can arise when considering multiple data sets or multiple rules, and measurement of performance reproducibility.
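For reference, the RMS criterion highlighted in this chapter admits the standard decomposition into bias and deviation variance; writing here (in generic notation of our own, not necessarily the book's) ε for the true error and ε̂ for its estimate:

```latex
\mathrm{RMS}(\hat\varepsilon)
  = \sqrt{E\!\left[(\hat\varepsilon - \varepsilon)^2\right]}
  = \sqrt{\mathrm{Bias}(\hat\varepsilon)^2 + \mathrm{Var}_{\mathrm{dev}}(\hat\varepsilon)},
\qquad
\mathrm{Bias}(\hat\varepsilon) = E[\hat\varepsilon - \varepsilon],
\quad
\mathrm{Var}_{\mathrm{dev}}(\hat\varepsilon) = \mathrm{Var}(\hat\varepsilon - \varepsilon).
```

The decomposition makes explicit why near-unbiasedness alone is a weak guarantee: the RMS can remain large through the deviation-variance term.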

Chapter 4. Error Estimation for Discrete Classification. For discrete classification, the moments of resubstitution and the basic resampling error estimators, cross-validation and bootstrap, can be represented in finite-sample closed forms, as presented in the chapter. Using these representations, formulations of various performance measures are obtained, including the bias, variance, deviation variance, and the RMS and correlation between the true and estimated errors. Next, complete enumeration schemes to represent the joint and conditional distributions between the true and estimated errors are discussed. The chapter concludes by surveying large-sample performance bounds for various error estimators.

Chapter 5. Distribution Theory. This chapter provides general distributional results for discriminant-based classifiers. Here we also introduce the issue of separate vs. mixture sampling. For the true error, distributional results are in terms of conditional probability statements involving the discriminant. Passing on to resubstitution and the resampling-based estimators, the error estimators can be expressed in terms of error-counting expressions involving the discriminant. With these in hand, one can then go on to evaluate expected error rates and higher order moments of the error rates, all of which take combinatorial forms. Second-order moments are included, thus allowing computation of the RMS. The chapter concludes with analytical expressions for the sampling distribution for resubstitution and leave-one-out cross-validation.

Chapter 6. Gaussian Distribution Theory: Univariate Case. Historically, much effort was put into characterizing expected error rates for linear discrimination in the Gaussian model for the true error, resubstitution, and leave-one-out. This chapter considers these, along with more recent work on the bootstrap. It then goes on to provide exact expressions for the first- and higher-order moments of various error rates, thereby providing an exact analytic expression for the RMS. The chapter provides exact expressions for the marginal sampling distributions of the resubstitution and leave-one-out error estimators. It closes with comments on the joint distribution of true and estimated error rates.

Chapter 7. Gaussian Distribution Theory: Multivariate Case. This chapter begins with statistical representations for multivariate discrimination, including Bowker's classical representation for Anderson's LDA discriminant, Moran's representations for John's LDA discriminant, and McFarland and Richards' QDA discriminant representation. A discussion follows on computational techniques for obtaining the error rate moments based on these representations. Large-sample methods are then treated in the context of double-asymptotic approximations, where the sample size and feature dimension both approach infinity at comparable rates. This leads to asymptotically exact, finite-sample approximations to the first and second moments of various error rates, which in turn lead to double-asymptotic approximations for the RMS.

Chapter 8. Bayesian MMSE Error Estimation. In small-sample situations it is virtually impossible to make any useful statements concerning error-estimation accuracy unless distributional assumptions are made. In previous chapters, these distributional assumptions were made from a frequentist point of view, whereas this chapter considers a Bayesian approach to the problem. Partial assumptions lead to an uncertainty class of feature–label distributions and, assuming a prior distribution over the uncertainty class, one can obtain a minimum-mean-square-error (MMSE) estimate of the error rate. In contrast to the classical case, a sample-conditioned MSE of the error estimate can be defined and computed. In addition to the general MMSE theory, this chapter provides specific representations of the MMSE error estimator for discrete classification and linear classification in the Gaussian model. It also discusses consistency of the estimate and calibration of non-Bayesian error estimators relative to a prior distribution governing model uncertainty.
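In broad strokes (our paraphrase and our notation, not the book's), the Bayesian MMSE error estimator is the expectation of the true error conditioned on the observed sample, taken over the posterior on the uncertainty class: with θ parameterizing the feature–label distribution, S_n the sample, and π(θ | S_n) the posterior,

```latex
\hat{\varepsilon}(S_n)
  = E_{\theta}\!\left[\varepsilon_n(\theta) \,\middle|\, S_n\right]
  = \int \varepsilon_n(\theta)\, \pi(\theta \mid S_n)\, d\theta .
```

Because the estimate is conditioned on the observed sample, the sample-conditioned MSE mentioned above is likewise a well-defined quantity, which has no analogue in the classical frequentist setting.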

We close by noting that for researchers in applied mathematics, statistics, engineering, and computer science, error estimation for pattern recognition offers a wide open field with fundamental and practically important problems around every turn. Despite this fact, error estimation has been a neglected subject over the past few decades, even though it spans a huge application domain and its understanding is critical for scientific epistemology. This text, which is believed to be the first book devoted entirely to the topic of error estimation for pattern recognition, is an attempt to remedy that neglect.

ULISSES M. BRAGA-NETO
EDWARD R. DOUGHERTY

College Station, Texas
April 2015

NOTES

ACKNOWLEDGMENTS

Much of the material in this book has been developed over a dozen years of our joint work on error estimation for pattern recognition. In this regard, we would like to acknowledge the students whose excellent work has contributed to the content: Chao Sima, Jianping Hua, Thang Vu, Lori Dalton, Mohammadmahdi Yousefi, Mohammad Esfahani, Amin Zollanvari, and Blaise Hanczar. We also would like to thank Don Geman for many hours of interesting conversation about error estimation. We appreciate the financial support we have received from the Translational Genomics Research Institute, the National Cancer Institute, and the National Science Foundation. Last, and certainly not least, we would like to acknowledge the encouragement and support provided by our families in this time-consuming endeavor.

U.M.B.N.
E.R.D.

LIST OF SYMBOLS

X = (X1, …, Xd)

feature vector

Π0, Π1

populations or classes

Y

population or class label

FXY

feature–label distribution

c0, c1 = 1 − c0

population prior probabilities

p0(x), p1(x)

class-conditional densities

p(x) = c0p0(x) + c1p1(x)

density across mixture of populations

η0(x), η1(x) = 1 − η0(x)

population posterior probabilities

ψ, ϵ

classifier and classifier true error rate

ψbay, ϵbay

Bayes classifier and Bayes error

ϵ0, ϵ1

population-specific true error rates

D

population-based discriminant

X ∼ Z

X has the same distribution as Z

population mean vectors

Σ0, Σ1

population covariance matrices

Σ = Σ0 = Σ1

population common covariance matrix

δ

Mahalanobis distance

Φ(x)

cdf of N(0, 1) Gaussian random variable

n

sample size

n0, n1 = n − n0

population-specific sample sizes

S

sample data

sample data with class labels flipped

Ψ

classification rule

Sn, Ψn, ψn, ϵn

subscript n indicates mixture sampling design

indicates separate sampling design

set of classification rules

VC dimension and shatter coefficients

W

sample-based discriminant

population-specific sample mean vectors

population-specific sample covariance matrices

pooled sample covariance matrix

Anderson's and John's LDA discriminants

pi, qi, Ui, Vi

population-specific bin probabilities and counts

ϒ

feature selection rule

Ξ

error estimation rule

error estimator

population-specific estimated error rates

error estimator bias

error estimator deviation variance

error estimator root mean square error

error estimator internal variance

error estimator correlation coefficient

classical resubstitution estimator

resubstitution estimator with known c0, c1

classical leave-one-out estimator

leave-one-out estimator with known c0, c1

classical k-fold cross-validation estimator

leave-(k0, k1)-out cross-validation estimator with known c0, c1

C, C0, C1

multinomial (bootstrap) vectors

bootstrap samples

F*XY

bootstrap empirical distribution

classical zero bootstrap estimator

classical complete zero bootstrap estimator

classical .632 bootstrap estimator

complete zero bootstrap estimator with known c0, c1

bolstered resubstitution estimator

bolstered leave-one-out estimator

Aest

estimated comparative advantage

Cbias

comparative bias

Rn(ρ, τ)

reproducibility index