Cover Page

Ensemble Classification Methods with Applications in R

Edited by

Esteban Alfaro, Matías Gámez and Noelia García

University of Castilla-La Mancha, Spain

Wiley Logo

List of Contributors

Esteban Alfaro, Economics and Business Faculty, Institute for Regional Development, University of Castilla‐La Mancha.

Sanyogita Andriyas, Department of Civil & Environmental Engineering, Utah State University, Logan, Utah, USA.

Eva Cernadas, Centro Singular de Investigación en Tecnoloxías da Información (CiTIUS), University of Santiago de Compostela, Santiago de Compostela, Spain.

Mariola Chrzanowska, Faculty of Applied Informatics and Mathematics, Department of Econometrics and Statistics, Warsaw University of Life Sciences (SGGW), Warsaw, Poland.

Davy Cielen, Maastricht School of Management, Maastricht, the Netherlands.

Kristof Coussement, IESEG Center for Marketing Analytics (ICMA), IESEG School of Management, Université Catholique de Lille, Lille, France.

Koen W. De Bock, Audencia Business School, Nantes, France.

Manuel Fernández‐Delgado, Centro Singular de Investigación en Tecnoloxías da Información (CiTIUS), University of Santiago de Compostela, Santiago de Compostela, Spain.

Matías Gámez, Institute for Regional Development, University of Castilla‐La Mancha.

Noelia García, Economics and Business Faculty, University of Castilla‐La Mancha.

Mariusz Kubus, Department of Mathematics and Computer Science Applications, Opole University of Technology, Poland.

Mac McKee, Utah Water Research Laboratory and Department of Civil & Environmental Engineering, Utah State University, Logan, Utah, USA.

María Pérez‐Ortiz, Department of Quantitative Methods, University of Loyola Andalucía, Córdoba, Spain.

Wojciech Rejchel, Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Torun, Poland; Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warsaw, Poland.

Dorota Witkowska, Department of Finance and Strategic Management, University of Lodz, Lodz, Poland.

List of Tables

    1. Table 1.1 Comparison of real error rate estimation methods.
    1. Table 2.1 Error decomposition of a classifier (Tibshirani, 1996a).
    1. Table 3.1 Example of the weight updating process in AdaBoost.
    2. Table 3.2 Example of the weight updating process in AdaBoost.M1.
    1. Table 5.1 Description of variables (some of these ratios are explained in White et al. (2003)).
    2. Table 5.2 Results obtained from descriptive analysis and ANOVA (SW test = Shapiro‐Wilk test; KS test = Kolmogorov‐Smirnov test).
    3. Table 5.3 Correlation matrix.
    4. Table 5.4 Unstandardized coefficients of the canonical discriminant function.
    5. Table 5.5 Confusion matrix and errors with LDA.
    6. Table 5.6 Confusion matrix and errors with an artificial neural network.
    7. Table 5.7 Sensitivity analysis.
    8. Table 5.8 Confusion matrix and errors with AdaBoost.
    9. Table 5.9 Comparison of results with other methods.
    10. Table 5.10 Confusion matrix and errors with the pruned tree.
    11. Table 5.11 Confusion matrix and errors with the AdaBoost.M1 model.
    1. Table 6.1 Collection of 121 data sets from the UCI database and our real‐world problems. ac‐inflam, acute inflammation; bc, breast cancer; congress‐voting, congressional voting; ctg, cardiotocography; conn‐bench‐sonar, connectionist benchmark sonar mines rocks; conn‐bench‐vowel, connectionist benchmark vowel deterding; pb, Pittsburg bridges; st, statlog; vc, vertebral column.
    2. Table 6.2 Friedman ranking, average accuracy and Cohen κ (both in %) for the 30 best classifiers, ordered by increasing Friedman ranking. BG, bagging; MAB, MultiBoostAB; RC, RandomCommittee.
    3. Table 6.3 Classification results: accuracy, Cohen κ (both in %), mean absolute error (MAE), Kendall τ, and Spearman ρ for species MC and TL with three stages.
    4. Table 6.4 Classification results for the species RH with six stages using the LOIO and MIX methodologies.
    5. Table 6.5 Confusion matrices and sensitivities/positive predictivities for each stage (in %) achieved by SVMOD and GSVM for species RH and LOIO experiments.
    1. Table 7.1 Errors of estimators.
    1. Table 8.1 Predictor variables, the represented factors as seen by the farmer, and the target variable used for trees analysis.
    2. Table 8.2 The cost‐complexity parameter (CP), relative error, cross‐validation error (xerror), and cross‐validation standard deviation (xstd) for trees with nsplit from 0 to 8.
    3. Table 8.3 Accuracy estimates on test data for CART models. Resub, resubstitution accuracy estimate; Xval, 10‐fold cross‐validation accuracy estimate. (a) 1‐day, (b) 4‐day, (c) all days models.
    4. Table 8.4 Important variables for irrigating different crops according to CART.
    1. Table 9.1 Classification error (in %) estimated for test samples.
    2. Table 9.2 The comparison of run time (in seconds) for the largest data sets from Table 9.1.
    3. Table 9.3 Standard deviations and test for variances for 30 estimations of the classification error.
    4. Table 9.4 Numbers of irrelevant variables introduced to the classifiers.
    5. Table 9.5 Classification error (in %) estimated in test samples for the Pima data set with irrelevant variables added from various distributions.
    1. Table 10.1 Borrowers according to age (images).
    2. Table 10.2 Borrowers according to place credit granted (images).
    3. Table 10.3 Borrowers according to share of the loan already repaid (images).
    4. Table 10.4 Borrowers according to the value of the credit (images).
    5. Table 10.5 Borrowers according to period of loan repayment (images).
    6. Table 10.6 Structure of samples used for further experiments. Note that images denotes the borrower who paid back the loan in time, and images otherwise.
    7. Table 10.7 Results of classification applying boosting and bagging models: the testing set.
    8. Table 10.8 Comparison of accuracy measures for the training and testing sets.
    9. Table 10.9 Comparison of accuracy measures for the training samples.
    10. Table 10.10 Comparison of accuracy measures for the testing samples.
    11. Table 10.11 Comparison of synthetic measures.
    1. Table 11.1 Average rank difference (CC‐BA) between GAM ensemble and benchmark algorithms based upon De Bock et al. (2010) (*images, **images).
    2. Table 11.2 Summary of the average performance measures over the images‐fold cross‐validation based on Coussement and De Bock (2013). Note: In any rows, performance measures that share a common subscript are not significantly different at images.
    3. Table 11.3 Average algorithm rankings and post‐hoc test results (Holm's procedure) based on De Bock and Van den Poel (2011) (*images, **images).
    4. Table 11.4 The 10 most important features with feature importance scores based on AUC and TDL based on De Bock and Van den Poel (2011).

List of Figures

    1. Figure 1.1 Binary classification tree.
    2. Figure 1.2 Evolution of error rates depending on the size of the tree.
    3. Figure 1.3 Evolution of cross‐validation error based on the tree size and the cost‐complexity measure.
    1. Figure 2.1 Probability of error depending on the number of classifiers in the ensemble.
    2. Figure 2.2 Probability of error of the ensemble depending on the individual classifier accuracy.
    1. Figure 4.1 Cross‐validation error versus tree complexity for the iris example.
    2. Figure 4.2 Individual tree for the iris example.
    3. Figure 4.3 Variable relative importance in bagging for the iris example.
    4. Figure 4.4 Variable relative importance in boosting for the iris example.
    5. Figure 4.5 Margins for bagging in the iris example.
    6. Figure 4.6 Margins for boosting in the iris example.
    7. Figure 4.7 Error evolution in bagging for the iris example.
    8. Figure 4.8 Error evolution in boosting for the iris example.
    9. Figure 4.9 Overfitted classification tree.
    10. Figure 4.10 Cross‐validation error versus tree complexity.
    11. Figure 4.11 Pruned tree.
    12. Figure 4.12 Variable relative importance in bagging.
    13. Figure 4.13 Error evolution in bagging.
    14. Figure 4.14 Variable relative importance in boosting.
    15. Figure 4.15 Error evolution in boosting.
    16. Figure 4.16 Variable relative importance in random forest.
    17. Figure 4.17 OOB error evolution in random forest.
    1. Figure 5.1 Variable relative importance in AdaBoost.
    2. Figure 5.2 Margin cumulative distribution in AdaBoost.
    3. Figure 5.3 Structure of the pruned tree.
    4. Figure 5.4 Variable relative importance in AdaBoost.M1 (three classes).
    5. Figure 5.5 Evolution of the test error in AdaBoost.M1 (three classes).
    1. Figure 6.1 Histological images of fish species Merluccius merluccius, with cell outlines manually annotated by experts. The continuous (respectively dashed) lines are cells with (resp. without) nucleus. The images contain cells in the different states of development (hydrated, cortical alveoli, vitellogenic, and atretic).
    2. Figure 6.2 Average accuracy (in %, left panel) and Friedman rank (right panel) over all the feature vectors for the detection of the nucleus (upper panels) and stage classification (lower panels) of fish ovary cells.
    3. Figure 6.3 Maximum accuracies (in %), in decreasing order, achieved by the different classifiers for the detection of the nucleus (left panel) and stage classification (right panel) of fish ovary cells.
    4. Figure 6.4 Average accuracy (left panel, in %) and Friedman ranking (right panel, decreasing with performance) of each classifier.
    5. Figure 6.5 Accuracies achieved by AdaBoost.M1, SVM, and random forest for each data set.
    6. Figure 6.6 Times achieved by the faster classifiers (DKP, SVM, and ELM) for each data set, ordered by increasing size of data set.
    7. Figure 6.7 Friedman rank (upper panel, increasing order) and average accuracies (lower panel, decreasing order) for the 25 best classifiers.
    8. Figure 6.8 Friedman rank interval for the classifiers of each family (upper panel) and minimum rank (by ascending order) for each family (lower panel).
    9. Figure 6.9 Examples of histological images of fish species Reinhardtius hippoglossoides, including oocytes with the six different development stages (PG, CA, VIT1, VIT2, VIT3, and VIT4).
    1. Figure 8.1 A tree structure.
    2. Figure 8.2 Pairs plot of some weather variables used in the tree analysis, with the intention of finding groups of similar features.
    3. Figure 8.3 CART structures for alfalfa decisions.
    4. Figure 8.4 CART structures for barley irrigation decisions.
    5. Figure 8.5 CART structures for corn irrigation decisions.
    1. Figure 10.1 Ranking of predictor importance for the boosting model evaluated for sample S1A.
    2. Figure 10.2 Ranking of predictor importance for the bagging model evaluated for sample S1A.
    1. Figure 11.1 Bootstrap confidence intervals and average trends for a selection of predictive features (from De Bock and Van den Poel (2011)).

Preface

This book introduces the reader to ensemble classification methods by describing the most commonly used techniques. The goal is not a complete analysis of every technique and its applications, nor an exhaustive tour through all the topics that arise in this continuously expanding field. Rather, the aim is to show in an intuitive way how ensemble classification arose as an extension of individual classifiers, to describe its basic characteristics, and to point out the kinds of problems that can emerge from its use. The book is therefore intended for anyone starting out in this field, especially students, teachers, researchers, and practitioners dealing with statistical classification.

To achieve these goals, the book is structured in two parts containing a total of 11 chapters. The first part, which is more theoretical, comprises the first four chapters, including the introduction. The second part, from the fifth chapter to the end, is much more practical in nature, illustrating how the previously studied techniques are applied with examples from business failure prediction, zoology, and ecology, among others.

After a brief introduction establishing the fundamental concepts of statistical classification with decision trees, the second chapter decomposes the generalization error into three terms (the Bayes risk, the bias, and the variance). It also studies the instability of classifiers, that is, how much a classifier changes in response to small changes in the training set. The three reasons proposed by Dietterich to explain the superiority of ensemble classifiers over single ones (statistical, computational, and representational) are given. Finally, the Bayesian perspective is mentioned.
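In schematic form (our notation, for orientation only; the precise definitions of the bias and variance terms under 0–1 loss vary between authors, and the chapter follows the decomposition attributed to Tibshirani (1996a) in Table 2.1), the decomposition reads

    \mathrm{Err}(C) \;=\; \underbrace{\mathrm{Err}(C^{*})}_{\text{Bayes risk}} \;+\; \mathrm{Bias}(C) \;+\; \mathrm{Var}(C),

where C* denotes the Bayes classifier, the bias measures the systematic deviation of the classifier C from C*, and the variance measures its sensitivity to the particular training set.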

In the third chapter, several taxonomies of ensemble methods are enumerated, focusing on the one that distinguishes between generative and non‐generative methods. After that, the bagging method is studied: it trains a set of base classifiers on bootstrap samples of the original training set and then combines them by majority vote. In addition, the boosting method is analysed, highlighting its most commonly used algorithm, AdaBoost. This algorithm repeatedly applies the base classification system to the training set, focusing in each iteration on the most difficult examples, and finally combines the resulting classifiers through a weighted majority vote. To end the chapter, the random forest method is briefly described: it grows a set of trees, introducing a degree of randomness into their construction to ensure some diversity in the final ensemble.
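To fix ideas, the bagging mechanism can be hand‐rolled in a few lines of R with the rpart package (a minimal sketch for illustration only; the number of trees B and the iris data are arbitrary choices, not the book's worked examples):

    # Hand-rolled bagging: grow B trees on bootstrap samples of the data
    # and combine their predictions by majority vote (illustration only).
    library(rpart)
    set.seed(1)
    B <- 25
    trees <- lapply(seq_len(B), function(b) {
      boot <- iris[sample(nrow(iris), replace = TRUE), ]   # bootstrap sample
      rpart(Species ~ ., data = boot)                      # base classification tree
    })
    # Majority vote over the B trees for every observation
    votes    <- sapply(trees, function(t) as.character(predict(t, iris, type = "class")))
    majority <- apply(votes, 1, function(v) names(which.max(table(v))))
    mean(majority == iris$Species)   # resubstitution accuracy of the ensemble

Boosting differs from this sketch in that the weighting of the training examples and the vote weights change from one iteration to the next, while random forests additionally restrict each split to a random subset of the predictors.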

The last chapter of the first part shows, with simple applications, how individual and ensemble classification trees are used in practice through the rpart, adabag, and randomForest R packages. Moreover, the improvement achieved over individual classification techniques is highlighted.
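As a preview of that chapter, the corresponding package calls look roughly as follows (a minimal sketch; the train/test split and the values of mfinal and ntree are illustrative, not the settings used in the book):

    library(rpart)         # individual classification trees
    library(adabag)        # bagging() and boosting() (AdaBoost.M1)
    library(randomForest)  # random forests

    set.seed(1)
    train <- sample(nrow(iris), 100)

    tree <- rpart(Species ~ ., data = iris[train, ])                  # single tree
    bag  <- bagging(Species ~ ., data = iris[train, ], mfinal = 50)   # bagging
    boo  <- boosting(Species ~ ., data = iris[train, ], mfinal = 50)  # AdaBoost.M1
    rf   <- randomForest(Species ~ ., data = iris[train, ], ntree = 500)

    # Test-set error of each model
    mean(predict(tree, iris[-train, ], type = "class") != iris$Species[-train])
    predict(bag, newdata = iris[-train, ])$error
    predict(boo, newdata = iris[-train, ])$error
    mean(predict(rf, iris[-train, ]) != iris$Species[-train])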

The second part begins with a chapter dealing with business failure prediction. Specifically, it compares the prediction accuracy of ensemble trees and neural networks for a set of European firms, considering the usual predictor variables, such as financial ratios, as well as qualitative variables, such as firm size, activity, and legal structure. It shows that ensemble trees decrease the generalization error by about 30% with respect to the error produced by a neural network.

The sixth chapter describes the experience of M. Fernández‐Delgado, E. Cernadas, and M. Pérez‐Ortiz in using ensemble methods to classify texture feature patterns in histological images of fish gonad cells. Their results were also good in comparison with ordinal classifiers for the stages of fish oocytes, whose development follows a natural time ordering.

In the seventh chapter W. Rejchel considers the ranking problem, which is popular in the machine learning community. The goal is to predict the ordering between objects on the basis of their observed features. The chapter focuses on ranking estimators obtained by minimizing an empirical risk with a convex loss function, for instance boosting algorithms. Generalization bounds for the excess risk of ranking estimators are constructed, and these bounds decrease faster than a given threshold rate. In addition, the quality of the procedures on simulated data sets is investigated.
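For orientation, one common way of writing the empirical ranking risk that such estimators minimize (our notation, not necessarily the chapter's) is the pairwise form

    \hat{Q}_n(f) \;=\; \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} \phi\bigl(-\operatorname{sgn}(Y_i - Y_j)\,[\,f(X_i) - f(X_j)\,]\bigr),

where \phi is a convex surrogate of the 0–1 loss (for instance the exponential loss used by boosting) and the ranking estimator is a minimizer of \hat{Q}_n over a given class of functions.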

In the eighth chapter S. Andriyas and M. McKee implement ensemble classification trees to analyze farmers' irrigation decisions and, from these, forecast future decisions. Readily available data on biophysical conditions in the fields and in the irrigation delivery system during the growing season can be used to anticipate irrigation water orders in the absence of any predictive socio‐economic information that could provide clues to future irrigation decisions. This, in turn, can be useful in making short‐term demand forecasts.

The ninth chapter, by M. Kubus, focuses on two properties of a boosted set of rules: he discusses stability and robustness to irrelevant variables, which can deteriorate the predictive ability of the model. He also compares the generalization errors of SLIPPER and AdaBoost.M1 in computational experiments using benchmark data and artificially generated irrelevant variables from various distributions.
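The flavour of such an experiment can be sketched in R as follows (a hedged illustration only, using AdaBoost.M1 from the adabag package and the Pima data from mlbench; the noise variables and their distributions are arbitrary examples, and SLIPPER itself is not fitted here):

    # Append irrelevant variables drawn from several distributions and
    # compare the test error of AdaBoost.M1 with and without them.
    library(mlbench)   # provides the PimaIndiansDiabetes data
    library(adabag)    # boosting() = AdaBoost.M1
    data(PimaIndiansDiabetes)

    set.seed(1)
    n <- nrow(PimaIndiansDiabetes)
    noisy <- cbind(PimaIndiansDiabetes,
                   junk_norm = rnorm(n),   # irrelevant N(0, 1) variable
                   junk_unif = runif(n),   # irrelevant U(0, 1) variable
                   junk_exp  = rexp(n))    # irrelevant Exp(1) variable

    train <- sample(n, round(0.7 * n))
    fit_clean <- boosting(diabetes ~ ., data = PimaIndiansDiabetes[train, ], mfinal = 50)
    fit_noisy <- boosting(diabetes ~ ., data = noisy[train, ], mfinal = 50)

    predict(fit_clean, PimaIndiansDiabetes[-train, ])$error
    predict(fit_noisy, noisy[-train, ])$error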

The tenth chapter shows how M. Chrzanowska, E. Alfaro, and D. Witkowska apply individual and ensemble trees to credit scoring, a crucial problem for a bank since it has a critical influence on its financial results. To assess credit risk (or a client's creditworthiness), various statistical tools may be used, including classification methods.

The aim of the last chapter, by K. W. De Bock, K. Coussement, and D. Cielen, is to introduce generalized additive models (GAMs) and GAM‐based ensembles, to give an overview of experiments conducted to evaluate and benchmark their performance, and to provide insights into these novel algorithms using real‐life data sets from various application domains.
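The underlying idea of a GAM‐based ensemble can be illustrated in a few lines of R by bagging generalized additive models fitted with mgcv (a hand‐rolled sketch on simulated data; this is not the authors' algorithm or the implementation evaluated in the chapter):

    # Bag B generalized additive models fitted on bootstrap samples and
    # average their predicted probabilities (illustration only).
    library(mgcv)

    set.seed(1)
    dat   <- data.frame(x1 = runif(300), x2 = runif(300))
    dat$y <- rbinom(300, 1, plogis(sin(6 * dat$x1) + 2 * (dat$x2 - 0.5)))

    B <- 20
    fits <- lapply(seq_len(B), function(b) {
      boot <- dat[sample(nrow(dat), replace = TRUE), ]         # bootstrap sample
      gam(y ~ s(x1) + s(x2), family = binomial, data = boot)   # member GAM with smooth terms
    })

    # Ensemble probability = average of the member GAMs' probabilities
    p_hat <- rowMeans(sapply(fits, predict, newdata = dat, type = "response"))
    mean((p_hat > 0.5) == dat$y)   # resubstitution accuracy of the ensemble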

Thanks are due to all our collaborators and colleagues, especially the Economics and Business Faculty of Albacete, the Public Economy, Statistics and Economic Policy Department, the Quantitative Methods and Socio‐economic Development Group (MECYDES) at the Institute for Regional Development (IDR) and the University of Castilla‐La Mancha (UCLM). At Wiley, we would like to thank Alison Oliver and Jemima Kingsly for their help, and two anonymous reviewers for their comments.

Finally, we thank our families for their understanding and help in every moment: Nieves, Emilio, Francisco, María, Esteban, and Pilar; Matías, Clara, David, and Enrique.