Cover Page

Ensemble Classification Methods with Applications in R

Edited by

Esteban Alfaro, Matías Gámez and Noelia García

University of Castilla-La Mancha, Spain

Wiley Logo

List of Contributors

Esteban Alfaro, Economics and Business Faculty, Institute for Regional Development, University of Castilla‐La Mancha.

Sanyogita Andriyas, Department of Civil & Environmental Engineering, Utah State University, Logan, Utah, USA.

Eva Cernadas, Centro Singular de Investigación en Tecnoloxías da Información (CiTIUS), University of Santiago de Compostela, Santiago de Compostela, Spain.

Mariola Chrzanowska, Faculty of Applied Informatics and Mathematics, Department of Econometrics and Statistics, Warsaw University of Life Sciences (SGGW), Warsaw, Poland.

Davy Cielen, Maastricht School of Management, Maastricht, the Netherlands.

Kristof Coussement, IESEG Center for Marketing Analytics (ICMA), IESEG School of Management, Université Catholique de Lille, Lille, France.

Koen W. De Bock, Audencia Business School, Nantes, France.

Manuel Fernández‐Delgado, Centro Singular de Investigación en Tecnoloxías da Información (CiTIUS), University of Santiago de Compostela, Santiago de Compostela, Spain.

Matías Gámez, Institute for Regional Development, University of Castilla‐La Mancha.

Noelia García, Economics and Business Faculty, University of Castilla‐La Mancha.

Mariusz Kubus, Department of Mathematics and Computer Science Applications, Opole University of Technology, Poland.

Mac McKee, Utah Water Research Laboratory and Department of Civil & Environmental Engineering, Utah State University, Logan, Utah, USA.

María Pérez‐Ortiz, Department of Quantitative Methods, University of Loyola Andalucía, Córdoba, Spain.

Wojciech Rejchel, Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Torun, Poland; Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warsaw, Poland.

Dorota Witkowska, Department of Finance and Strategic Management, University of Lodz, Lodz, Poland.

List of Tables

    1. Table 1.1 Comparison of real error rate estimation methods.
    1. Table 2.1 Error decomposition of a classifier (Tibshirani, 1996a).
    1. Table 3.1 Example of the weight updating process in AdaBoost.
    2. Table 3.2 Example of the weight updating process in AdaBoost.M1.
    1. Table 5.1 Description of variables (some of these ratios are explained in White et al. (2003)).
    2. Table 5.2 Results obtained from descriptive analysis and ANOVA (SW test = Shapiro‐Wilk test; KS test = Kolmogorov‐Smirnov test).
    3. Table 5.3 Correlation matrix.
    4. Table 5.4 Unstandardized coefficients of the canonical discriminant function.
    5. Table 5.5 Confusion matrix and errors with LDA.
    6. Table 5.6 Confusion matrix and errors with an artificial neural network.
    7. Table 5.7 Sensitivity analysis.
    8. Table 5.8 Confusion matrix and errors with AdaBoost.
    9. Table 5.9 Comparison of results with other methods.
    10. Table 5.10 Confusion matrix and errors with the pruned tree.
    11. Table 5.11 Confusion matrix and errors with the AdaBoost.M1 model.
    1. Table 6.1 Collection of 121 data sets from the UCI database and our real‐world problems. ac‐inflam, acute inflammation; bc, breast cancer; congress‐voting, congressional voting; ctg, cardiotocography; conn‐bench‐sonar, connectionist benchmark sonar mines rocks; conn‐bench‐vowel, connectionist benchmark vowel deterding; pb, Pittsburg bridges; st, statlog; vc, vertebral column.
    2. Table 6.2 Friedman ranking, average accuracy and Cohen κ (both in %) for the 30 best classifiers, ordered by increasing Friedman ranking. BG, bagging; MAB, MultiBoostAB; RC, RandomCommittee.
    3. Table 6.3 Classification results: accuracy, Cohen κ (both in %), mean absolute error (MAE), Kendall τ, and Spearman ρ for species MC and TL with three stages.
    4. Table 6.4 Classification results for the species RH with six stages using the LOIO and MIX methodologies.
    5. Table 6.5 Confusion matrices and sensitivities/positive predictivities for each stage (in %) achieved by SVMOD and GSVM for species RH and LOIO experiments.
    1. Table 7.1 Errors of estimators.
    1. Table 8.1 Predictor variables, the represented factors as seen by the farmer, and the target variable used for trees analysis.
    2. Table 8.2 The cost‐complexity parameter (CP), relative error, cross‐validation error (xerror), and cross‐validation standard deviation (xstd) for trees with nsplit from 0 to 8.
    3. Table 8.3 Accuracy estimates on test data for CART models. Resub, resubstitution accuracy estimate; Xval, 10‐fold cross‐validation accuracy estimate. (a) 1‐day, (b) 4‐day, (c) all days models.
    4. Table 8.4 Important variables for irrigating different crops according to CART.
    1. Table 9.1 Classification error (in %) estimated for test samples.
    2. Table 9.2 The comparison of run time (in seconds) for the largest data sets from Table 9.1.
    3. Table 9.3 Standard deviations and test for variances for 30 estimations of the classification error.
    4. Table 9.4 Numbers of irrelevant variables introduced to the classifiers.
    5. Table 9.5 Classification error (in %) estimated in test samples for the Pima data set with irrelevant variables added from various distributions.
    1. Table 10.1 Borrowers according to age (images).
    2. Table 10.2 Borrowers according to place credit granted (images).
    3. Table 10.3 Borrowers according to share of the loan already repaid (images).
    4. Table 10.4 Borrowers according to the value of the credit (images).
    5. Table 10.5 Borrowers according to period of loan repayment (images).
    6. Table 10.6 Structure of samples used for further experiments. Note that images denotes the borrower who paid back the loan in time, and images otherwise.
    7. Table 10.7 Results of classification applying boosting and bagging models: the testing set.
    8. Table 10.8 Comparison of accuracy measures for the training and testing sets.
    9. Table 10.9 Comparison of accuracy measures for the training samples.
    10. Table 10.10 Comparison of accuracy measures for the testing samples.
    11. Table 10.11 Comparison of synthetic measures.
    1. Table 11.1 Average rank difference (CC‐BA) between GAM ensemble and benchmark algorithms based upon De Bock et al. (2010) (*images, **images).
    2. Table 11.2 Summary of the average performance measures over the images‐fold cross‐validation based on Coussement and De Bock (2013). Note: In any rows, performance measures that share a common subscript are not significantly different at images.
    3. Table 11.3 Average algorithm rankings and post‐hoc test results (Holm's procedure) based on De Bock and Van den Poel (2011) (*images, **images).
    4. Table 11.4 The 10 most important features with feature importance scores based on AUC and TDL based on De Bock and Van den Poel (2011).

List of Figures

    1. Figure 1.1 Binary classification tree.
    2. Figure 1.2 Evolution of error rates depending on the size of the tree.
    3. Figure 1.3 Evolution of cross‐validation error based on the tree size and the cost‐complexity measure.
    1. Figure 2.1 Probability of error depending on the number of classifiers in the ensemble.
    2. Figure 2.2 Probability of error of the ensemble depending on the individual classifier accuracy.
    1. Figure 4.1 Cross‐validation error versus tree complexity for the iris example.
    2. Figure 4.2 Individual tree for the iris example.
    3. Figure 4.3 Variable relative importance in bagging for the iris example.
    4. Figure 4.4 Variable relative importance in boosting for the iris example.
    5. Figure 4.5 Margins for bagging in the iris example.
    6. Figure 4.6 Margins for boosting in the iris example.
    7. Figure 4.7 Error evolution in bagging for the iris example.
    8. Figure 4.8 Error evolution in boosting for the iris example.
    9. Figure 4.9 Overfitted classification tree.
    10. Figure 4.10 Cross‐validation error versus tree complexity.
    11. Figure 4.11 Pruned tree.
    12. Figure 4.12 Variable relative importance in bagging.
    13. Figure 4.13 Error evolution in bagging.
    14. Figure 4.14 Variable relative importance in boosting.
    15. Figure 4.15 Error evolution in boosting.
    16. Figure 4.16 Variable relative importance in random forest.
    17. Figure 4.17 OOB error evolution in random forest.
    1. Figure 5.1 Variable relative importance in AdaBoost.
    2. Figure 5.2 Margin cumulative distribution in AdaBoost.
    3. Figure 5.3 Structure of the pruned tree.
    4. Figure 5.4 Variable relative importance in AdaBoost.M1 (three classes).
    5. Figure 5.5 Evolution of the test error in AdaBoost.M1 (three classes).
    1. Figure 6.1 Histological images of fish species Merluccius merluccius, with cell outlines manually annotated by experts. The continuous (respectively dashed) lines are cells with (resp. without) nucleus. The images contain cells in the different states of development (hydrated, cortical alveoli, vitellogenic, and atretic).
    2. Figure 6.2 Average accuracy (in %, left panel) and Friedman rank (right panel) over all the feature vectors for the detection of the nucleus (upper panels) and stage classification (lower panels) of fish ovary cells.
    3. Figure 6.3 Maximum accuracies (in %), in decreasing order, achieved by the different classifiers for the detection of the nucleus (left panel) and stage classification (right panel) of fish ovary cells.
    4. Figure 6.4 Average accuracy (left panel, in %) and Friedman ranking (right panel, decreasing with performance) of each classifier.
    5. Figure 6.5 Accuracies achieved by AdaBoost.M1, SVM, and random forest for each data set.
    6. Figure 6.6 Times achieved by the faster classifiers (DKP, SVM, and ELM) for each data set, ordered by increasing size of data set.
    7. Figure 6.7 Friedman rank (upper panel, increasing order) and average accuracies (lower panel, decreasing order) for the 25 best classifiers.
    8. Figure 6.8 Friedman rank interval for the classifiers of each family (upper panel) and minimum rank (by ascending order) for each family (lower panel).
    9. Figure 6.9 Examples of histological images of fish species Reinhardtius hippoglossoides, including oocytes with the six different development stages (PG, CA, VIT1, VIT2, VIT3, and VIT4).
    1. Figure 8.1 A tree structure.
    2. Figure 8.2 Pairs plot of some weather variables used in the tree analysis, with the intention of finding groups of similar features.
    3. Figure 8.3 CART structures for alfalfa decisions.
    4. Figure 8.4 CART structures for barley irrigation decisions.
    5. Figure 8.5 CART structures for corn irrigation decisions.
    1. Figure 10.1 Ranking of predictor importance for the boosting model evaluated for sample S1A.
    2. Figure 10.2 Ranking of predictor importance for the bagging model evaluated for sample S1A.
    1. Figure 11.1 Bootstrap confidence intervals and average trends for a selection of predictive features (from De Bock and Van den Poel (2011)).

Preface

This book introduces the reader to ensemble classification methods by describing the most commonly used techniques. The goal is not a complete analysis of every technique and its applications, nor an exhaustive tour through all the topics that arise in this continuously expanding field. Rather, the aim is to show in an intuitive way how ensemble classification arose as an extension of individual classifiers, to describe its basic characteristics, and to point out the kinds of problems that can emerge from its use. The book is therefore intended for anyone starting out in this field, especially students, teachers, researchers, and practitioners dealing with statistical classification.

To achieve these goals, the book is structured in two parts containing a total of 11 chapters. The first part, which is more theoretical, comprises the first four chapters, including the introduction. The second part, from the fifth chapter to the end, is much more practical in nature, illustrating how the previously studied techniques are applied with examples from business failure prediction, zoology, and ecology, among others.

After a brief introduction establishing the fundamental concepts of statistical classification with decision trees, the second chapter decomposes the generalization error into three terms (the Bayes risk, the bias, and the variance). It also studies the instability of classifiers, that is, how much a classifier changes in response to small changes in the training set. The three reasons proposed by Dietterich to explain the superiority of ensemble classifiers over single ones (statistical, computational, and representational) are given. Finally, the Bayesian perspective is mentioned.
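In schematic form (our notation, for orientation only; the precise definitions of the bias and variance terms under 0–1 loss vary between authors, and the chapter follows the decomposition attributed to Tibshirani (1996a) in Table 2.1), the decomposition reads

    \mathrm{Err}(C) \;=\; \underbrace{\mathrm{Err}(C^{*})}_{\text{Bayes risk}} \;+\; \mathrm{Bias}(C) \;+\; \mathrm{Var}(C),

where C* denotes the Bayes classifier, the bias measures the systematic deviation of the classifier C from C*, and the variance measures its sensitivity to the particular training set.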

In the third chapter, several taxonomies of ensemble methods are enumerated, focusing on the one that distinguishes between generative and non‐generative methods. After that, the bagging method is studied: it trains a set of base classifiers on bootstrap samples of the original training set and then combines them by majority vote. In addition, the boosting method is analysed, highlighting its most commonly used algorithm, AdaBoost. This algorithm repeatedly applies the base classification system to the training set, focusing in each iteration on the most difficult examples, and finally combines the resulting classifiers through a weighted majority vote. To end the chapter, the random forest method is briefly described: it grows a set of trees, introducing a degree of randomness into their construction to ensure some diversity in the final ensemble.
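To fix ideas, the bagging mechanism can be hand‐rolled in a few lines of R with the rpart package (a minimal sketch for illustration only; the number of trees B and the iris data are arbitrary choices, not the book's worked examples):

    # Hand-rolled bagging: grow B trees on bootstrap samples of the data
    # and combine their predictions by majority vote (illustration only).
    library(rpart)
    set.seed(1)
    B <- 25
    trees <- lapply(seq_len(B), function(b) {
      boot <- iris[sample(nrow(iris), replace = TRUE), ]   # bootstrap sample
      rpart(Species ~ ., data = boot)                      # base classification tree
    })
    # Majority vote over the B trees for every observation
    votes    <- sapply(trees, function(t) as.character(predict(t, iris, type = "class")))
    majority <- apply(votes, 1, function(v) names(which.max(table(v))))
    mean(majority == iris$Species)   # resubstitution accuracy of the ensemble

Boosting differs from this sketch in that the weighting of the training examples and the vote weights change from one iteration to the next, while random forests additionally restrict each split to a random subset of the predictors.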

The last chapter of the first part shows, with simple applications, how individual and ensemble classification trees are used in practice through the rpart, adabag, and randomForest R packages. Moreover, the improvement achieved over individual classification techniques is highlighted.
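As a preview of that chapter, the corresponding package calls look roughly as follows (a minimal sketch; the train/test split and the values of mfinal and ntree are illustrative, not the settings used in the book):

    library(rpart)         # individual classification trees
    library(adabag)        # bagging() and boosting() (AdaBoost.M1)
    library(randomForest)  # random forests

    set.seed(1)
    train <- sample(nrow(iris), 100)

    tree <- rpart(Species ~ ., data = iris[train, ])                  # single tree
    bag  <- bagging(Species ~ ., data = iris[train, ], mfinal = 50)   # bagging
    boo  <- boosting(Species ~ ., data = iris[train, ], mfinal = 50)  # AdaBoost.M1
    rf   <- randomForest(Species ~ ., data = iris[train, ], ntree = 500)

    # Test-set error of each model
    mean(predict(tree, iris[-train, ], type = "class") != iris$Species[-train])
    predict(bag, newdata = iris[-train, ])$error
    predict(boo, newdata = iris[-train, ])$error
    mean(predict(rf, iris[-train, ]) != iris$Species[-train])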

The second part begins with a chapter dealing with business failure prediction. Specifically, it compares the prediction accuracy of ensemble trees and neural networks for a set of European firms, considering the usual predictor variables, such as financial ratios, as well as qualitative variables, such as firm size, activity, and legal structure. It shows that ensemble trees decrease the generalization error by about 30% with respect to the error produced by a neural network.

The sixth chapter describes the experience of M. Fernández‐Delgado, E. Cernadas, and M. Pérez‐Ortiz in using ensemble methods to classify texture feature patterns in histological images of fish gonad cells. Their results were also good in comparison with ordinal classifiers for the stages of fish oocytes, whose development follows a natural time ordering.

In the seventh chapter W. Rejchel considers the ranking problem, which is popular in the machine learning community. The goal is to predict the ordering between objects on the basis of their observed features. The chapter focuses on ranking estimators obtained by minimizing an empirical risk with a convex loss function, for instance boosting algorithms. Generalization bounds for the excess risk of ranking estimators are constructed, and these bounds decrease faster than a given threshold rate. In addition, the quality of the procedures on simulated data sets is investigated.
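For orientation, one common way of writing the empirical ranking risk that such estimators minimize (our notation, not necessarily the chapter's) is the pairwise form

    \hat{Q}_n(f) \;=\; \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} \phi\bigl(-\operatorname{sgn}(Y_i - Y_j)\,[\,f(X_i) - f(X_j)\,]\bigr),

where \phi is a convex surrogate of the 0–1 loss (for instance the exponential loss used by boosting) and the ranking estimator is a minimizer of \hat{Q}_n over a given class of functions.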

In the eighth chapter S. Andriyas and M. McKee implement ensemble classification trees to analyze farmers' irrigation decisions and, from these, forecast future decisions. Readily available data on biophysical conditions in the fields and in the irrigation delivery system during the growing season can be used to anticipate irrigation water orders in the absence of any predictive socio‐economic information that could provide clues to future irrigation decisions. This, in turn, can be useful in making short‐term demand forecasts.

The ninth chapter, by M. Kubus, focuses on two properties of a boosted set of rules: he discusses stability and robustness to irrelevant variables, which can deteriorate the predictive ability of the model. He also compares the generalization errors of SLIPPER and AdaBoost.M1 in computational experiments using benchmark data and artificially generated irrelevant variables from various distributions.
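The flavour of such an experiment can be sketched in R as follows (a hedged illustration only, using AdaBoost.M1 from the adabag package and the Pima data from mlbench; the noise variables and their distributions are arbitrary examples, and SLIPPER itself is not fitted here):

    # Append irrelevant variables drawn from several distributions and
    # compare the test error of AdaBoost.M1 with and without them.
    library(mlbench)   # provides the PimaIndiansDiabetes data
    library(adabag)    # boosting() = AdaBoost.M1
    data(PimaIndiansDiabetes)

    set.seed(1)
    n <- nrow(PimaIndiansDiabetes)
    noisy <- cbind(PimaIndiansDiabetes,
                   junk_norm = rnorm(n),   # irrelevant N(0, 1) variable
                   junk_unif = runif(n),   # irrelevant U(0, 1) variable
                   junk_exp  = rexp(n))    # irrelevant Exp(1) variable

    train <- sample(n, round(0.7 * n))
    fit_clean <- boosting(diabetes ~ ., data = PimaIndiansDiabetes[train, ], mfinal = 50)
    fit_noisy <- boosting(diabetes ~ ., data = noisy[train, ], mfinal = 50)

    predict(fit_clean, PimaIndiansDiabetes[-train, ])$error
    predict(fit_noisy, noisy[-train, ])$error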

The tenth chapter shows how M. Chrzanowska, E. Alfaro, and D. Witkowska apply individual and ensemble trees to credit scoring, a crucial problem for a bank since it has a critical influence on its financial results. To assess credit risk (or a client's creditworthiness), various statistical tools may be used, including classification methods.

The aim of the last chapter, by K. W. De Bock, K. Coussement, and D. Cielen, is to introduce generalized additive models (GAMs) and GAM‐based ensembles, to give an overview of experiments conducted to evaluate and benchmark their performance, and to provide insights into these novel algorithms using real‐life data sets from various application domains.
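The underlying idea of a GAM‐based ensemble can be illustrated in a few lines of R by bagging generalized additive models fitted with mgcv (a hand‐rolled sketch on simulated data; this is not the authors' algorithm or the implementation evaluated in the chapter):

    # Bag B generalized additive models fitted on bootstrap samples and
    # average their predicted probabilities (illustration only).
    library(mgcv)

    set.seed(1)
    dat   <- data.frame(x1 = runif(300), x2 = runif(300))
    dat$y <- rbinom(300, 1, plogis(sin(6 * dat$x1) + 2 * (dat$x2 - 0.5)))

    B <- 20
    fits <- lapply(seq_len(B), function(b) {
      boot <- dat[sample(nrow(dat), replace = TRUE), ]         # bootstrap sample
      gam(y ~ s(x1) + s(x2), family = binomial, data = boot)   # member GAM with smooth terms
    })

    # Ensemble probability = average of the member GAMs' probabilities
    p_hat <- rowMeans(sapply(fits, predict, newdata = dat, type = "response"))
    mean((p_hat > 0.5) == dat$y)   # resubstitution accuracy of the ensemble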

Thanks are due to all our collaborators and colleagues, especially the Economics and Business Faculty of Albacete, the Public Economy, Statistics and Economic Policy Department, the Quantitative Methods and Socio‐economic Development Group (MECYDES) at the Institute for Regional Development (IDR) and the University of Castilla‐La Mancha (UCLM). At Wiley, we would like to thank Alison Oliver and Jemima Kingsly for their help, and two anonymous reviewers for their comments.

Finally, we thank our families for their understanding and help in every moment: Nieves, Emilio, Francisco, María, Esteban, and Pilar; Matías, Clara, David, and Enrique.