Edited by Esteban Alfaro, Matías Gámez and Noelia García
This edition first published 2019
© 2019 John Wiley & Sons Ltd
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Esteban Alfaro, Matías Gámez and Noelia García to be identified as the authors of the editorial material in this work has been asserted in accordance with law.
Registered Offices
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
Editorial Office
9600 Garsington Road, Oxford, OX4 2DQ, UK
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging‐in‐Publication Data
Names: Alfaro, Esteban, 1977- editor. | Gámez, Matías, 1966- editor. | García, Noelia, 1973- editor.
Title: Ensemble classification methods with applications in R / edited by Esteban Alfaro, Matías Gámez, Noelia García.
Description: Hoboken, NJ : John Wiley & Sons, 2019. | Includes bibliographical references and index.
Identifiers: LCCN 2018022257 (print) | LCCN 2018033307 (ebook) | ISBN 9781119421573 (Adobe PDF) | ISBN 9781119421559 (ePub) | ISBN 9781119421092 (hardcover)
Subjects: LCSH: Machine learning-Statistical methods. | R (Computer program language)
Classification: LCC Q325.5 (ebook) | LCC Q325.5 .E568 2018 (print) | DDC 006.3/1-dc23
LC record available at https://lccn.loc.gov/2018022257
Cover Design: Wiley
Cover Image: Courtesy of Esteban Alfaro via wordle.net
Esteban Alfaro, Economics and Business Faculty, Institute for Regional Development, University of Castilla‐La Mancha, Spain.
Sanyogita Andriyas, Department of Civil & Environmental Engineering, Utah State University, Logan, Utah, USA.
Eva Cernadas, Centro Singular de Investigación en Tecnoloxías da Información (CiTIUS), University of Santiago de Compostela, Santiago de Compostela, Spain.
Mariola Chrzanowska, Faculty of Applied Informatics and Mathematics, Department of Econometrics and Statistics, Warsaw University of Life Sciences (SGGW), Warsaw, Poland.
Davy Cielen, Maastricht School of Management, Maastricht, the Netherlands.
Kristof Coussement, IESEG Center for Marketing Analytics (ICMA), IESEG School of Management, Université Catholique de Lille, Lille, France.
Koen W. De Bock, Audencia Business School, Nantes, France.
Manuel Fernández‐Delgado, Centro Singular de Investigación en Tecnoloxías da Información (CiTIUS), University of Santiago de Compostela, Santiago de Compostela, Spain.
Matías Gámez, Institute for Regional Development, University of Castilla‐La Mancha, Spain.
Noelia García, Economics and Business Faculty, University of Castilla‐La Mancha, Spain.
Mariusz Kubus, Department of Mathematics and Computer Science Applications, Opole University of Technology, Poland.
Mac McKee, Utah Water Research Laboratory and Department of Civil & Environmental Engineering, Utah State University, Logan, Utah, USA.
María Pérez‐Ortiz, Department of Quantitative Methods, University of Loyola Andalucía, Córdoba, Spain.
Wojciech Rejchel, Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Torun, Poland; Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warsaw, Poland.
Dorota Witkowska, Department of Finance and Strategic Management, University of Lodz, Lodz, Poland.
This book introduces the reader to ensemble classification methods by describing the most commonly used techniques. The goal is neither a complete analysis of all the techniques and their applications, nor an exhaustive tour through every subject and aspect of this continuously expanding field. Rather, the aim is to show in an intuitive way how ensemble classification has arisen as an extension of individual classifiers, to describe its basic characteristics, and to point out what kinds of problems can emerge from its use. The book is therefore intended for anyone interested in getting started in this field, especially students, teachers, researchers, and practitioners dealing with statistical classification.
To achieve these goals, the work has been structured in two sections containing a total of 11 chapters. The first section, which is more theoretical, contains the four initial chapters, including the introduction. The second section, from the fifth chapter to the end, is much more practical in nature, illustrating with examples from business failure prediction, zoology, and ecology, among other fields, how the previously studied techniques are applied.
After a brief introduction establishing the fundamental concepts of statistical classification with decision trees, the second chapter decomposes the generalization error into three terms (the Bayes risk, the bias, and the variance). Moreover, the instability of classifiers is studied, examining how a classifier changes in response to small changes in the training set. The three reasons proposed by Dietterich to explain the superiority of ensemble classifiers over single ones (statistical, computational, and representational) are given. Finally, the Bayesian perspective is mentioned.
In the third chapter, several taxonomies of ensemble methods are enumerated, focusing on the one that distinguishes between generative and non‐generative methods. After that, the bagging method is studied. It uses several bootstrap samples of the original set to train a collection of base classifiers, which are afterwards combined by majority vote. In addition, the boosting method is analysed, highlighting its most commonly used algorithm, AdaBoost. This algorithm repeatedly applies the classification system to the training set but focuses, in each iteration, on the most difficult examples, and finally combines the classifiers built along the way through a weighted majority vote. To end the chapter, the random forest method is briefly described. It generates a set of trees, introducing a degree of randomness into their building process to ensure some diversity in the final ensemble.
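The bagging scheme just described can be sketched in a few lines of R. This is a minimal illustration, not the book's own code: it trains one classification tree per bootstrap sample and combines the predictions by majority vote, using the built-in iris data and an arbitrary number of bootstrap replicates.

```r
# Minimal bagging sketch: trees on bootstrap samples, combined by majority vote.
library(rpart)

set.seed(1)
mfinal <- 25            # number of bootstrap samples (illustrative choice)
n <- nrow(iris)
trees <- vector("list", mfinal)

# Train one classification tree per bootstrap sample of the original data
for (m in 1:mfinal) {
  boot <- sample(n, n, replace = TRUE)
  trees[[m]] <- rpart(Species ~ ., data = iris[boot, ])
}

# Collect each tree's class predictions, then take a simple majority vote
votes <- sapply(trees, function(tr) as.character(predict(tr, iris, type = "class")))
majority <- apply(votes, 1, function(v) names(which.max(table(v))))

mean(majority == iris$Species)  # resubstitution accuracy of the ensemble
```

AdaBoost differs from this sketch in that the training examples are reweighted at each iteration and the final vote is weighted by each classifier's accuracy, as described above.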
The last chapter of the first part shows with simple applications how individual and ensemble classification trees are applied in practice using the rpart, adabag, and randomForest R packages. Moreover, the improvement achieved over individual classification techniques is highlighted.
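For orientation, the following sketch shows how the three packages are typically called. It is a hedged illustration on the iris data, not an excerpt from the book; the parameter values (mfinal, ntree) are arbitrary examples, not recommendations.

```r
# Illustrative calls to the three packages discussed in the first part.
library(rpart)         # single classification trees
library(adabag)        # bagging and AdaBoost ensembles of trees
library(randomForest)  # random forests

set.seed(1)
single <- rpart(Species ~ ., data = iris)                  # one tree
bag    <- bagging(Species ~ ., data = iris, mfinal = 50)   # bagging
boost  <- boosting(Species ~ ., data = iris, mfinal = 50)  # AdaBoost
forest <- randomForest(Species ~ ., data = iris, ntree = 500)

# adabag's predict methods return the confusion matrix and error rate directly
predict(bag, newdata = iris)$error
predict(boost, newdata = iris)$error
print(forest)  # includes the out-of-bag error estimate
```

In practice the error would be estimated on a held-out test set or by cross-validation rather than on the training data, as the book's applications do.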
The second part begins with a chapter dealing with business failure prediction. Specifically, it compares the prediction accuracy of ensemble trees and neural networks for a set of European firms, considering the usual predictor variables, such as financial ratios, as well as qualitative variables, such as firm size, activity, and legal structure. It shows that ensemble trees decrease the generalization error by about 30% with respect to the error produced by a neural network.
The sixth chapter describes the experience of M. Fernández‐Delgado, E. Cernadas, and M. Pérez‐Ortiz in using ensemble methods to classify texture feature patterns in histological images of fish gonad cells. The results also compared well against ordinal classifiers on the developmental stages of fish oocytes, which follow a natural time ordering.
In the seventh chapter W. Rejchel considers the ranking problem, which is popular in the machine learning community. The goal is to predict the ordering between objects on the basis of their observed features. This work focuses on ranking estimators obtained by minimizing an empirical risk with a convex loss function, as in boosting algorithms. Generalization bounds for the excess risk of such ranking estimators, decreasing faster than a given threshold rate, are constructed. In addition, the quality of the procedures is investigated on simulated data sets.
In the eighth chapter S. Andriyas and M. McKee implement ensemble classification trees to analyze farmers' irrigation decisions and consequently forecast future decisions. Readily available data on biophysical conditions in fields and the irrigation delivery system during the growing season can be utilized to anticipate irrigation water orders in the absence of any predictive socio‐economic information that could be used to provide clues for future irrigation decisions. This can subsequently be useful in making short‐term demand forecasts.
The ninth chapter, by M. Kubus, focuses on two properties of a boosted set of rules: stability, and robustness to irrelevant variables, which can deteriorate the predictive ability of the model. He also compares the generalization errors of SLIPPER and AdaBoost.M1 in computational experiments using benchmark data and artificially generated irrelevant variables drawn from various distributions.
The tenth chapter shows how M. Chrzanowska, E. Alfaro, and D. Witkowska apply individual and ensemble trees to credit scoring, a crucial problem for a bank given its direct influence on the bank's financial results. To assess credit risk (or a client's creditworthiness), various statistical tools may be used, including classification methods.
The aim of the last chapter, by K. W. De Bock, K. Coussement, and D. Cielen, is to provide an introduction to generalized additive models (GAMs) and GAM‐based ensembles, an overview of experiments conducted to evaluate and benchmark their performance, and insights into these novel algorithms using real‐life data sets from various application domains.
Thanks are due to all our collaborators and colleagues, especially the Economics and Business Faculty of Albacete, the Public Economy, Statistics and Economic Policy Department, the Quantitative Methods and Socio‐economic Development Group (MECYDES) at the Institute for Regional Development (IDR), and the University of Castilla‐La Mancha (UCLM). At Wiley, we would like to thank Alison Oliver and Jemima Kingsly for their help, and the two anonymous reviewers for their comments.
Finally, we thank our families for their understanding and help in every moment: Nieves, Emilio, Francisco, María, Esteban, and Pilar; Matías, Clara, David, and Enrique.