This edition first published 2018
© 2018 John Wiley & Sons Ltd
The right of Samuel E. Buttrey and Lyn R. Whitaker to be identified as the authors of this work has been asserted in accordance with law.
Library of Congress Cataloging-in-Publication Data applied for
Hardback ISBN: 9781119080022
To Elinda and Mike
Samuel E. Buttrey received a bachelor's degree in statistics from Princeton University in 1983. After 8 years as a Wall Street computer systems analyst, he returned to graduate school and received MA and PhD degrees in statistics from the University of California at Berkeley, the latter in 1996. In that year, he joined the faculty of the Department of Operations Research at the Naval Postgraduate School in Monterey, California. He has published papers on nearest-neighbor and other classification methods and on applied problems ranging from numismatics and oceanography to human vision. He has also published papers describing his implementations of algorithms in software. His interests include classification, computationally intensive methods, and statistical graphics, and most recently, inter-point distance measures for mixed categorical and numeric data. He lives in Pacific Grove, California, with wife Elinda, son John, and some cats.
Lyn R. Whitaker received a bachelor's degree in genetics in 1978 and a PhD in statistics from the University of California, Davis, in 1985. She was an Assistant Professor in the Department of Statistics and Applied Probability at the University of California at Santa Barbara from 1985 to 1988, and joined the faculty of the Department of Operations Research at the Naval Postgraduate School in 1988. Her interests are applied statistics relevant to defense issues. These include unsupervised methods for large and messy data, the statistical aspects of reliability and survival analysis, and most recently, jointly with Buttrey, development and use of inter-point distances for mixed data types. She resides in Monterey, California, with husband Mike, father Fred, and, occasionally, children Alex, Lee, and Mary.
Statisticians use data to build models, and they use models to describe the world and to make predictions about what will happen next. Many very good books describe statistical modeling, but these modeling efforts usually start with a set of “clean,” well-behaved data in which nothing is missing or anomalous.
In real life, data is messy. There will be missing values, impossible values, and typographical errors. Data is gathered from multiple sources, leading to both duplication and inconsistency. Data that should be categorical is coded as numeric; data that should be numeric can appear categorical; data can be hidden inside free-form text; and dates can appear in a wide variety of formats. We estimate that 80% of the time spent on any data analysis problem goes to reading and preparing the data. So any analyst needs to know how to acquire data and how to prepare it for modeling, and the steps taken should be, as far as possible, automatic and reproducible.
This book describes how to handle data using the R software. R is the most widely used software in statistics, and it has the advantage of being free, open-source, and available on every major computing platform. Whatever software you use, you will find yourself facing the issues of acquiring, cleaning, and merging data, and documenting the steps you took. We hope this book will help you do these things efficiently.
Sam Buttrey and Lyn Whitaker
Monterey, California, USA
November 30, 2016
Our book is about how to use R to process data. We use R because it is powerful, versatile, and extensible. We thank the developers of R for their service to the statistical community in producing a high-quality piece of open-source software. We also thank the long list of colleagues and students who have helped frame our thinking about questions of statistics and data.
Don't forget to visit the companion website for this book:
www.wiley.com/go/buttrey/datascientistsguide
There you will find valuable material designed to enhance your learning, including: