Second Edition
This second edition first published 2019
© 2019 John Wiley & Sons Ltd
Edition History
Blackwell. (1e 2008)
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Helmut F. van Emden to be identified as the author of this work has been asserted in accordance with law.
Registered Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
Editorial Office
9600 Garsington Road, Oxford, OX4 2DQ, UK
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging‐in‐Publication Data
Names: van Emden, H. F. (Helmut Fritz), author.
Title: Statistics for terrified biologists / Helmut F. van Emden (emeritus
professor of horticulture, School of Biological Sciences, The University
of Reading, UK).
Description: 2nd edition. | Hoboken, NJ : Wiley, 2019. | Includes
bibliographical references and index. |
Identifiers: LCCN 2019001708 (print) | LCCN 2019002653 (ebook) | ISBN
9781119563693 (Adobe PDF) | ISBN 9781119563686 (ePub) | ISBN 9781119563679
(pbk.)
Subjects: LCSH: Biometry–Textbooks. | Statistics–Textbooks.
Classification: LCC QH323.5 (ebook) | LCC QH323.5 .V33 2019 (print) | DDC
570.1/5195–dc23
LC record available at https://lccn.loc.gov/2019001708
Cover Design: Wiley
Cover Image: © gulfu photography/Getty Images
I have been astounded by the positive reception the first edition of my little book has received. It has clearly filled a need; students from many different countries have emailed me just to say ‘Thank you for writing the book.’ Student reviews on Amazon have given my book high praise, and colleagues teaching statistics to biologists have also found it of value. One review complained about the ‘lack of maths’. I take this as a compliment, and the review arguing that the availability of computer packages makes it unnecessary to be able to carry out the calculations ‘long‐hand’ seems to me to completely miss the point!
But I have to hang my head in shame at the numerous errors there were in the first printing of 2008. The numerical examples in particular were littered with computational errors (not infrequently in the form of dyslexic transpositions). The book went through several iterations before going to press, and I really should have rechecked all the calculations in the final proof. I can only apologise. However, these many errors led to a very encouraging outcome. The errors were largely identified by users, who then contacted me, convinced they had found an error. The statistics teaching I received never gave me similar confidence. That students were sure they were right and that I had made an error is almost the best evidence of the success of my book that I could have asked for.
So what's new in this second edition? The statistics that my book tries to teach are the concepts elaborated by R.A. Fisher in the 1930s, and today these remain the basis of the statistical procedures based on the normal distribution. I can tweak the English, change sections to make them more easily understood, and add some further extensions of Fisher's concepts – and all this I have done. I have been surprised how I, as a reader of the text of the first edition, have found sentences completely incomprehensible that were written by me as author 10 years ago! I have used the opportunity of the second edition to try to clarify these passages for both you and me! I have greatly revised and, I hope, simplified the chapter on linear regression and correlation and added some material to the chapter on chi‐square tests. However, the most significant addition has been that analysis of covariance, just briefly mentioned in the first edition, has been allocated a chapter of its own. It deserves emphasis as a valuable statistical technique, and I find most textbooks shroud it in mystery. Although I would advise you to use computer programs if you want to use analysis of covariance, I thought it a good idea to demystify it as much as I can. The calculation of numbers in analysis of covariance is usually presented as having no connection with more familiar calculations, yet they come from the standard techniques of analysis of variance and regression. The point is that only some of the resulting values are then used while the rest are discarded, but doing the analyses that include these redundant numbers does make the whole thing much more comprehensible.
I can only hope that this second edition proves as popular as the first.
Helmut F. van Emden
July 2018
I have written/edited several books on my own speciality, agricultural and horticultural entomology, but always at the request of a publisher or colleagues. This book is different in two ways. Firstly, it is a book I have positively wanted to write for many years; secondly, I am stepping out of my ‘comfort zone’ in doing so.
The origins of this book stem from my appointment to the Horticulture Department of Reading University under Professor O. V. S. Heath, FRS. Professor Heath appreciated the importance of statistics and, at a time when there was no University‐wide provision of statistics teaching, he taught the final year students himself. I became the ‘assistant’ whose role it was to run the practical exercises which followed the Professor's lecture. You cannot teach what you do not understand yourself, but I tried nonetheless.
Eventually I took over the entire course. By then it was taught in the second year and in the third year the students went on to take a Faculty‐wide course. I did not continue the lectures; the whole course was in the laboratory where I led the students (using pocket calculators) through the calculations in stages. The laboratory class environment involved continuous interaction with students in a way totally different from what happens in lectures, and it rapidly became clear to me that many biologists have their neurons wired up in a way that makes the traditional way of teaching statistics rather difficult for them.
What my students needed was confidence – confidence that statistical ideas and methods were not just theory, but actually worked with real biological data and, above all, had some basis in logic! As the years of teaching went on, I began to realise that the students regularly found the same steps a barrier to progress and damaging to their confidence. Year after year I tried new ways to help them over these ‘crisis points’; eventually I succeeded with all of them, I am told.
The efficacy of my unusual teaching aids can actually be quantified. After the Faculty course taught by professional statisticians, my students were formally examined together with cohorts of students from other departments in the Faculty (then of ‘Agriculture and Food’) who had attended the same third year course in Applied Statistics. My students mostly (perhaps all but three per year out of some 20) appeared in the mark list as a distinct block right at the upper end, with percentage marks in the 70s, 80s and even 90s. Although there may have also been one or two from other courses with high marks, there then tended to be a gap till marks in the lower 60s appeared and began a continuum down to single figures.
I therefore feel confident that this book will be helpful to biologists with its mnemonics such as SqADS and ‘you go along the corridor before you go upstairs’. Other things previously unheard of are the ‘lead line’ and ‘supertotals’ with their ‘subscripts’ – yet all have been appreciated as most helpful by my students over the years. A riffle through the pages will amaze – where are the equations and algebraic symbols? They have to a large extent been replaced by numbers and words. The biologists I taught – and I don't think they were atypical – could work out what to do with a ‘45’, but rarely what to do with an ‘x’. Also, I have found that there are a number of statistical principles students easily forget, and then inevitably run into trouble with their calculations. These concepts are marked with the symbol of a small elephant.
The book limits itself to the basic foundations of parametric statistics, the t‐test, analysis of variance, linear regression and chi‐square. However, the reader is guided as to where there are important extensions of these techniques, and there is an introduction to non‐parametric tests which includes a check list of non‐parametric methods linked to their parametric counterparts. Many chapters end with an ‘executive summary’ as a quick source for revision, and there are additional exercises to give the practice which is so essential to learning.
In order to minimise algebra, the calculations are explained with numerical examples. These, as well as the ‘spare‐time activity’ exercises, have come from many sources, and I regret that the origin of many has become lost in the mists of time. Quite a number come from experiments carried out by Horticulture students at Reading as part of their second year outdoor practicals, and others have been totally fabricated in order to ‘work out’ well. Others have had numbers or treatments changed better to fit what was needed. I can only apologise to anyone whose data I have used without due acknowledgement; failure to do so is not intentional. But please remember that data have often been fabricated or massaged – therefore do not rely on the results as scientific evidence for what they appear to show!
Today, computer programs take most of the sweat out of statistical procedures, and most biologists have access to professional statisticians. ‘Why bother to learn basic statistics?’ is therefore a perfectly fair question, akin to ‘Why keep a dog and bark?’ The question deserves an answer; to save repetition, my answer can be found towards the end of Chapter 1.
I am immensely grateful to the generations of Reading students who have challenged me to overcome their ‘hang‐ups’ and who have therefore contributed substantially to any success this book achieves. Also many postgraduate students as well as experienced visiting overseas scientists have encouraged me to turn my course into book form. My love and special thanks go to my wife Gillian who, with her own experience of biological statistics, has supported and encouraged me in writing this book; it is to her that I owe the imaginative title for the book.
Finally, I should like to thank Ward Cooper of Blackwells for having faith in this biologist, who is less terrified of statistics than he once was.
Helmut F. van Emden
December 2006
Don't be misled! This book cannot replace effort on your part. All it can aspire to do is to make that effort effective. The detective thriller only succeeds because you have read it too fast and not really concentrated – with that approach, you'll find this book just as mysterious.
In fact, you may not get very far if you just read this book at any speed! You will only succeed if you interact with the text, and how you might do this is the topic of most of this chapter.
The chapters, particularly 2–8, develop a train of thought essential to the subject of analysing biological data. You just have to take these chapters in order and quite slowly. There is only one way I know for you to maintain the concentration necessary to comprehension, and that is for you to make your own summary notes as you go along.
My Head of Department when I first joined the staff at Reading used to define a university lecture as ‘a technique for transferring information from a piece of paper in front of the lecturer to a piece of paper in front of the student, without passing through the heads of either’. That's why I stress making your own summary notes. You will retain very little by just reading the text; you'll find that after a while you've been thinking about something totally different while ‘reading’ several pages – we've all been there! The message you should take from my Head of Department's quote above is that just repeating in your writing what you are reading is little better than taking no notes at all: the secret is to digest what you have read and reproduce it in your own words and in summary form. Use plenty of headings and subheadings, boxes linked by arrows, cartoon drawings, etc. Another suggestion is to use different coloured pens for different recurring statistics, such as variance and correction factor. In fact, use anything that forces you to convert my text into as different a form as possible from the original; that will force you to concentrate, to involve your brain and to make it clear to you whether or not you have really understood that bit in the book so that it is safe to move on.
The actual process of making the notes is the critical step – you can throw the notes away at a later stage if you wish, though there's no harm in keeping them for a time for revision and reference.
So DON'T MOVE ON until you are ready. You'll only undo the value of previous effort if you persuade yourself that you are ready to move on when in your heart of hearts you know you are fooling yourself!
A key point in the book is Figure 7.5 on p. 64. Take real care to lay an especially good foundation up to there. If you really feel at home with this diagram, it is a sure sign that you have conquered any hang‐ups and are no longer a ‘terrified biologist’.
The obvious first step is to go back to the point in the book where you last felt confident, and start again from there.
However, it often helps to see how someone else has explained the same topic, so it's a good idea to have a look at the relevant pages of a different statistics text (see Appendix D for some suggestions). You could also look up the topic on the Internet, where many statisticians have put articles and their lectures to students.
A third possibility is to see if someone can explain things to you face to face. Do you know or have access to someone who might be able to help? If you are at university, it could be a fellow student or even one of the staff. The person who tried to teach statistics to my class at university failed completely as far as I was concerned, but later on I found he could explain things to me quite brilliantly in a one‐to‐one situation.
At certain points in the text you will find the sign of the elephant.
They say ‘elephants never forget’ and the symbol means just that: NEVER FORGET! I have used it to mark some key statistical concepts which, in my experience, people easily forget and as a result run into trouble later on and find it hard to see where they have gone wrong. So, take it from me that it is really well worth making sure these matters are firmly embedded in your memory.
As stated in the Preface to the First Edition, I soon learnt that biologists don't like x. For some reason they prefer a real number, and are more prepared to accept, say, 45 as representing any number than they are an x! Therefore, in order to avoid ‘algebra’ as far as possible, I have used actual numbers to illustrate the working of statistical analyses and tests. You probably won't gain a lot by keeping up with me on a hand calculator as I describe the different steps of a calculation, but you should make sure at each step that you understand where each number in a calculation has come from and why it has been included in that way.
When you reach the end of each worked analysis or test, however, you should go back to the original source of the data in the book and try to rework on a hand calculator the calculations which follow from just those original data. Try not to look up later stages in the calculations unless you are irrevocably stuck, and then use the executive summary (if there is one at the end of the chapter) rather than the main text.
There will be a lot of individual variation among readers of this book in the knowledge and experience of statistics they have gained in the past, and in their ability to grasp and retain statistical concepts. At certain points, therefore, some will be happy to move on without any further explanation from me or any further repetition of calculation procedures.
For those less happy to take things for granted at such points, I have placed the material and calculations they are likely to find helpful in boxes in order not to hold back or irritate the others. Calculations in the boxes may prove particularly helpful if, as suggested above, you are reworking a numerical example from the text and need to refer to a box to find out why you are stuck or perhaps where you went wrong.
The ‘spare‐time activities’ are numerical exercises placed at the end of several of the chapters; you should be equipped to complete them by the time you reach them.
That is the time to stop and do them. Unlike the within‐chapter numerical examples, you should feel quite free to use any material in previous chapters or executive summaries to remind you of the procedures involved and guide you through them. Use a hand calculator and remember to write down the results of intermediate calculations. This will make it much easier for you to detect where you went wrong if your answers do not match the solution to that exercise given in Appendix C. Do read the beginning of that appendix early on: it explains that you should not worry or waste time recalculating if your numbers are similar, even if they are not identical. I can assure you, you will recognise – when you compare your figures with the ‘solution’ – if you have followed the statistical steps of the exercise correctly; you will also immediately recognise if you have not.
Doing these exercises conscientiously with a hand calculator or spreadsheet, and when you reach them in the book rather than much later, is really important. They are the best things in the book for impressing the subject into your long‐term memory and for giving you confidence that you understand what you are doing.
The authors of most other statistics books recognise this and also include exercises. If you're willing, I would encourage you to gain more confidence and experience by going on to try the methods as described in this book on their exercises.
By the way, a blank spreadsheet such as Excel makes a grand substitute for a hand calculator, with the added advantage that repeat calculations (e.g. squaring numbers) can be copied and pasted from the first number to all the others.
Certain chapters end with an ‘executive summary’, which aims to condense the meat of the chapter into little more than a page. The summaries provide a condensed reference source for the calculations scattered throughout the previous chapter, with hopefully enough explanatory wording to jog your memory about how the calculations were derived. They will therefore prove useful when you tackle the spare‐time activities.
You might ask (and some of the reviews of the first edition did): why teach how to do statistical analyses on a hand calculator when we can type the data into a computer program and get all the calculations done automatically? It might have been useful once, but now…?
Well, I can assure you that you wouldn't ask that question if you had examined as many project reports and theses as I have, and seen the consequences of just ‘typing data into a computer program’. No, it does help to avoid trouble if you understand what the computer should be doing.
Why do I say that?
Right at the back of this book is a short list of other statistics books. Very many such books have been written, and I only have personal experience of a small selection. Some of these I have found particularly helpful, either to increase my comprehension of statistics (much needed at times!) or to find details of and recipes for more advanced statistical methods. I must emphasise that I have not even seen a majority of the books that have been published and that the ones that have helped me most may not be the ones that would be of most help to you. Omission of a title from my list implies absolutely no criticism of that book, and – if you see it in the library – do look at it carefully: it could well be the best book for you.
Statistics are summaries or collections of numbers. If you say ‘the tallest person among my friends is 173 cm tall’, that is a statistic. It is a number based on a scrutiny of lots of different numbers – the different heights of all your friends, but reporting just the largest number.
If you say ‘the average height of my friends is 158 cm’ – then that is another statistic. It again requires you to have collected the different heights of all your friends, but this time your statistic, the average height, is a summary statistic. It has summarised all the data you collected into a single number.
If you have lots and lots and lots of friends, it may not be practical to measure them all, but you can probably get a good estimate of the average height by measuring not all of them but a large sample, and calculating the average of the sample. Now the average of your sample, particularly if it's not very big, may not be identical to the true average of all your friends. This brings us to a key principle of statistics. We are usually trying to evaluate a parameter (from the Greek for ‘beyond measurement’) by making an estimate from a sample of a practical size. So we must always distinguish parameters and estimates. One example: in statistics we use the word mean for the estimate (from a sample of numbers) of something we can rarely measure; the parameter we call the average (of the entire existing population of numbers).
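The distinction between a parameter and its estimate can be sketched in a few lines of Python (used here only for illustration). The 500 ‘friends’ and their heights are invented, generated at random around 160 cm; the variable names are likewise illustrative and nothing below is real data.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical 'population': heights (cm) of 500 imaginary friends
population = [round(random.gauss(160, 10)) for _ in range(500)]

# The parameter: the true average of the entire population
population_average = mean(population)

# The estimate: the mean of a sample of practical size
sample = random.sample(population, 30)
sample_mean = mean(sample)

print(f"population average (parameter): {population_average:.1f} cm")
print(f"sample mean (estimate):         {sample_mean:.1f} cm")
```

Drawing a fresh sample from the same population would give a slightly different sample mean, while the population average stays fixed.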
‘Add together all the numbers in the sample, and divide by the number of numbers’. That's how we actually calculate a mean, isn't it? So even that very simple statistic takes 15 words to describe as a procedure. Things can get much more complicated (see Box 2.1).
We really have to find a shorthand way of expressing statistical computations, and this shorthand is called notation. The off‐putting thing about notation for biologists is that it tends to be algebraic in character. Also there is no universally accepted notation, and the variations between different textbooks are naturally pretty confusing to the beginner!
What is perhaps worse is a purely psychological problem for most biologists: your worry level has perhaps already risen at the very mention of algebra! Confront a biologist with an x instead of a number like 57 and there is a tendency to switch off the receptive centres of the brain altogether. Yet most statistical calculations involve nothing more terrifying than addition, subtraction, multiplication, and division – though I must admit you will also have to square numbers and find square roots. All this can now be done with the cheapest hand calculators.
Most of you now own or have access to a computer, where you only have to type the sampled numbers into a spreadsheet or other program and the machine has all the calculations that have to be done already programmed. So do computers remove the need to understand what their programs are doing? I don't think so! I discussed all this more fully in Chapter 1, but repeat it here in case you have skipped that chapter. Briefly, you need to know what programs are right for what sort of data, and what the limitations are. So an understanding of data analysis will enable you to plan more effective experiments. Remember that the computer will be quite content to process your figures perfectly inappropriately, if that is what you request! It may also be helpful to know how to interpret the final printout… correctly.
Back to the subject of notation. As I have just pointed out, we are going to be involved in quite unsophisticated number‐crunching and the whole point of notation is to remind us of the order in which we do this. Note I say ‘remind us’ and not ‘tell us’. It would be quite dangerous, when you start out, to think that notation enables you to do the calculations without previous experience. You just can't turn to p. 257 of a statistics book and expect to tackle something like:
without sufficient homework on the notation for it to remind you of what you have previously learnt! Incidentally, Box 2.1 translates this algebraic notation into English for two columns of paired numbers (values of x and y).
As pointed out earlier, the formula above gives you the order of using the three components in the calculation, i.e. you add together the two components under the line and then divide the top component by this sum. But the formula doesn't tell you how to calculate the three components. For this you have to identify the three components in English as:
These terms will probably mean nothing to you at this stage, but being able to calculate the sum of squares of a set of numbers is just about as common a procedure as calculating the mean.
We frequently have to cope with notation elsewhere in life. Recognise 03.11.92? It's a date, perhaps a ‘date of birth’. The Americans use a different notation; they would write the same birthday as 11.03.92. And do you recognise Cm? You probably do if you are a musician: it's notation for playing together the three notes C, E♭, and G – the chord of C minor (hence Cm).
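The date example can be made concrete with a short Python sketch. It assumes the first date is written day.month.year and the American version month.day.year, as described above; the variable names are illustrative.

```python
from datetime import datetime

# The format string is the 'notation' that tells Python how to read each date
british = datetime.strptime("03.11.92", "%d.%m.%y").date()   # day.month.year
american = datetime.strptime("11.03.92", "%m.%d.%y").date()  # month.day.year

print(british == american)  # True: the same birthday, 3 November 1992

# Reading the first string with the wrong notation gives a different day
misread = datetime.strptime("03.11.92", "%m.%d.%y").date()
print(misread)  # 1992-03-11
```

The same characters mean different things under different conventions, which is why notation only helps once you already know which convention is in use.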
In this book, the early chapters include notation to help you remember what statistics (i.e. summaries of data) such as sums of squares are and how they are calculated. However, as soon as possible, we will be using keywords such as sums of squares to replace blocks of algebraic notation. This should make the pages less frightening and make the text flow better. After all, you can always go back to an earlier chapter if you need reminding of the notation and the calculation it represents. I guess it's a bit like cooking! The first half‐dozen times you want to make pancakes you need the cookbook to provide the information that 300 ml milk goes with 125 g plain flour, one egg, and some salt, but the day soon comes when you merely think the word batter!
Hopefully, no one is going to baulk at the challenge of calculating the mean height of a sample of five people, say 149, 176, 152, 180, and 146 cm, by totalling the numbers and dividing by 5.
In statistical notation, the instruction to total is ∑, and the whole series of numbers to total is called by a letter, often x or y. So ∑x means ‘add together all the numbers in the series called x’, the five heights of people in our example. We use n for the ‘number of numbers’, 5 in the example, making the full notation for a mean:
$$\frac{\sum x}{n}$$
However, we use the mean so often that it has another even shorter notation: the identifying letter for the series (e.g. x) with a line over the top, i.e. $\bar{x}$.
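For readers who like to check such things on a computer, the mean of the five heights can be traced step by step in a short Python sketch. The variable names `sum_x`, `n` and `x_bar` are illustrative choices that mirror the notation, not anything prescribed by the book.

```python
heights = [149, 176, 152, 180, 146]  # the series x: five heights in cm

sum_x = sum(heights)  # 'add together all the numbers in the series called x'
n = len(heights)      # the 'number of numbers'
x_bar = sum_x / n     # the total divided by the number of numbers

print(sum_x, n, x_bar)  # 803 5 160.6
```

Writing down the intermediate total (803) as well as the final mean (160.6 cm) follows the same habit recommended for hand-calculator work: it makes a slip much easier to trace.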