Statistics for Terrified Biologists

Second Edition

Helmut F. van Emden

Emeritus Professor of Horticulture,
School of Biological Sciences,
The University of Reading, UK

Preface to the second edition

I have been astounded by the positive reception the first edition of my little book has received. It has clearly filled a need; students from many different countries have emailed me just to say ‘Thank you for writing the book.’ Student reviews on Amazon have given my book high praise, and colleagues teaching statistics to biologists have also found it of value. One review complained about the ‘lack of maths’. I take this as a compliment, and the review arguing that the availability of computer packages makes it unnecessary to be able to carry out the calculations ‘long‐hand’ seems to me to completely miss the point!

But I have to hang my head in shame at the numerous errors there were in the first printing of 2008. Particularly the numerical examples were littered with computational errors (not infrequently in the form of dyslexic transpositions). The book went through several iterations before going to press, and I really should have rechecked all the calculations in the final proof. I can only apologise. However, these many errors led to a very encouraging outcome. The errors were largely identified by users, who then contacted me convinced they had found an error. The statistics teaching I received never gave me similar confidence. That students were sure they were right and that I had made an error is almost the best evidence of the success of my book that I could have asked for.

So what's new in this second edition? The statistics that my book tries to teach are the concepts elaborated by R.A. Fisher in the 1930s, and today these remain the basis of the statistical procedures based on the normal distribution. I can tweak the English, change sections to make them more easily understood, and add some further extensions of Fisher's concepts – and all this I have done. I have been surprised how I, as a reader of the text of the first edition, have found sentences completely incomprehensible that were written by me as author 10 years ago! I have used the opportunity of the second edition to try to clarify these passages for both you and me! I have greatly revised and, I hope, simplified the chapter on linear regression and correlation and added some material to the chapter on chi‐square tests. However, the most significant addition has been that analysis of covariance, just briefly mentioned in the first edition, has been allocated a chapter of its own. It deserves emphasis as a valuable statistical technique, and I find most textbooks shroud it in mystery. Although I would advise you to use computer programs if you want to use analysis of covariance, I thought it a good idea to demystify it as much as I can. The calculation of numbers in analysis of covariance is usually presented as having no connection with more familiar calculations, yet they come from the standard techniques of analysis of variance and regression. The point is that only some of the resulting values are then used while the rest are discarded, but doing the analyses that include these redundant numbers does make the whole thing much more comprehensible.

I can only hope that this second edition proves as popular as the first.

Helmut F. van Emden

July 2018

Preface to the first edition

I have written/edited several books on my own speciality, agricultural and horticultural entomology, but always at the request of a publisher or colleagues. This book is different in two ways. Firstly, it is a book I have positively wanted to write for many years and secondly, I am stepping out of my ‘comfort zone’ in doing so.

The origins of this book stem from my appointment to the Horticulture Department of Reading University under Professor O. V. S. Heath, FRS. Professor Heath appreciated the importance of statistics and, at a time when there was no University‐wide provision of statistics teaching, he taught the final year students himself. I became the ‘assistant’ whose role it was to run the practical exercises which followed the Professor's lecture. You cannot teach what you do not understand yourself, but I tried nonetheless.

Eventually I took over the entire course. By then it was taught in the second year and in the third year the students went on to take a Faculty‐wide course. I did not continue the lectures; the whole course was in the laboratory where I led the students (using pocket calculators) through the calculations in stages. The laboratory class environment involved continuous interaction with students in a way totally different from what happens in lectures, and it rapidly became clear to me that many biologists have their neurons wired up in a way that makes the traditional way of teaching statistics rather difficult for them.

What my students needed was confidence – confidence that statistical ideas and methods were not just theory, but actually worked with real biological data and, above all, had some basis in logic! As the years of teaching went on, I began to realise that the students regularly found the same steps a barrier to progress and a damage to their confidence. Year after year I tried new ways to help them over these ‘crisis points’; eventually I succeeded with all of them, I am told.

The efficacy of my unusual teaching aids can actually be quantified. After the Faculty course taught by professional statisticians, my students were formally examined together with cohorts of students from other departments in the Faculty (then of ‘Agriculture and Food’) who had attended the same third year course in Applied Statistics. My students mostly (perhaps all but three per year out of some 20) appeared in the mark list as a distinct block right at the upper end, with percentage marks in the 70s, 80s and even 90s. Although there may have also been one or two from other courses with high marks, there then tended to be a gap till marks in the lower 60s appeared and began a continuum down to single figures.

I therefore feel confident that this book will be helpful to biologists with its mnemonics such as SqADS and ‘you go along the corridor before you go upstairs’. Other things previously unheard of are the ‘lead line’ and ‘supertotals’ with their ‘subscripts’ – yet all have been appreciated as most helpful by my students over the years. A riffle through the pages will amaze – where are the equations and algebraic symbols? They have to a large extent been replaced by numbers and words. The biologists I taught – and I don't think they were atypical – could work out what to do with a ‘45’, but rarely what to do with an ‘x’. Also, I have found that there are a number of statistical principles students easily forget, and then inevitably run into trouble with their calculations. These concepts are marked with the symbol of a small elephant.

The book limits itself to the basic foundations of parametric statistics, the t‐test, analysis of variance, linear regression and chi‐square. However, the reader is guided as to where there are important extensions of these techniques, and there is an introduction to non‐parametric tests which includes a check list of non‐parametric methods linked to their parametric counterparts. Many chapters end with an ‘executive summary’ as a quick source for revision, and there are additional exercises to give the practice which is so essential to learning.

In order to minimise algebra, the calculations are explained with numerical examples. These, as well as the ‘spare‐time activity’ exercises have come from many sources, and I regret the origin of many has become lost in the mists of time. Quite a number come from experiments carried out by Horticulture students at Reading as part of their second year outdoor practicals, and others have been totally fabricated in order to ‘work out’ well. Others have had numbers or treatments changed better to fit what was needed. I can only apologise to anyone whose data I have used without due acknowledgement; failure to do so is not intentional. But please remember that data have often been fabricated or massaged – therefore do not rely on the results as scientific evidence for what they appear to show!

Today, computer programs take most of the sweat out of statistical procedures, and most biologists have access to professional statisticians. ‘Why bother to learn basic statistics?’ is therefore a perfectly fair question, akin to ‘Why keep a dog and bark?’ The question deserves an answer; to save repetition, my answer can be found towards the end of Chapter 1.

I am immensely grateful to the generations of Reading students who have challenged me to overcome their ‘hang‐ups’ and who have therefore contributed substantially to any success this book achieves. Also many postgraduate students as well as experienced visiting overseas scientists have encouraged me to turn my course into book form. My love and special thanks go to my wife Gillian who, with her own experience of biological statistics, has supported and encouraged me in writing this book; it is to her that I owe the imaginative title for the book.

Finally, I should like to thank Ward Cooper of Blackwells for having faith in this biologist, who is less terrified of statistics than he once was.

Helmut F. van Emden

December 2006

1
How to use this book

Introduction

Don't be misled! This book cannot replace effort on your part. All it can aspire to do is to make that effort effective. The detective thriller only succeeds because you have read it too fast and not really concentrated – with that approach, you'll find this book just as mysterious.

In fact, you may not get very far if you just read this book at any speed! You will only succeed if you interact with the text, and how you might do this is the topic of most of this chapter.

The text of the chapters

The chapters, particularly 2–8, develop a train of thought essential to the subject of analysing biological data. You just have to take these chapters in order and quite slowly. There is only one way I know for you to maintain the concentration necessary to comprehension, and that is for you to make your own summary notes as you go along.

My Head of Department when I first joined the staff at Reading used to define a university lecture as ‘a technique for transferring information from a piece of paper in front of the lecturer to a piece of paper in front of the student, without passing through the heads of either’. That's why I stress making your own summary notes. You will retain very little by just reading the text; you'll find that after a while you've been thinking about something totally different while ‘reading’ several pages – we've all been there! The message you should take from my Head of Department's quote above is that just repeating in your writing what you are reading is little better than taking no notes at all: the secret is to digest what you have read and reproduce it in your own words and in summary form. Use plenty of headings and subheadings, boxes linked by arrows, cartoon drawings, etc. Another suggestion is to use different coloured pens for different recurring statistics, such as variance and correction factor. In fact, use anything that forces you to convert my text into as different a form as possible from the original; that will force you to concentrate, to involve your brain and to make it clear to you whether or not you have really understood that bit in the book so that it is safe to move on.

The actual process of making the notes is the critical step – you can throw the notes away at a later stage if you wish, though there's no harm in keeping them for a time for revision and reference.

So DON'T MOVE ON until you are ready. You'll only undo the value of previous effort if you persuade yourself that you are ready to move on when in your heart of hearts you know you are fooling yourself!

A key point in the book is Figure 7.5 on p. 64. Take real care to lay an especially good foundation up to there. If you really feel at home with this diagram, it is a sure sign that you have conquered any hang‐ups and are no longer a ‘terrified biologist’.

What should you do if you run into trouble?

The obvious first step is to go back to the point in the book where you last felt confident, and start again from there.

However, it often helps to see how someone else has explained the same topic, so it's a good idea to have a look at the relevant pages of a different statistics text (see Appendix D for some suggestions). You could also look up the topic on the Internet, where many statisticians have put articles and their lectures to students.

A third possibility is to see if someone can explain things to you face to face. Do you know or have access to someone who might be able to help? If you are at university, it could be a fellow student or even one of the staff. The person who tried to teach statistics to my class at university failed completely as far as I was concerned, but later on I found he could explain things to me quite brilliantly in a one‐to‐one situation.

Elephants

At certain points in the text you will find the sign of the elephant.

They say ‘elephants never forget’ and the symbol means just that: NEVER FORGET! I have used it to mark some key statistical concepts which, in my experience, people easily forget and as a result run into trouble later on and find it hard to see where they have gone wrong. So, take it from me that it is really well worth making sure these matters are firmly embedded in your memory.

The numerical examples in the text

As stated in the Preface to the First Edition, I soon learnt that biologists don't like x. For some reason they prefer a real number but are more prepared to accept, say, 45 as representing any number than they are an x! Therefore, in order to avoid ‘algebra’ as far as possible, I have used actual numbers to illustrate the working of statistical analyses and tests. You probably won't gain a lot by keeping up with me on a hand calculator as I describe the different steps of a calculation, but you should make sure at each step that you understand where each number in a calculation has come from and why it has been included in that way.

When you reach the end of each worked analysis or test, however, you should go back to the original source of the data in the book and try to rework on a hand calculator the calculations which follow from just those original data. Try not to look up later stages in the calculations unless you are irrevocably stuck, and then use the executive summary (if there is one at the end of the chapter) rather than the main text.

Boxes

There will be a lot of individual variation among readers of this book in the knowledge and experience of statistics they have gained in the past, and in their ability to grasp and retain statistical concepts. At certain points, therefore, some will be happy to move on without any further explanation from me or any further repetition of calculation procedures.

For those less happy to take things for granted at such points, I have placed the material and calculations they are likely to find helpful in boxes in order not to hold back or irritate the others. Calculations in the boxes may prove particularly helpful if, as suggested above, you are reworking a numerical example from the text and need to refer to a box to find out why you are stuck or perhaps where you went wrong.

Spare‐time activities

These are numerical exercises you should be equipped to complete by the time you reach them at the end of several of the chapters.

That is the time to stop and do them. Unlike the within‐chapter numerical examples, you should feel quite free to use any material in previous chapters or executive summaries to remind you of the procedures involved and guide you through them. Use a hand calculator and remember to write down the results of intermediate calculations. This will make it much easier for you to detect where you went wrong if your answers do not match the solution to that exercise given in Appendix C. Do read the beginning of that appendix early on: it explains that you should not worry or waste time recalculating if your numbers are similar, even if they are not identical. I can assure you, you will recognise – when you compare your figures with the ‘solution’ – if you have followed the statistical steps of the exercise correctly; you will also immediately recognise if you have not.

Doing these exercises conscientiously with a hand calculator or spreadsheet, and when you reach them in the book rather than much later, is really important. They are the best things in the book for impressing the subject into your long‐term memory and for giving you confidence that you understand what you are doing.

The authors of most other statistics books recognise this and also include exercises. If you're willing, I would encourage you to gain more confidence and experience by going on to try the methods as described in this book on their exercises.

By the way, a blank spreadsheet such as Excel makes a grand substitute for a hand calculator, with the added advantage that repeat calculations (e.g. squaring numbers) can be copied and pasted from the first number to all the others.

Executive summaries

Certain chapters end with such a summary, which aims to condense the meat of the chapter into little over a page or so. The summaries provide a condensed reference source for the calculations scattered throughout the previous chapter, with hopefully enough explanatory wording to jog your memory about how the calculations were derived. They will therefore prove useful when you tackle the spare‐time activities.

Why go to all that bother?

You might ask (and some of the reviews of the first edition did): why teach how to do statistical analyses on a hand calculator when we can type the data into a computer program and get all the calculations done automatically? It might have been useful once, but now…?

Well, I can assure you that you wouldn't ask that question if you had examined as many project reports and theses as I have, and seen the consequences of just ‘typing data into a computer program’. No, it does help to avoid trouble if you understand what the computer should be doing.

Why do I say that?

  • Planning experiments is made much more effective if you understand the advantages and disadvantages of different experimental designs and how they affect the experimental error against which we test our differences between treatments. It probably won't mean much to you now, but you really do need to understand how experimental design as well as treatment and replicate numbers impact the residual degrees of freedom and whether you should be looking at one‐tailed or two‐tailed statistical tables. My advice to my students has always been that, before embarking on an experiment, they should draw up a blank form on which to enter the results, then invent some results and complete the appropriate analysis on them. It can often cause you to think again.
  • Although the computer can carry out your calculations for you, it has the terminal drawback that it will accept the numbers you type in without challenging you as to whether what you are asking it to do with them is sensible. Thus – and again at this stage you'll have to accept my word that these are critical issues – no window will appear on the screen that says: ‘Whoa! You should be analysing these numbers non‐parametrically’ or ‘No problem. I can do an ordinary factorial analysis of variance, but you seem to have forgotten you actually used a split‐plot design’ or ‘These numbers are clearly pairs; why don't you exploit the advantages of pairing in the t‐test that you've told me to do?’ or ‘I'm surprised you are asking for the statistics for drawing a straight line through the points on this obvious hollow curve.’ I could go on.
  • You will no doubt use computer programs rather than a hand calculator for your statistical calculations in the future. But the printouts from these programs are often not particularly user‐friendly. They usually assume some knowledge of the internal structure of the analysis the computer has carried out, and abbreviations identify the numbers printed out. So obviously an understanding of what your computer program is doing and familiarity with statistical terminology can only be a help.
  • A really important value you will gain from this book is confidence that statistical methods are not a ‘black box’ somewhere inside a computer, but that you could in extremis (and with this book at your side) carry out the analyses and tests on the back of an envelope with a hand calculator. Also, once you have become content that the methods covered in this book are concepts you understand, you will probably be happier using the relevant computer programs.
  • More than that, you will probably be happy to expand the methods you use to ones I have not covered, on the basis that they are likely also to be logical, sensible, and understandable routes to passing satisfactory judgements on biological data. Expansions of the methods included in this book (e.g. those mentioned at the end of Chapter 17) will require you to use numbers produced by the calculations I have covered. You should be able confidently to identify which these are.
  • You will probably find yourself discussing your proposed experiment and later the appropriate analysis with a professional statistician. It does so help to speak the same language! Additionally, the statistician will be of much more help to you if you are competent to see where the statistical advice given has missed a constraint arising from biological realities.
  • Finally, there is the intellectual satisfaction of mastering a subject which can come hard to biologists. Unfortunately, you won't appreciate it was worth doing until you view the effort from the hindsight of having succeeded. I assure you the reward is real. I can still remember vividly the occasion many years ago when, in the middle of teaching an undergraduate statistics class, I realised how simple the basic idea behind the analysis of variance was, and how this extraordinary simplicity was only obfuscated for a biologist by the short‐cut calculation methods used. In other words, I was in a position to write Chapter 10. Later, the gulf between most biologists and trained statisticians was really brought home to me by one of the latter's comments on an early version of this book: ‘I suggest Chapter 10 should be deleted; it's not the way we do it.’ I rest my case!

The bibliography

Right at the back of this book is a short list of other statistics books. Very many such books have been written, and I only have personal experience of a small selection. Some of these I have found particularly helpful, either to increase my comprehension of statistics (much needed at times!) or to find details of and recipes for more advanced statistical methods. I must emphasise that I have not even seen a majority of the books that have been published and that the ones that have helped me most may not be the ones that would be of most help to you. Omission of a title from my list implies absolutely no criticism of that book, and – if you see it in the library – do look at it carefully: it could well be the best book for you.

2
Introduction

What are statistics?

Statistics are summaries or collections of numbers. If you say ‘the tallest person among my friends is 173 cm tall’, that is a statistic. It is a number based on a scrutiny of lots of different numbers – the different heights of all your friends, but reporting just the largest number.

If you say ‘the average height of my friends is 158 cm’ – then that is another statistic. It again requires you to have collected the different heights of all your friends, but this time your statistic, the average height, is a summary statistic. It has summarised all the data you collected into a single number.

If you have lots and lots and lots of friends, it may not be practical to measure them all, but you can probably get a good estimate of the average height by measuring not all of them but a large sample, and calculating the average of the sample. Now the average of your sample, particularly if it's not very big, may not be identical to the true average of all your friends. This brings us to a key principle of statistics. We are usually trying to evaluate a parameter (from the Greek for ‘beyond measurement’) by making an estimate from a sample of a practical size. So we must always distinguish parameters and estimates. One example: in statistics we use the word mean for the estimate (from a sample of numbers) of something we can rarely measure; the parameter we call the average (of the entire existing population of numbers).
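For readers who like to see an idea in action, here is a minimal Python sketch of the distinction between a parameter and an estimate. Everything in it is invented for illustration: the population of 500 heights is generated at random (assuming an average near 158 cm and a spread of 8 cm), and the names `population`, `sample`, and the seed value are arbitrary choices, not data from this book.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable (arbitrary choice)

# Hypothetical population: heights (cm) of 500 imaginary friends.
population = [random.gauss(158, 8) for _ in range(500)]

# The parameter: the average of the entire population,
# which in real life we can rarely measure.
population_average = statistics.fmean(population)

# The estimate: the mean of a sample of practical size.
sample = random.sample(population, 30)
sample_mean = statistics.fmean(sample)

print(f"population average (parameter): {population_average:.1f} cm")
print(f"sample mean (estimate):         {sample_mean:.1f} cm")
```

The two printed numbers will usually be close but not identical, which is exactly why the two must be kept apart. Changing the sample size from 30 to, say, 5 should typically make the estimate wander further from the parameter.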

Notation

‘Add together all the numbers in the sample, and divide by the number of numbers’. That's how we actually calculate a mean, isn't it? So even that very simple statistic takes 15 words to describe as a procedure. Things can get much more complicated (see Box 2.1).

We really have to find a shorthand way of expressing statistical computations, and this shorthand is called notation. The off‐putting thing about notation for biologists is that it tends to be algebraic in character. Also there is no universally accepted notation, and the variations between different textbooks are naturally pretty confusing to the beginner!

What is perhaps worse is a purely psychological problem for most biologists: your worry level has perhaps already risen at the very mention of algebra! Confront a biologist with an x instead of a number like 57 and there is a tendency to switch off the receptive centres of the brain altogether. Yet most statistical calculations involve nothing more terrifying than addition, subtraction, multiplication, and division – though I must admit you will also have to square numbers and find square roots. All this can now be done with the cheapest hand calculators.

Most of you now own or have access to a computer, where you only have to type the sampled numbers into a spreadsheet or other program and the machine has all the calculations that have to be done already programmed. So do computers remove the need to understand what their programs are doing? I don't think so! I discussed all this more fully in Chapter 1, but repeat it here in case you have skipped that chapter. Briefly, you need to know what programs are right for what sort of data, and what the limitations are. So an understanding of data analysis will enable you to plan more effective experiments. Remember that the computer will be quite content to process your figures perfectly inappropriately, if that is what you request! It may also be helpful to know how to interpret the final printout… correctly.

Back to the subject of notation. As I have just pointed out, we are going to be involved in quite unsophisticated number‐crunching and the whole point of notation is to remind us of the order in which we do this. Note I say ‘remind us’ and not ‘tell us’. It would be quite dangerous, when you start out, to think that notation enables you to do the calculations without previous experience. You just can't turn to p. 257 of a statistics book and expect to tackle something like:

equation

without sufficient homework on the notation for it to remind you of what you have previously learnt! Incidentally, Box 2.1 translates this algebraic notation into English for two columns of paired numbers (values of x and y).

As pointed out earlier, the formula above gives you the order of using the three components in the calculation, i.e. you add together the two components under the line and then divide the top component by this sum. But the formula doesn't tell you how to calculate the three components. For this you have to identify the three components in English as:

equation

These terms will probably mean nothing to you at this stage, but being able to calculate the sum of squares of a set of numbers is just about as common a procedure as calculating the mean.

We frequently have to cope with notation elsewhere in life. Recognise 03.11.92? It's a date, perhaps a ‘date of birth’. The Americans use a different notation; they would write the same birthday as 11.03.92. And do you recognise Cm? You probably do if you are a musician: it's notation for playing together the three notes C, Eb, and G – the chord of C minor (hence Cm).

In this book, the early chapters include notation to help you remember what statistics (i.e. summaries of data) such as sums of squares are and how they are calculated. However, as soon as possible, we will be using keywords such as sums of squares to replace blocks of algebraic notation. This should make the pages less frightening and make the text flow better. After all, you can always go back to an earlier chapter if you need reminding of the notation and the calculation it represents. I guess it's a bit like cooking! The first half‐dozen times you want to make pancakes you need the cookbook to provide the information that 300 ml milk goes with 125 g plain flour, one egg, and some salt, but the day soon comes when you merely think the word batter!

Notation for calculating the mean

No one is hopefully going to baulk at the challenge of calculating the mean height of a sample of five people, say 149, 176, 152, 180, and 146 cm, by totalling the numbers and dividing by 5.

In statistical notation, the instruction to total is ∑, and the whole series of numbers to total is called by a letter, often x or y. So ∑x means ‘add together all the numbers in the series called x’, the five heights of people in our example. We use n for the ‘number of numbers’, 5 in the example, making the full notation for a mean:

$$\frac{\sum x}{n}$$

However, we use the mean so often that it has another even shorter notation: the identifying letter for the series (e.g. x) with a line over the top, i.e. $\bar{x}$.
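As a check on the notation, the calculation for the five heights given above can be sketched in a few lines of Python. The variable names `x`, `sum_x`, `n`, and `x_bar` are illustrative choices that simply mirror the notation.

```python
# The series called x: the five heights (cm) from the example above.
x = [149, 176, 152, 180, 146]

sum_x = sum(x)      # the instruction "total": add together all the numbers in x
n = len(x)          # the "number of numbers"
x_bar = sum_x / n   # the mean: the total divided by n

print(sum_x, n, x_bar)  # prints: 803 5 160.6
```

Each line of the sketch corresponds to one piece of the notation, so the order of the steps is the order in which you would press the keys on a hand calculator.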