Cover
Preface to the second edition
Preface to the first edition
1 How to use this book
1. Introduction
2. The text of the chapters
3. What should you do if you run into trouble?
4. Elephants
5. The numerical examples in the text
6. Boxes
7. Spare‐time activities
8. Executive summaries
9. Why go to all that bother?
10. The bibliography
2 Introduction
1. What are statistics?
2. Notation
3. Notation for calculating the mean
3 Summarising variation
1. Introduction
2. Different summaries of variation
3. Why n − 1?
4. Why are the deviations squared?
5. The standard deviation
6. The next chapter
4 When are sums of squares NOT sums of squares?
1. Introduction
2. Calculating machines offer a quicker method of calculating the sum of squares
3. Avoid being confused by the term sum of squares
4. Summary of the calculator method for calculations as far as the standard deviation
5 The normal distribution
1. Introduction
2. Frequency distributions
3. The normal distribution
4. What percentage is a standard deviation worth?
5. Are the percentages always the same as these?
6. Other similar scales in everyday life
7. The standard deviation as an estimate of the frequency of a number occurring in a sample
8. From percentage to probability
9. EXECUTIVE SUMMARY 1 The standard deviation
6 The relevance of the normal distribution to biological data
1. To recap
2. Is our observed distribution normal?
3. What can we do about a distribution that clearly is not normal?
4. How many samples are needed?
7 Further calculations from the normal distribution
1. Introduction
2. Is A bigger than B?
3. The yardstick for deciding
4. Derivation of the standard error of a difference between two means
5. The importance of the standard error of differences between means
6. Summary of this chapter
7. EXECUTIVE SUMMARY 2 Standard error of a difference between two means
8 The t‐test
1. Introduction
2. The principle of the t‐test
3. The t‐test in statistical terms
4. Why t?
5. Tables of the t‐distribution
6. The standard t‐test
7. t‐test for means associated with unequal variances
8. The paired t‐test
9. EXECUTIVE SUMMARY 3 The t‐test
9 One tail or two?
1. Introduction
2. Why is the analysis of variance F‐test one‐tailed?
3. The two‐tailed F‐test
4. How many tails has the t‐test?
5. The final conclusion on number of tails
10 Analysis of variance (ANOVA): what is it? How does it work?
1. Introduction
2. Sums of squares in ANOVA
3. Some ‘made‐up’ variation to analyse by ANOVA
4. The sum of squares table
5. Using ANOVA to sort out the variation in Table C
6. The relationship between t and F
7. Constraints on ANOVA
8. Comparison between treatment means in ANOVA
9. The least significant difference
10. A caveat about using the LSD
11. EXECUTIVE SUMMARY 4 The principle of ANOVA
11 Experimental designs for analysis of variance (ANOVA)
1. Introduction
2. Fully randomised
3. Randomised blocks
4. Incomplete blocks
5. Latin square
6. Split plot
7. Types of analysis of variance
8. EXECUTIVE SUMMARY 5 Analysis of a one‐way randomised block experiment
12 Introduction to factorial experiments
1. What is a factorial experiment?
2. Interaction: what does it mean biologically?
3. How does a factorial experiment change the form of the analysis of variance?
4. Sums of squares for interactions
13 2‐Factor factorial experiments
1. Introduction
2. An example of a 2‐factor experiment
3. Analysis of the 2‐factor experiment
4. Two important things to remember about factorials before tackling the next chapter
5. Analysis of factorial experiments with unequal replication
6. EXECUTIVE SUMMARY 6 Analysis of a 2‐factor randomised block experiment
14 Factorial experiments with more than two factors – leave this out if you wish!
1. Introduction
2. Different ‘orders’ of interaction
3. Example of a 4‐factor experiment
15 Factorial experiments with split plots
1. Introduction
2. Deriving the split plot design from the randomised block design
3. Degrees of freedom in a split plot analysis
4. Numerical example of a split plot experiment and its analysis
5. Comparison of split plot and randomised block experiments
6. Uses of split plot designs
16 The t‐test in the analysis of variance
1. Introduction
2. Brief recap of relevant earlier sections of this book
3. Least significant difference test
4. Multiple range tests
5. Testing differences between means
6. Presentation of the results of tests of differences between means
7. The results of the experiments analysed by analysis of variance in Chapters 11–15
8. Some final advice
17 Linear regression and correlation
1. Introduction
2. Cause and effect
3. Other traps waiting for you to fall into
4. Regression
5. Independent and dependent variables
6. The regression coefficient (b)
7. Calculating the regression coefficient (b)
8. The regression equation
9. A worked example on some real data
10. Correlation
11. Extensions of regression analysis
12. EXECUTIVE SUMMARY 7 Linear regression
18 Analysis of covariance (ANCOVA)
1. Introduction
2. A worked example of ANCOVA
3. EXECUTIVE SUMMARY 8 Analysis of covariance (ANCOVA)
19 Chi‐square tests
1. Introduction
2. When not and where not to use χ²
3. The problem of low frequencies
4. Yates' correction for continuity
5. The χ² test for goodness of fit
6. Association (or contingency) χ²
20 Nonparametric methods (what are they?)
1. Disclaimer
2. Introduction
3. Advantages and disadvantages of parametric and nonparametric methods
4. Some ways data are organised for nonparametric tests
5. The main nonparametric methods that are available
Appendix A: How many replicates?
1. Introduction
2. Underlying concepts
3. ‘Cheap and cheerful’ calculation of number of replicates needed
4. More accurate calculation of number of replicates needed
5. How to prove a negative
Appendix B: Statistical tables
Appendix C: Solutions to spare‐time activities
1. Chapter 3
2. Chapter 4
3. Chapter 7
4. Chapter 8
5. Chapter 11
6. Chapter 13
7. Chapter 14
8. Chapter 15
9. Chapter 16
10. Chapter 17
11. Chapter 18
12. Chapter 19
13. The Clues (See ‘Spare‐time Activity’ to Chapter , p. 261)
Appendix D: Bibliography
1. Introduction
2. The Internet
Index
End User License Agreement

List of Illustrations

Chapter 3
1. Fig. 3.1 An unusual ruler! The familiar scale increasing left to right has...
Chapter 5
1. Fig. 5.1 An example of the normal distribution: the frequencies in which 1...
2. Fig. 5.2 The normal distribution curve of Figure 5.1 with the egg weight s...
3. Fig. 5.3 Demonstration that the areas under the normal curve containing th...
4. Fig. 5.4 The unifying concept of the scale of standard deviations for two ...
5. Fig. 5.5 A scale of percentage observations included by different standard...
Chapter 6
1. Fig. 6.1 The frequency of occurrence of larvae of the frit fly (Oscinella...
2. Fig. 6.2 Results of calculating the standard deviation of repeat samples o...
3. Fig. 6.3 The frequency distribution of frit fly larvae (see Figure 6.1) wi...
Chapter 7
1. Fig. 7.1 The unifying concept of the scale of standard deviations/errors f...
2. Fig. 7.2 Figure 7.1 redrawn so that the same scale is used for weight, whe...
3. Fig. 7.3 Random sampling of the same population of numbers gives a much sm...
4. Fig. 7.4 Differences between two means illustrated, with the notation for ...
5. Fig. 7.5 Schematic representation of how the standard error of difference ...
6. Fig. 7.6 Figure 7.1 of the normal curve for egg weights (showing the propo...
Chapter 8
1. Fig. 8.1 Improved estimate of true variance (dotted horizontal lines) with...
2. Fig. 8.2 Confidence limits of 95% for observations of the most variable di...
3. Fig. 8.3 Summary of the procedure involved in the basic t‐test.
Chapter 9
1. Fig. 9.1 Areas under the normal curve appropriate to probabilities of 20%...
Chapter 10
1. Fig. 10.1 F values for four treatments (n − 1 = 3), graphed from tables of...
2. Fig. 10.2 The comparison of only seven numbers already involves 21 tests!...
Chapter 11
1. Fig. 11.1 Fully randomised design: an example of how 4 repeats of 3 treat...
2. Fig. 11.2 Fully randomised design: the layout of Figure 11.1 with data add...
3. Fig. 11.3 Randomised block design: a possible randomisation of three treat...
4. Fig. 11.4 Randomised block design: how not to do it! Four blocks of 10 tre...
5. Fig. 11.5 Randomised block design: the layout of Figure 11.3 with data add...
6. Fig. 11.6 Incomplete randomised block design: a possible layout of seven t...
7. Fig. 11.7 Lattice design of incomplete blocks: here pairs in a block (e.g....
8. Fig. 11.8 Latin square design: (a) a possible layout for four treatments; ...
9. Fig. 11.9 Latin square design: the layout of Figure 11.8a with data added....
10. Fig. 11.10 Latin square designs: a possible randomisation of four treatmen...
Chapter 12
1. Fig. 12.1 The same six factorial combinations of two factors laid out as ...
2. Fig. 12.2 Time in seconds taken by a walker, a bike, and a car to cover 50...
Chapter 13
1. Fig. 13.1 A 2‐factor randomised block experiment with three blocks applyi...
Chapter 15
1. Fig. 15.1 Relationship between a randomised and a split plot design: (a) ...
2. Fig. 15.2 Yield of sprouts (kg/plot) from the experimental design shown in...
Chapter 17
1. Fig. 17.1 Relationship between population density in an area and the annu...
2. Fig. 17.2 Decline in deaths per 1000 people from malaria following the int...
3. Fig. 17.3 (a) The plot of increasing per cent kill of insects against incr...
4. Fig. 17.4 The triumph of statistics over common sense. Each treatment mean...
5. Fig. 17.5 (a) The data points would seem to warrant the conclusion that la...
6. Fig. 17.6 Data which fail to meet the criterion for regression or correlat...
7. Fig. 17.7 The importance of allocating the related variables to the correc...
8. Fig. 17.8 The characteristics of the regression (b) and correlation (r) co...
9. Fig. 17.9 Diagrammatic representation of a regression line as a plank laid...
10. Fig. 17.10 (a) Six data points for which one might wish to calculate a reg...
11. Fig. 17.11 The distances of each point in Figure 17.10b from the mean of t...
12. Fig. 17.12 Eight points each with a b of + or − 2. Including all points in...
13. Fig. 17.13 The calculated regression line added to Figure 17.10a.
14. Fig. 17.14 (a) A regression line has an intercept (a) value on the vertica...
15. Fig. 17.15 The calculated regression line for the data points relating the...
16. Fig. 17.16 The distances for which sums of squares are calculated in the r...
17. Fig. 17.17 (a) The correlation between length (y) and width (x) of Azalea ...
18. Fig. 17.18 Examples of nonlinear regression lines together with the form o...
Chapter 18
1. Fig. 18.1 The relationship between regression analysis and analysis of co...
2. Fig. 18.2 Regression lines for cholesterol level on age relevant to each d...
3. Fig. 18.3 (a) The regression of cholesterol level on age. (b) Another data...
4. Fig. 18.4 An enlarged portion of Figure 18.3 showing the effect of adjusti...
5. Fig. 18.5 The relationship between regression analysis and analysis of cov...

Copyright

This second edition first published 2019

Edition History

Blackwell. (1e 2008)

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Helmut F. van Emden to be identified as the author of this work has been asserted in accordance with law.

Registered Office

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

Editorial Office

9600 Garsington Road, Oxford, OX4 2DQ, UK

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty

While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging‐in‐Publication Data

Names: van Emden, H. F. (Helmut Fritz), author.

Title: Statistics for terrified biologists / Helmut F. van Emden (emeritus professor of horticulture, School of Biological Sciences, The University of Reading, UK).

Description: 2nd edition. | Hoboken, NJ : Wiley, 2019. | Includes bibliographical references and index. |

Identifiers: LCCN 2019001708 (print) | LCCN 2019002653 (ebook) | ISBN 9781119563693 (Adobe PDF) | ISBN 9781119563686 (ePub) | ISBN 9781119563679 (pbk.)

Subjects: LCSH: Biometry–Textbooks. | Statistics–Textbooks.

Classification: LCC QH323.5 (ebook) | LCC QH323.5 .V33 2019 (print) | DDC 570.1/5195–dc23

LC record available at https://lccn.loc.gov/2019001708

Cover Design: Wiley

Cover Image: © gulfu photography/Getty Images

Preface to the second edition

I have been astounded by the positive reception the first edition of my little book has received. It has clearly filled a need; students from many different countries have emailed me just to say ‘Thank you for writing the book.’ Student reviews on Amazon have given my book high praise, and colleagues teaching statistics to biologists have also found it of value. One review complained about the ‘lack of maths’. I take this as a compliment, and the review arguing that the availability of computer packages makes it unnecessary to be able to carry out the calculations ‘long‐hand’ seems to me to completely miss the point!

But I have to hang my head in shame at the numerous errors there were in the first printing of 2008. Particularly the numerical examples were littered with computational errors (not infrequently in the form of dyslexic transpositions). The book went through several iterations before going to press, and I really should have rechecked all the calculations in the final proof. I can only apologise. However, these many errors led to a very encouraging outcome. The errors were largely identified by users, who then contacted me convinced they had found an error. The statistics teaching I received never gave me similar confidence. That students were sure they were right and that I had made an error is almost the best evidence of the success of my book that I could have asked for.

So what's new in this second edition? The statistics that my book tries to teach are the concepts elaborated by R.A. Fisher in the 1930s, and today these remain the basis of the statistical procedures based on the normal distribution. I can tweak the English, change sections to make them more easily understood, and add some further extensions of Fisher's concepts – and all this I have done. I have been surprised how I, as a reader of the text of the first edition, have found sentences completely incomprehensible that were written by me as author 10 years ago! I have used the opportunity of the second edition to try to clarify these passages for both you and me! I have greatly revised and, I hope, simplified the chapter on linear regression and correlation and added some material to the chapter on chi‐square tests. However, the most significant addition has been that analysis of covariance, just briefly mentioned in the first edition, has been allocated a chapter of its own. It deserves emphasis as a valuable statistical technique, and I find most textbooks shroud it in mystery. Although I would advise you to use computer programs if you want to use analysis of covariance, I thought it a good idea to demystify it as much as I can. The calculation of numbers in analysis of covariance is usually presented as having no connection with more familiar calculations, yet they come from the standard techniques of analysis of variance and regression. The point is that only some of the resulting values are then used while the rest are discarded, but doing the analyses that include these redundant numbers does make the whole thing much more comprehensible.

I can only hope that this second edition proves as popular as the first.

Helmut F. van Emden

July 2018

Preface to the first edition

I have written/edited several books on my own speciality, agricultural and horticultural entomology, but always at the request of a publisher or colleagues. This book is different in two ways. Firstly, it is a book I have positively wanted to write for many years and secondly, I am stepping out of my ‘comfort zone’ in doing so.

The origins of this book stem from my appointment to the Horticulture Department of Reading University under Professor O. V. S. Heath, FRS. Professor Heath appreciated the importance of statistics and, at a time when there was no University‐wide provision of statistics teaching, he taught the final year students himself. I became the ‘assistant’ whose role it was to run the practical exercises which followed the Professor's lecture. You cannot teach what you do not understand yourself, but I tried nonetheless.

Eventually I took over the entire course. By then it was taught in the second year and in the third year the students went on to take a Faculty‐wide course. I did not continue the lectures; the whole course was in the laboratory where I led the students (using pocket calculators) through the calculations in stages. The laboratory class environment involved continuous interaction with students in a way totally different from what happens in lectures, and it rapidly became clear to me that many biologists have their neurons wired up in a way that makes the traditional way of teaching statistics rather difficult for them.

What my students needed was confidence – confidence that statistical ideas and methods were not just theory, but actually worked with real biological data and, above all, had some basis in logic! As the years of teaching went on, I began to realise that the students regularly found the same steps a barrier to progress and a damage to their confidence. Year after year I tried new ways to help them over these ‘crisis points’; eventually I succeeded with all of them, I am told.

The efficacy of my unusual teaching aids can actually be quantified. After the Faculty course taught by professional statisticians, my students were formally examined together with cohorts of students from other departments in the Faculty (then of ‘Agriculture and Food’) who had attended the same third year course in Applied Statistics. My students mostly (perhaps all but three per year out of some 20) appeared in the mark list as a distinct block right at the upper end, with percentage marks in the 70s, 80s and even 90s. Although there may have also been one or two from other courses with high marks, there then tended to be a gap till marks in the lower 60s appeared and began a continuum down to single figures.

I therefore feel confident that this book will be helpful to biologists with its mnemonics such as SqADS and ‘you go along the corridor before you go upstairs’. Other things previously unheard of are the ‘lead line’ and ‘supertotals’ with their ‘subscripts’ – yet all have been appreciated as most helpful by my students over the years. A riffle through the pages will amaze – where are the equations and algebraic symbols? They have to a large extent been replaced by numbers and words. The biologists I taught – and I don't think they were atypical – could work out what to do with a ‘45’, but rarely what to do with an ‘x’. Also, I have found that there are a number of statistical principles students easily forget, and then inevitably run into trouble with their calculations. These concepts are marked with the symbol of a small elephant.

The book limits itself to the basic foundations of parametric statistics, the t‐test, analysis of variance, linear regression and chi‐square. However, the reader is guided as to where there are important extensions of these techniques, and there is an introduction to non‐parametric tests which includes a check list of non‐parametric methods linked to their parametric counterparts. Many chapters end with an ‘executive summary’ as a quick source for revision, and there are additional exercises to give the practice which is so essential to learning.

In order to minimise algebra, the calculations are explained with numerical examples. These, as well as the ‘spare‐time activity’ exercises have come from many sources, and I regret the origin of many has become lost in the mists of time. Quite a number come from experiments carried out by Horticulture students at Reading as part of their second year outdoor practicals, and others have been totally fabricated in order to ‘work out’ well. Others have had numbers or treatments changed better to fit what was needed. I can only apologise to anyone whose data I have used without due acknowledgement; failure to do so is not intentional. But please remember that data have often been fabricated or massaged – therefore do not rely on the results as scientific evidence for what they appear to show!

Today, computer programs take most of the sweat out of statistical procedures, and most biologists have access to professional statisticians. ‘Why bother to learn basic statistics?’ is therefore a perfectly fair question, akin to ‘Why keep a dog and bark?’ The question deserves an answer; to save repetition, my answer can be found towards the end of Chapter 1.

I am immensely grateful to the generations of Reading students who have challenged me to overcome their ‘hang‐ups’ and who have therefore contributed substantially to any success this book achieves. Also many postgraduate students as well as experienced visiting overseas scientists have encouraged me to turn my course into book form. My love and special thanks go to my wife Gillian who, with her own experience of biological statistics, has supported and encouraged me in writing this book; it is to her that I owe the imaginative title for the book.

Finally, I should like to thank Ward Cooper of Blackwells for having faith in this biologist, who is less terrified of statistics than he once was.

Helmut F. van Emden

December 2006

1
How to use this book

Chapter features

Introduction
The text of the chapters
What should you do if you run into trouble?
Elephants
The numerical examples in the text
Boxes
Spare‐time activities
Executive summaries
Why go to all that bother?
The bibliography

Introduction

Don't be misled! This book cannot replace effort on your part. All it can aspire to do is to make that effort effective. The detective thriller only succeeds because you have read it too fast and not really concentrated – with that approach, you'll find this book just as mysterious.

In fact, you may not get very far if you just read this book at any speed! You will only succeed if you interact with the text, and how you might do this is the topic of most of this chapter.

The text of the chapters

The chapters, particularly 2–8, develop a train of thought essential to the subject of analysing biological data. You just have to take these chapters in order and quite slowly. There is only one way I know for you to maintain the concentration necessary to comprehension, and that is for you to make your own summary notes as you go along.

My Head of Department when I first joined the staff at Reading used to define a university lecture as ‘a technique for transferring information from a piece of paper in front of the lecturer to a piece of paper in front of the student, without passing through the heads of either’. That's why I stress making your own summary notes. You will retain very little by just reading the text; you'll find that after a while you've been thinking about something totally different while ‘reading’ several pages – we've all been there! The message you should take from my Head of Department's quote above is that just repeating in your writing what you are reading is little better than taking no notes at all: the secret is to digest what you have read and reproduce it in your own words and in summary form. Use plenty of headings and subheadings, boxes linked by arrows, cartoon drawings, etc. Another suggestion is to use different coloured pens for different recurring statistics, such as variance and correction factor. In fact, use anything that forces you to convert my text into as different a form as possible from the original; that will force you to concentrate, to involve your brain and to make it clear to you whether or not you have really understood that bit in the book so that it is safe to move on.

The actual process of making the notes is the critical step – you can throw the notes away at a later stage if you wish, though there's no harm in keeping them for a time for revision and reference.

So DON'T MOVE ON until you are ready. You'll only undo the value of previous effort if you persuade yourself that you are ready to move on when in your heart of hearts you know you are fooling yourself!

A key point in the book is Figure 7.5 on p. 64. Take real care to lay an especially good foundation up to there. If you really feel at home with this diagram, it is a sure sign that you have conquered any hang‐ups and are no longer a ‘terrified biologist’.

What should you do if you run into trouble?

The obvious first step is to go back to the point in the book where you last felt confident, and start again from there.

However, it often helps to see how someone else has explained the same topic, so it's a good idea to have a look at the relevant pages of a different statistics text (see Appendix D for some suggestions). You could also look up the topic on the Internet, where many statisticians have put articles and their lectures to students.

A third possibility is to see if someone can explain things to you face to face. Do you know or have access to someone who might be able to help? If you are at university, it could be a fellow student or even one of the staff. The person who tried to teach statistics to my class at university failed completely as far as I was concerned, but later on I found he could explain things to me quite brilliantly in a one‐to‐one situation.

Elephants

At certain points in the text you will find the sign of the elephant.

They say ‘elephants never forget’ and the symbol means just that: NEVER FORGET! I have used it to mark some key statistical concepts which, in my experience, people easily forget and as a result run into trouble later on and find it hard to see where they have gone wrong. So, take it from me that it is really well worth making sure these matters are firmly embedded in your memory.

The numerical examples in the text

As stated in the Preface to the First Edition, I soon learnt that biologists don't like x. For some reason they prefer a real number but are more prepared to accept, say, 45 as representing any number than they are an x! Therefore, in order to avoid ‘algebra’ as far as possible, I have used actual numbers to illustrate the working of statistical analyses and tests. You probably won't gain a lot by keeping up with me on a hand calculator as I describe the different steps of a calculation, but you should make sure at each step that you understand where each number in a calculation has come from and why it has been included in that way.

When you reach the end of each worked analysis or test, however, you should go back to the original source of the data in the book and try to rework on a hand calculator the calculations which follow from just those original data. Try not to look up later stages in the calculations unless you are irrevocably stuck, and then use the executive summary (if there is one at the end of the chapter) rather than the main text.

Boxes

There will be a lot of individual variation among readers of this book in the knowledge and experience of statistics they have gained in the past, and in their ability to grasp and retain statistical concepts. At certain points, therefore, some will be happy to move on without any further explanation from me or any further repetition of calculation procedures.

For those less happy to take things for granted at such points, I have placed the material and calculations they are likely to find helpful in boxes in order not to hold back or irritate the others. Calculations in the boxes may prove particularly helpful if, as suggested above, you are reworking a numerical example from the text and need to refer to a box to find out why you are stuck or perhaps where you went wrong.

Spare‐time activities

These are numerical exercises you should be equipped to complete by the time you reach them at the end of several of the chapters.

That is the time to stop and do them. Unlike the within‐chapter numerical examples, you should feel quite free to use any material in previous chapters or executive summaries to remind you of the procedures involved and guide you through them. Use a hand calculator and remember to write down the results of intermediate calculations. This will make it much easier for you to detect where you went wrong if your answers do not match the solution to that exercise given in Appendix C. Do read the beginning of that appendix early on: it explains that you should not worry or waste time recalculating if your numbers are similar, even if they are not identical. I can assure you, you will recognise – when you compare your figures with the ‘solution’ – if you have followed the statistical steps of the exercise correctly; you will also immediately recognise if you have not.

Doing these exercises conscientiously with a hand calculator or spreadsheet, and when you reach them in the book rather than much later, is really important. They are the best things in the book for impressing the subject into your long‐term memory and for giving you confidence that you understand what you are doing.

The authors of most other statistics books recognise this and also include exercises. If you're willing, I would encourage you to gain more confidence and experience by going on to try the methods as described in this book on their exercises.

By the way, a blank spreadsheet such as Excel makes a grand substitute for a hand calculator, with the added advantage that repeat calculations (e.g. squaring numbers) can be copied and pasted from the first number to all the others.
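
For readers who would rather script than use a spreadsheet, the same 'do it once, then copy it down the column' idea can be sketched in a few lines of Python. This is only an illustrative sketch and not a method taken from this book: the five observations are invented, and the names `observations`, `squares` and `square_column` are hypothetical. It squares every number in a column and prints each intermediate result, in the spirit of writing down intermediate calculations so that a slip is easy to trace.

```python
# Illustrative sketch only: the observations below are invented numbers,
# and the function name is hypothetical (it does not come from the book).

def square_column(values):
    """Return a new list holding the square of each value, in order."""
    return [v * v for v in values]


observations = [4, 7, 5, 9, 6]          # made-up data, as if typed into column A
squares = square_column(observations)   # column B: the same calculation copied down

# Print both columns side by side, like two spreadsheet columns, so that
# every intermediate result is written down and can be checked later.
for value, squared in zip(observations, squares):
    print(f"{value:>5} {squared:>7}")

# Column totals, as a spreadsheet SUM at the foot of each column would give.
total = sum(observations)
total_of_squares = sum(squares)
print(f"{total:>5} {total_of_squares:>7}   (column totals)")
```

Whichever tool you use, the point is the same: keep the intermediate column visible rather than jumping straight to a final figure.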

Executive summaries

Certain chapters end with such a summary, which aims to condense the meat of the chapter into little over a page or so. The summaries provide a condensed reference source for the calculations scattered throughout the previous chapter, with hopefully enough explanatory wording to jog your memory about how the calculations were derived. They will therefore prove useful when you tackle the spare‐time activities.

Why go to all that bother?

You might ask (and some of the reviews of the first edition did): why teach how to do statistical analyses on a hand calculator when we can type the data into a computer program and get all the calculations done automatically? It might have been useful once, but now…?

Well, I can assure you that you wouldn't ask that question if you had examined as many project reports and theses as I have, and seen the consequences of just ‘typing data into a computer program’. No, it does help to avoid trouble if you understand what the computer should be doing.

Why do I say that?

Planning experiments is made much more effective if you understand the advantages and disadvantages of different experimental designs and how they affect the experimental error against which we test our differences between treatments. It probably won't mean much to you now, but you really do need to understand how experimental design as well as treatment and replicate numbers impact the residual degrees of freedom and whether you should be looking at one‐tailed or two‐tailed statistical tables. My advice to my students has always been that, before embarking on an experiment, they should draw up a blank form on which to enter the results, then invent some results and complete the appropriate analysis on them. It can often cause you to think again.
Although the computer can carry out your calculations for you, it has the terminal drawback that it will accept the numbers you type in without challenging you as to whether what you are asking it to do with them is sensible. Thus – and again at this stage you'll have to accept my word that these are critical issues – no window will appear on the screen that says: ‘Whoa! You should be analysing these numbers non‐parametrically’ or ‘No problem. I can do an ordinary factorial analysis of variance, but you seem to have forgotten you actually used a split‐plot design’ or ‘These numbers are clearly pairs; why don't you exploit the advantages of pairing in the t‐test that you've told me to do?’ or ‘I'm surprised you are asking for the statistics for drawing a straight line through the points on this obvious hollow curve.’ I could go on.
You will no doubt use computer programs rather than a hand calculator for your statistical calculations in the future. But the printouts from these programs are often not particularly user‐friendly. They usually assume some knowledge of the internal structure of the analysis the computer has carried out, and the numbers printed out are identified only by abbreviations. So obviously an understanding of what your computer program is doing and familiarity with statistical terminology can only be a help.
A really important value you will gain from this book is confidence that statistical methods are not a ‘black box’ somewhere inside a computer, but that you could in extremis (and with this book at your side) carry out the analyses and tests on the back of an envelope with a hand calculator. Also, once you have become content that the methods covered in this book are concepts you understand, you will probably be happier using the relevant computer programs.
More than that, you will probably be happy to expand the methods you use to ones I have not covered, on the basis that they are likely also to be logical, sensible, and understandable routes to passing satisfactory judgements on biological data. Expansions of the methods included in this book (e.g. those mentioned at the end of Chapter 17) will require you to use numbers produced by the calculations I have covered. You should be able confidently to identify which these are.
You will probably find yourself discussing your proposed experiment and later the appropriate analysis with a professional statistician. It does so help to speak the same language! Additionally, the statistician will be of much more help to you if you are competent to spot where the statistical advice given has overlooked a constraint arising from biological realities.
Finally, there is the intellectual satisfaction of mastering a subject which can come hard to biologists. Unfortunately, you won't appreciate it was worth doing until you view the effort from the hindsight of having succeeded. I assure you the reward is real. I can still remember vividly the occasion many years ago when, in the middle of teaching an undergraduate statistics class, I realised how simple the basic idea behind the analysis of variance was, and how this extraordinary simplicity was only obfuscated for a biologist by the short‐cut calculation methods used. In other words, I was in a position to write Chapter 10. Later, the gulf between most biologists and trained statisticians was really brought home to me by one of the latter's comments on an early version of this book: ‘I suggest Chapter 10 should be deleted; it's not the way we do it.’ I rest my case!

The bibliography

Right at the back of this book is a short list of other statistics books. Very many such books have been written, and I only have personal experience of a small selection. Some of these I have found particularly helpful, either to increase my comprehension of statistics (much needed at times!) or to find details of and recipes for more advanced statistical methods. I must emphasise that I have not even seen a majority of the books that have been published and that the ones that have helped me most may not be the ones that would be of most help to you. Omission of a title from my list implies absolutely no criticism of that book, and – if you see it in the library – do look at it carefully: it could well be the best book for you.

2
Introduction

Chapter features

What are statistics?
Notation
Notation for calculating the mean

What are statistics?

Statistics are summaries or collections of numbers. If you say ‘the tallest person among my friends is 173 cm tall’, that is a statistic. It is a number based on a scrutiny of lots of different numbers – the different heights of all your friends, but reporting just the largest number.

If you say ‘the average height of my friends is 158 cm’ – then that is another statistic. It again requires you to have collected the different heights of all your friends, but this time your statistic, the average height, is a summary statistic. It has summarised all the data you collected into a single number.

If you have lots and lots and lots of friends, it may not be practical to measure them all, but you can probably get a good estimate of the average height by measuring not all of them but a large sample, and calculating the average of the sample. Now the average of your sample, particularly if it's not very big, may not be identical to the true average of all your friends. This brings us to a key principle of statistics. We are usually trying to evaluate a parameter (from the Greek for ‘beyond measurement’) by making an estimate from a sample of a practical size. So we must always distinguish parameters and estimates. One example: in statistics we use the word mean for the estimate (from a sample of numbers) of something we can rarely measure; the parameter we call the average (of the entire existing population of numbers).
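The difference between a parameter and an estimate can be tried out on a computer. The Python sketch below is only an illustration: the 200 heights are invented numbers, the names (population, sample, mean) are hypothetical, and the sample size of 20 is an arbitrary choice.

```python
import random

# Invented data: heights (cm) of 200 imaginary friends. These numbers are
# made up for illustration and do not come from the text.
rng = random.Random(1)  # fixed seed so that the run is repeatable
population = [round(rng.gauss(160, 10)) for _ in range(200)]


def mean(numbers):
    """Add together all the numbers and divide by the number of numbers."""
    return sum(numbers) / len(numbers)


# The parameter: the true average of the entire population of numbers.
parameter = mean(population)

# The estimate: the mean of a sample of a practical size.
sample = rng.sample(population, 20)
estimate = mean(sample)

# A statistic that reports just the largest number.
tallest = max(population)

print(f"parameter (average of all 200 heights): {parameter:.1f} cm")
print(f"estimate (mean of a sample of 20):      {estimate:.1f} cm")
print(f"tallest friend:                         {tallest} cm")
```

Changing the seed or the sample size and running it again shows the estimate moving around while the parameter stays put.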

Notation

‘Add together all the numbers in the sample, and divide by the number of numbers’. That's how we actually calculate a mean, isn't it? So even that very simple statistic takes 15 words to describe as a procedure. Things can get much more complicated (see Box 2.1).

We really have to find a shorthand way of expressing statistical computations, and this shorthand is called notation. The off‐putting thing about notation for biologists is that it tends to be algebraic in character. Also there is no universally accepted notation, and the variations between different textbooks are naturally pretty confusing to the beginner!

What is perhaps worse is a purely psychological problem for most biologists: your worry level has perhaps already risen at the very mention of algebra! Confront a biologist with an x instead of a number like 57 and there is a tendency to switch off the receptive centres of the brain altogether. Yet most statistical calculations involve nothing more terrifying than addition, subtraction, multiplication, and division – though I must admit you will also have to square numbers and find square roots. All this can now be done with the cheapest hand calculators.

Most of you now own or have access to a computer, where you only have to type the sampled numbers into a spreadsheet or other program and the machine has all the calculations that have to be done already programmed. So do computers remove the need to understand what their programs are doing? I don't think so! I discussed all this more fully in Chapter 1, but repeat it here in case you have skipped that chapter. Briefly, you need to know what programs are right for what sort of data, and what the limitations are. So an understanding of data analysis will enable you to plan more effective experiments. Remember that the computer will be quite content to process your figures perfectly inappropriately, if that is what you request! It may also be helpful to know how to interpret the final printout… correctly.

Back to the subject of notation. As I have just pointed out, we are going to be involved in quite unsophisticated number‐crunching and the whole point of notation is to remind us of the order in which we do this. Note I say ‘remind us’ and not ‘tell us’. It would be quite dangerous, when you start out, to think that notation enables you to do the calculations without previous experience. You just can't turn to p. 257 of a statistics book and expect to tackle something like:

without sufficient homework on the notation for it to remind you of what you have previously learnt! Incidentally, Box 2.1 translates this algebraic notation into English for two columns of paired numbers (values of x and y).

As pointed out earlier, the formula above gives you the order of using the three components in the calculation, i.e. you add together the two components under the line and then divide the top component by this sum. But the formula doesn't tell you how to calculate the three components. For this you have to identify the three components in English as:

These terms will probably mean nothing to you at this stage, but being able to calculate the sum of squares of a set of numbers is just about as common a procedure as calculating the mean.

We frequently have to cope with notation elsewhere in life. Recognise 03.11.92? It's a date, perhaps a ‘date of birth’. The Americans use a different notation; they would write the same birthday as 11.03.92. And do you recognise Cm? You probably do if you are a musician: it's notation for playing together the three notes C, E♭, and G – the chord of C minor (hence Cm).
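The date example shows how the same characters mean different things under different notations. The short Python sketch below uses the standard datetime module. The format strings are my own rendering of the two conventions, and it assumes the two‐digit year 92 is read as 1992.

```python
from datetime import datetime

# Day-first reading of the date given in the text.
day_first = datetime.strptime("03.11.92", "%d.%m.%y")

# Month-first (American) reading of what the text gives as the same birthday.
month_first = datetime.strptime("11.03.92", "%m.%d.%y")

# Both notations describe the same day.
print(day_first.date(), month_first.date())

# Reading the first string with the wrong notation raises no error;
# it silently gives a different day.
misread = datetime.strptime("03.11.92", "%m.%d.%y")
print(misread.date())
```

The notation only works as a reminder if you already know which convention is in force.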

In this book, the early chapters include notation to help you remember what statistics (i.e. summaries of data) such as sums of squares are and how they are calculated. However, as soon as possible, we will be using keywords such as sums of squares to replace blocks of algebraic notation. This should make the pages less frightening and make the text flow better. After all, you can always go back to an earlier chapter if you need reminding of the notation and the calculation it represents. I guess it's a bit like cooking! The first half‐dozen times you want to make pancakes you need the cookbook to provide the information that 300 ml milk goes with 125 g plain flour, one egg, and some salt, but the day soon comes when you merely think the word batter!

Notation for calculating the mean

No one is hopefully going to baulk at the challenge of calculating the mean height of a sample of five people, say 149, 176, 152, 180, and 146 cm, by totalling the numbers and dividing by 5.

In statistical notation, the instruction to total is ∑, and the whole series of numbers to total is called by a letter, often x or y. So ∑x means ‘add together all the numbers in the series called x’, the five heights of people in our example. We use n for the ‘number of numbers’, 5 in the example, making the full notation for a mean:

$$\frac{\sum x}{n}$$

However, we use the mean so often that it has another even shorter notation: the identifying letter for the series (e.g. x) with a line over the top, i.e. x̄.
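For anyone who would like to see the notation acted out, here is a small Python sketch using the five heights from the example. The variable names sum_x and x_bar are my own labels for ∑x and for the mean (x with a line over the top).

```python
# The five heights (cm) from the example in the text.
x = [149, 176, 152, 180, 146]

n = len(x)         # n: the 'number of numbers'
sum_x = sum(x)     # sum of x: add together all the numbers in the series x
x_bar = sum_x / n  # the mean: the total divided by n

print(sum_x, n, x_bar)  # prints: 803 5 160.6
```

The total is 803 cm, so the mean height of the five people is 160.6 cm.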