

Medical Statistics at a Glance

Fourth Edition


Aviva Petrie

Honorary Associate Professor of Biostatistics

UCL Eastman Dental Institute

London, UK



Caroline Sabin

Professor of Medical Statistics and Epidemiology

Institute for Global Health

UCL

London, UK




Preface

Medical Statistics at a Glance is directed at undergraduate medical students, medical researchers, postgraduates in the biomedical disciplines and at pharmaceutical industry personnel. All of these individuals will, at some time in their professional lives, be faced with quantitative results (their own or those of others) which will need to be critically evaluated and interpreted, and some, of course, will have to pass that dreaded statistics exam! A proper understanding of statistical concepts and methodology is invaluable for these needs. Much as we should like to fire the reader with an enthusiasm for the subject of statistics, we are pragmatic. Our aim in this new edition, as it was in the earlier editions, is to provide the student and the researcher, as well as the clinician encountering statistical concepts in the medical literature, with a book which is sound, easy to read, comprehensive, relevant, and of useful practical application.

We believe Medical Statistics at a Glance will be particularly helpful as an adjunct to statistics lectures and as a reference guide. The structure of this fourth edition is the same as that of the first three editions. In line with other books in the At a Glance series, we lead the reader through a number of self-contained two-, three- or occasionally four-page chapters, each covering a different aspect of medical statistics. There is extensive cross-referencing throughout the text to help the reader link the various procedures. We have learned from our own teaching experiences and have taken account of the difficulties that our students have encountered when studying medical statistics. For this reason, we have chosen to limit the theoretical content of the book to a level that is sufficient for understanding the procedures involved, yet which does not overshadow the practicalities of their execution.

Medical statistics is a wide-ranging subject covering a large number of topics. We have provided a basic introduction to the underlying concepts of medical statistics and a guide to the most commonly used statistical procedures. Epidemiology, concerned with the distribution and determinants of disease in specified populations, is closely allied to medical statistics. Hence some of the main issues in epidemiology, relating to study design and interpretation, are discussed. Also included are chapters that the reader may find useful only occasionally, but which are, nevertheless, fundamental to many areas of medical research; for example, evidence-based medicine, systematic reviews and meta-analysis, survival analysis, Bayesian methods and the development of prognostic scores. We have explained the principles underlying these topics so that the reader will be able to understand and interpret the results from them when they are presented in the literature.

A basic set of statistical tables is contained in Appendix A. Neave, H.R. (1995) Elementary Statistical Tables, Routledge: London, and Diem, K. Lenter, C. and Seldrup (1981) Geigy Scientific Tables, 8th rev. and enl. edition, Basle: Ciba-Geigy, amongst others, provide fuller versions if the reader requires more precise results for hand calculations. We have included a new appendix, Appendix D, in this fourth edition. This appendix contains guidelines for randomized controlled trials (the CONSORT checklist and flow chart) and observational studies (the STROBE checklist). The CONSORT and STROBE checklists are produced by the EQUATOR Network, initiated with the objectives of providing resources and training for the reporting of health research. Guidelines for the presentation of study results are now available for many other types of study and we provide website addresses in a table in Appendix D for some of these designs. Appendix D also contains templates that we hope you will find useful when you critically appraise or evaluate the evidence in randomized controlled trials and observational studies. The use of these templates to critically appraise two published papers is demonstrated in our Medical Statistics at a Glance Workbook. Due to the inclusion of the new Appendix D, the labeling of the final two appendices differs from that of the third edition: Appendix E now contains the Glossary of terms with readily accessible explanations of commonly used terminology, and Appendix F provides cross-referencing of multiple choice and structured questions from Medical Statistics at a Glance Workbook.

The chapter titles of this fourth edition are identical to those of the third edition. Some of the first 46 chapters remain unaltered in this new edition and some have relatively minor changes which accommodate recent advances, cross-referencing or re-organization of the new material. In particular, where appropriate, we have provided references to the relevant EQUATOR guidelines.

As in the third edition, we provide a set of learning objectives for each chapter. Each set provides a framework for evaluating understanding and progress. If you are able to complete all the bulleted tasks in a chapter satisfactorily, you will have mastered the concepts in that chapter.

Most of the statistical techniques described in the book are accompanied by examples illustrating their use. We have replaced many of the older examples that were in previous editions by those that are commensurate with current clinical research. We have generally obtained the data for our examples from collaborative studies in which we or colleagues have been involved; in some instances, we have used real data from published papers. Where possible, we have used the same data set in more than one chapter to reflect the reality of data analysis, which is rarely restricted to a single technique or approach. Although we believe that formulae should be provided and the logic of the approach explained as an aid to understanding, we have avoided showing the details of complex calculations – most readers will have access to computers and are unlikely to perform any but the simplest calculations by hand.

We consider that it is particularly important for the reader to be able to interpret output from a computer package. We have therefore chosen, where applicable, to show results using extracts from computer output. In some instances, where we believe individuals may have difficulty with its interpretation, we have included (Appendix C) and annotated the complete computer output from an analysis of a data set. There are many statistical packages in common use; to give the reader an indication of how output can vary, we have not restricted the output to a particular package and have, instead, used four well-known ones – SAS, SPSS, Stata and R.

We know that one of the greatest difficulties facing non-statisticians is choosing the appropriate technique. We have therefore produced two flow charts which can be used both to aid the decision as to what method to use in a given situation and to locate a particular technique in the book easily. These flow charts are displayed prominently on the inside back cover for easy access.

The reader may find it helpful to assess his/her progress in self-directed learning by attempting the interactive exercises on our website (www.medstatsaag.com) or the multiple choice and structured questions, all with model answers, in our Medical Statistics at a Glance Workbook. The website also contains a full set of references (some of which are linked directly to Medline) to supplement the references quoted in the text and provide useful background information for the examples. For those readers who wish to gain a greater insight into particular areas of medical statistics, we can recommend the following books:

We are extremely grateful to Mark Gilthorpe and Jonathan Sterne who made invaluable comments and suggestions on aspects of the second edition, and to Richard Morris, Fiona Lampe, Shak Hajat and Abul Basar for their counsel on the first edition. We wish to thank everyone who has helped us by providing data for the examples. Naturally, we take full responsibility for any errors that remain in the text or examples. We should also like to thank Mike, Gerald, Nina, Andrew and Karen who tolerated, with equanimity, our preoccupation with the first three editions and for their unconditional support, patience and encouragement as we laboured to produce this fourth edition.

Aviva Petrie

Caroline Sabin

London

Part 1
Handling data

Chapters

  1. Types of data
  2. Data entry
  3. Error checking and outliers
  4. Displaying data diagrammatically
  5. Describing data: the ‘average’
  6. Describing data: the ‘spread’
  7. Theoretical distributions: the Normal distribution
  8. Theoretical distributions: other distributions
  9. Transformations

1
Types of data

Learning objectives

By the end of this chapter, you should be able to:

  • Distinguish between a sample and a population
  • Distinguish between categorical and numerical data
  • Describe different types of categorical and numerical data
  • Explain the meaning of the terms: variable, percentage, ratio, quotient, rate, score
  • Explain what is meant by censored data

Relevant Workbook questions: MCQs 1, 2 and 16; and SQ 1 available online

Data and statistics

The purpose of most studies is to collect data to obtain information about a particular area of research. Our data comprise observations on one or more variables; any quantity that varies is termed a variable. For example, we may collect basic clinical and demographic information on patients with a particular illness. The variables of interest may include the sex, age and height of the patients.

Our data are usually obtained from a sample of individuals that represents the population of interest. Our aim is to condense these data in a meaningful way and extract useful information from them. Statistics encompasses the methods of collecting, summarizing, analysing and drawing conclusions from the data: we use statistical techniques to achieve our aim.

Data may take many different forms. We need to know what form every variable takes before we can make a decision regarding the most appropriate statistical methods to use. Each variable and the resulting data will be one of two types: categorical or numerical (Fig. 1.1).

Figure 1.1 Flow diagram illustrating the different types of variable.

Categorical (qualitative) data

These occur when each individual can only belong to one of a number of distinct categories of the variable.

  • Nominal data – the categories are not ordered but simply have names. Examples include blood group (A, B, AB and O) and marital status (married/widowed/single, etc.). In this case, there is no reason to suspect that being married is any better (or worse) than being single!
  • Ordinal data – the categories are ordered in some way. Examples include disease staging systems (advanced, moderate, mild, none) and degree of pain (severe, moderate, mild, none).

A categorical variable is binary or dichotomous when there are only two possible categories. Examples include ‘Yes/No’, ‘Dead/Alive’ or ‘Patient has disease/Patient does not have disease’.

Numerical (quantitative) data

These occur when the variable takes some numerical value. We can subdivide numerical data into two types.

  • Discrete data – occur when the variable can only take certain whole numerical values. These are often counts of numbers of events, such as the number of visits to a GP in a particular year or the number of episodes of illness in an individual over the last 5 years.
  • Continuous data – occur when there is no limitation on the values that the variable can take, e.g. weight or height, other than that which restricts us when we make the measurement.

Distinguishing between data types

We often use very different statistical methods depending on whether the data are categorical or numerical. Although the distinction between categorical and numerical data is usually clear, in some situations it may become blurred. For example, when we have a variable with a large number of ordered categories (e.g. a pain scale with seven categories), it may be difficult to distinguish it from a discrete numerical variable. The distinction between discrete and continuous numerical data may be even less clear, although in general this will have little impact on the results of most analyses. Age is an example of a variable that is often treated as discrete even though it is truly continuous. We usually refer to ‘age at last birthday’ rather than ‘age’, and therefore, a woman who reports being 30 may have just had her 30th birthday, or may be just about to have her 31st birthday.

Do not be tempted to record numerical data as categorical at the outset (e.g. by recording only the range within which each patient’s age falls rather than his/her actual age) as important information is often lost. It is simple to convert numerical data to categorical data once they have been collected.
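The conversion from numerical to categorical data can be sketched in a few lines of Python. This is only an illustration: the cut-points, the band labels and the ages are assumptions chosen for the example, not values recommended in the text.

```python
from bisect import bisect_right

# Hypothetical age bands (an assumption for illustration): <20, 20-29, 30-39, 40+
CUT_POINTS = [20, 30, 40]
LABELS = ["<20", "20-29", "30-39", "40+"]

def age_band(age_years):
    """Return the label of the band that contains the given age."""
    # bisect_right counts how many cut-points are less than or equal to the age,
    # which is exactly the index of the band the age falls into.
    return LABELS[bisect_right(CUT_POINTS, age_years)]

ages = [18, 25, 30, 39.5, 52]            # illustrative values, not real data
bands = [age_band(a) for a in ages]
print(list(zip(ages, bands)))
```

Because the original ages are kept alongside the bands, different cut-points can be tried later; the reverse conversion, from bands back to exact ages, is not possible.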

Derived data

We may encounter a number of other types of data in the medical field. These include:

  • Percentages – these may arise when considering improvements in patients following treatment, e.g. a patient’s lung function (forced expiratory volume in 1 second, FEV1) may increase by 24% following treatment with a new drug. In this case, it is the level of improvement, rather than the absolute value, which is of interest.
  • Ratios or quotients – occasionally you may encounter the ratio or quotient of two variables. For example, body mass index (BMI), calculated as an individual’s weight (kg) divided by her/his height squared (m²), is often used to assess whether s/he is over- or underweight.
  • Rates – disease rates, in which the number of disease events occurring among individuals in a study is divided by the total number of years of follow-up of all individuals in that study (Chapter 31), are common in epidemiological studies (Chapter 12).
  • Scores – we sometimes use an arbitrary value, such as a score, when we cannot measure a quantity. For example, a series of responses to questions on quality of life may be summed to give some overall quality of life score on each individual.

All these variables can be treated as numerical variables for most analyses. Where the variable is derived using more than one value (e.g. the numerator and denominator of a percentage), it is important to record all of the values used. For example, a 10% improvement in a marker following treatment may have different clinical relevance depending on the level of the marker before treatment.
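The derived quantities described above can be computed from the recorded values as in the following Python sketch. The function names, the field names and all the numbers are assumptions made for illustration; they are not taken from any data set in the book.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight (kg) divided by height squared (m^2)."""
    return weight_kg / height_m ** 2

def percent_change(before, after):
    """Percentage change from the value before treatment."""
    return 100.0 * (after - before) / before

def event_rate(n_events, person_years):
    """Number of events divided by the total years of follow-up."""
    return n_events / person_years

# Illustrative record: the raw values are kept alongside the derived ones.
patient = {"weight_kg": 70.0, "height_m": 1.75,
           "fev1_before": 2.5, "fev1_after": 3.1}
patient["bmi"] = bmi(patient["weight_kg"], patient["height_m"])
patient["fev1_pct_change"] = percent_change(patient["fev1_before"],
                                            patient["fev1_after"])

# Illustrative study totals: 12 events over 480 person-years of follow-up.
rate = event_rate(12, 480.0)
print(round(patient["bmi"], 1), round(patient["fev1_pct_change"], 1), rate)
```

Storing the numerator and denominator, rather than only the derived value, means that a 24% improvement can still be interpreted in the light of the level before treatment.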

Censored data

We may come across censored data in situations illustrated by the following examples.

  • If we measure laboratory values using a tool that can only detect levels above a certain cut-off value, then any values below this cut-off will not be detected, i.e. they are censored. For example, when measuring virus levels, those below the limit of detectability will often be reported as ‘undetectable’ or ‘unquantifiable’ even though there may be some virus in the sample. In this situation, if the lower cut-off of a tool is x, say, the results may be reported as ‘< x’. Similarly, some tools may only be able to reliably quantify levels below a certain cut-off value, say y; any measurements above that value will also be censored and the test result may be reported as ‘> y’.
  • We may encounter censored data when following patients in a trial in which, for example, some patients withdraw from the trial before the trial has ended. This type of data is discussed in more detail in Chapter 44.
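When laboratory results are reported as ‘< x’ or ‘> y’, it can help to store the cut-off and the fact of censoring as two separate items. The Python sketch below shows one possible way of doing this; the function name, the flag labels and the reported values are all assumptions made for illustration.

```python
def parse_lab_value(text):
    """Split a reported result into (value, flag).

    The flag is 'below_limit' for results reported as '< x',
    'above_limit' for results reported as '> y', and 'observed' otherwise.
    For censored results the value returned is the cut-off, not the true level.
    """
    text = text.strip()
    if text.startswith("<"):
        return float(text[1:]), "below_limit"
    if text.startswith(">"):
        return float(text[1:]), "above_limit"
    return float(text), "observed"

reported = ["< 50", "1200", "> 100000", "75"]    # illustrative values only
parsed = [parse_lab_value(r) for r in reported]
for raw, (value, flag) in zip(reported, parsed):
    print(raw, "->", value, flag)
```

Keeping the flag prevents a censored result such as ‘< 50’ from being analysed later as if the level had actually been measured as 50.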

2
Data entry

Learning objectives

By the end of this chapter, you should be able to:

  • Describe different formats for entering data on to a computer
  • Outline the principles of questionnaire design
  • Distinguish between single-coded and multi-coded variables
  • Describe how to code missing values

Relevant Workbook questions: MCQs 1, 3 and 4; and SQ 1 available online

When you carry out any study you will almost always need to enter the data into a computer package. Computers are invaluable for improving the accuracy and speed of data collection and analysis, making it easy to check for errors, produce graphical summaries of the data and generate new variables. It is worth spending some time planning data entry – this may save considerable effort at later stages.

Formats for data entry

There are a number of ways in which data can be entered and stored on a computer. Most statistical packages allow you to enter data directly. However, the limitation of this approach is that often you cannot move the data to another package. A simple alternative is to store the data in either a spreadsheet or database package. Unfortunately, their statistical procedures are often limited, and it will usually be necessary to output the data into a specialist statistical package to carry out analyses.

A more flexible approach is to have your data available as an ASCII or text file. Once in an ASCII format, the data can be read by most packages. ASCII format simply consists of rows of text that you can view on a computer screen. Usually, each variable in the file is separated from the next by some delimiter, often a space or a comma. This is known as free format.

The simplest way of entering data in ASCII format is to type the data directly in this format using either a word processing or editing package. Alternatively, data stored in spreadsheet packages can be saved in ASCII format. Using either approach, it is customary for each row of data to correspond to a different individual in the study, and each column to correspond to a different variable, although it may be necessary to go on to subsequent rows if data from a large number of variables are collected on each individual.
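As an illustration of free format, the Python sketch below reads a small comma-delimited text in which each row is an individual and each column a variable. The variable names and values are invented for the example, and the text is held in a string only so that the sketch is self-contained; in practice it would be a file on disk.

```python
import csv
import io

# Illustrative comma-delimited (free format) text: one row per individual,
# one column per variable. The first row holds the variable names.
ascii_text = """id,sex,age,height_cm
1,F,34,162.5
2,M,41,178.0
3,F,29,170.2
"""

reader = csv.DictReader(io.StringIO(ascii_text))
rows = list(reader)

# Everything is read as text, so numerical variables are converted explicitly.
for row in rows:
    row["id"] = int(row["id"])
    row["age"] = int(row["age"])
    row["height_cm"] = float(row["height_cm"])

print(rows[0])
```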

Planning data entry

When collecting data in a study you will often need to use a form or questionnaire for recording the data. If these forms are designed carefully, they can reduce the amount of work that has to be done when entering the data. Generally, these forms/questionnaires include a series of boxes in which the data are recorded – it is usual to have a separate box for each possible digit of the response.

Categorical data

Some statistical packages have problems dealing with non-numerical data. Therefore, you may need to assign numerical codes to categorical data before entering the data into the computer. For example, you may choose to assign the codes of 1, 2, 3 and 4 to categories of ‘no pain’, ‘mild pain’, ‘moderate pain’ and ‘severe pain’, respectively. These codes can be added to the forms when collecting the data. For binary data, e.g. yes/no answers, it is often convenient to assign the codes 1 (e.g. for ‘yes’) and 0 (for ‘no’).
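The coding scheme described above can be written down as a small codebook. The Python sketch below uses the pain and yes/no codes from the text; the function name and the example responses are assumptions made for illustration.

```python
# Codebooks taken from the example in the text.
PAIN_CODES = {"no pain": 1, "mild pain": 2, "moderate pain": 3, "severe pain": 4}
YES_NO_CODES = {"no": 0, "yes": 1}

def encode(response, codebook):
    """Return the numerical code for a response, or raise if it is not in the codebook."""
    key = response.strip().lower()
    if key not in codebook:
        raise ValueError(f"unrecognized response: {response!r}")
    return codebook[key]

# Reverse lookup, so that the codes can be turned back into labels for reporting.
PAIN_LABELS = {code: label for label, code in PAIN_CODES.items()}

responses = ["Mild pain", "severe pain", "no pain"]     # illustrative responses
codes = [encode(r, PAIN_CODES) for r in responses]
print(codes, [PAIN_LABELS[c] for c in codes])
```

Rejecting responses that are not in the codebook, rather than guessing a code, means that unexpected entries are noticed at the time of data entry.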

Numerical data

Numerical data should be entered with the same precision as they are measured, and the unit of measurement should be consistent for all observations on a variable. For example, weight should be recorded in kilograms or in pounds, but not both interchangeably.

Multiple forms per patient

Sometimes, information is collected on the same patient on more than one occasion. It is important that there is some unique identifier (e.g. a serial number) relating to the individual that will enable you to link all of the data from an individual in the study.

Problems with dates and times

Dates and times should be entered in a consistent manner, e.g. either as day/month/year or month/day/year, but not interchangeably. It is important to find out what format the statistical package can read.
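One way of enforcing a single date format is to read every date with the same explicit pattern, so that entries in another format are rejected rather than silently misread. The Python sketch below assumes day/month/year; the format and the example dates are assumptions made for illustration.

```python
from datetime import date, datetime

DATE_FORMAT = "%d/%m/%Y"     # assumed convention: day/month/year

def parse_date(text):
    """Read a date in the agreed format; raise ValueError for anything else."""
    return datetime.strptime(text.strip(), DATE_FORMAT).date()

visit = parse_date("03/04/2019")      # read as 3 April 2019, not 4 March
print(visit.isoformat())

# An entry typed as month/day/year is caught when the 'month' exceeds 12.
try:
    parse_date("04/13/2019")
except ValueError as error:
    print("rejected:", error)
```

An ambiguous entry such as 03/04/2019 cannot be detected in this way, which is why the format needs to be agreed before data entry begins.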

Coding missing values

You should consider what you will do with missing values before you enter the data. In most cases you will need to use some symbol to represent a missing value. Statistical packages deal with missing values in different ways. Some use special characters (e.g. a full stop or asterisk) to indicate missing values, whereas others require you to define your own code for a missing value (commonly used values are 9, 999 or −99). The value that is chosen should be one that is not possible for that variable. For example, when entering a categorical variable with four categories (coded 1, 2, 3 and 4), you may choose the value 9 to represent missing values. However, if the variable is ‘age of child’ then a different code should be chosen, because 9 is a possible age. Missing data are discussed in more detail in Chapter 3.
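The idea of a separate missing-value code for each variable can be sketched in Python as follows. The variable names, the codes and the records are assumptions made for illustration; the point is only that 9 can stand for ‘missing’ in a four-category variable but not in the age of a child.

```python
# Hypothetical missing-value codes, one per variable (assumptions for illustration).
MISSING_CODES = {"pain": 9, "age_child_years": 999}

def decode_missing(record, missing_codes=MISSING_CODES):
    """Replace each variable's missing-value code by None, leaving other values alone."""
    decoded = {}
    for variable, value in record.items():
        if variable in missing_codes and value == missing_codes[variable]:
            decoded[variable] = None
        else:
            decoded[variable] = value
    return decoded

records = [
    {"id": 1, "pain": 3, "age_child_years": 9},      # a genuine age of 9 years
    {"id": 2, "pain": 9, "age_child_years": 999},    # both values missing
]
cleaned = [decode_missing(r) for r in records]
print(cleaned)
```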

Example

Figure 2.1 Portion of a spreadsheet showing data collected on a sample of 64 women with inherited bleeding disorders.

As part of a study on the effect of inherited bleeding disorders on pregnancy and childbirth, data were collected on a sample of 64 women registered at a single haemophilia centre in London. The women were asked questions relating to their bleeding disorder and their first pregnancy (or their current pregnancy if they were pregnant for the first time on the date of interview). Figure 2.1 shows the data from a small selection of the women after the data have been entered onto a spreadsheet, but before they have been checked for errors. The coding schemes for the categorical variables are shown at the bottom of Fig. 2.1. Each row of the spreadsheet represents a separate individual in the study; each column represents a different variable. Where the woman is still pregnant, the age of the woman at the time of birth has been calculated from the estimated date of the baby’s delivery. Data relating to the live births are summarized in Table 37.1 in Chapter 37.

Data kindly provided by Dr R.A. Kadir, University Department of Obstetrics and Gynaecology, and Professor C.A. Lee, Haemophilia Centre and Haemostasis Unit, Royal Free Hospital, London.

3
Error checking and outliers

Learning objectives

By the end of this chapter, you should be able to:

  • Describe how to check for errors in data
  • Distinguish between data that are missing completely at random, missing at random and missing not at random
  • Outline the methods of dealing with missing data, distinguishing between single and multiple imputation
  • Define an outlier
  • Explain how to check for and handle outliers

Relevant Workbook questions: MCQs 5 and 6; and SQs 1 and 28 available online

In any study there is always the potential for errors to occur in a data set, either at the outset when taking measurements, or when collecting, transcribing and entering the data into a computer. It is hard to eliminate all of these errors. However, you can reduce the number of typing and transcribing errors by checking the data carefully once they have been entered. Simply scanning the data by eye will often identify values that are obviously wrong. In this chapter we suggest a number of other approaches that you can use when checking data.

Typing errors

Typing mistakes are the most frequent source of errors when entering data. If the amount of data is small, then you can check the typed data set against the original forms/questionnaires to see whether there are any typing mistakes. However, this is time-consuming if the amount of data is large. It is possible to type the data in twice and compare the two data sets using a computer program. Any differences between the two data sets will reveal typing mistakes. Although this approach does not rule out the possibility that the same error has been incorrectly entered on both occasions, or that the value on the form/questionnaire is incorrect, it does at least minimize the number of errors. The disadvantage of this method is that it takes twice as long to enter the data, which may have major cost or time implications.
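The comparison of two separately typed data sets can be sketched in Python as below. The function name and the two small data sets are assumptions made for illustration, and the sketch assumes that both sets list the same individuals in the same order.

```python
def compare_entries(first, second):
    """Return (row index, variable, first value, second value) for every cell that differs."""
    if len(first) != len(second):
        raise ValueError("the two data sets have different numbers of rows")
    differences = []
    for i, (row1, row2) in enumerate(zip(first, second)):
        for variable in row1:
            if row1[variable] != row2.get(variable):
                differences.append((i, variable, row1[variable], row2.get(variable)))
    return differences

# Two illustrative typings of the same forms; the second contains a transposed age.
first_entry = [{"id": 1, "age": 34, "weight_kg": 61.5},
               {"id": 2, "age": 41, "weight_kg": 72.0}]
second_entry = [{"id": 1, "age": 34, "weight_kg": 61.5},
                {"id": 2, "age": 14, "weight_kg": 72.0}]

differences = compare_entries(first_entry, second_entry)
print(differences)
```

Each difference only shows that the two typings disagree; the original form still has to be consulted to find out which value, if either, is correct.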

Error checking

With all error checks, a value should only be corrected if there is evidence that a mistake has been made. You should not change values simply because they look unusual.

Handling missing data

There is always a chance that some data will be missing. If a large proportion of the data is missing, then the results are unlikely to be reliable. The reasons why data are missing should always be investigated – if missing data tend to cluster on a particular variable and/or in a particular subgroup of individuals, then it may indicate that the variable is not applicable or has never been measured for that group of individuals. If this is the case, it may be necessary to exclude that variable or group of individuals from the analysis. There are different types of missing data¹: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR).

Provided the missing data are not MNAR, we may be able to estimate (impute¹) the missing data². A simple approach is to replace a missing observation by the mean of the existing observations for that variable or, if the data are longitudinal, by the last observed value. These are examples of single imputation. In multiple imputation, we create a number (generally up to five) of imputed data sets from the original data set, with the missing values replaced by imputed values which are derived from an appropriate model that incorporates random variation. We then use standard statistical procedures on each complete imputed data set and finally combine the results from these analyses. Alternative statistical approaches to dealing with missing data are available², but the best option is to minimize the amount of missing data at the outset.
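The two single imputation approaches mentioned above, replacement by the mean and by the last observed value, can be sketched in Python as follows. Missing values are represented by None; the function names and the data are assumptions made for illustration. Multiple imputation requires a suitable model and is not attempted in this sketch.

```python
from statistics import mean

def impute_mean(values):
    """Replace each missing value (None) by the mean of the observed values."""
    observed = [v for v in values if v is not None]
    average = mean(observed)
    return [average if v is None else v for v in values]

def impute_last_observed(values):
    """For longitudinal data, replace each missing value by the last observed value.

    A missing value with no earlier observation is left as None.
    """
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

cross_sectional = [4.0, None, 6.0]                  # illustrative values
repeated_visits = [None, 3, None, None, 5, None]    # one individual over six visits

print(impute_mean(cross_sectional))
print(impute_last_observed(repeated_visits))
```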

Outliers

What are outliers?

Outliers are observations that are distinct from the main body of the data, and are incompatible with the rest of the data. These values may be genuine observations from individuals with very extreme levels of the variable. However, they may also result from typing errors or the incorrect choice of units, and so any suspicious values should be checked. It is important to detect whether there are outliers in the data set, as they may have a considerable impact on the results from some types of analyses (Chapter 29).

For example, a woman who is 7 feet tall would probably appear as an outlier in most data sets. However, although this value is clearly very high, compared with the usual heights of women, it may be genuine and the woman may simply be very tall. In this case, you should investigate this value further, possibly checking other variables such as her age and weight, before making any decisions about the validity of the result. The value should only be changed if there really is evidence that it is incorrect.

Checking for outliers

A simple approach is to print the data and check them by eye. This is suitable if the number of observations is not too large and if the potential outlier is much lower or higher than the rest of the data. Range checking should also identify possible outliers. Alternatively, the data can be plotted in some way (Chapter 4) – outliers can be clearly identified on histograms and scatter plots (see also Chapter 29 for a discussion of outliers in regression analysis).
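Range checking can be sketched in Python as a comparison of each value with a plausible minimum and maximum for its variable. The variable names, the limits and the records below are assumptions made for illustration; suitable limits would have to be chosen by the investigator.

```python
# Hypothetical plausible ranges (assumptions for illustration only).
PLAUSIBLE_RANGES = {
    "gestational_age_weeks": (20, 45),
    "weight_kg": (0.5, 6.0),
}

def range_check(records, ranges):
    """Return (row index, variable, value) for every value outside its plausible range."""
    flagged = []
    for i, record in enumerate(records):
        for variable, (low, high) in ranges.items():
            value = record.get(variable)
            # Missing values (None) are skipped here; they are handled separately.
            if value is not None and not (low <= value <= high):
                flagged.append((i, variable, value))
    return flagged

records = [
    {"gestational_age_weeks": 39, "weight_kg": 3.4},
    {"gestational_age_weeks": 41, "weight_kg": 11.19},
    {"gestational_age_weeks": 4, "weight_kg": None},
]
flagged = range_check(records, PLAUSIBLE_RANGES)
print(flagged)
```

A flagged value is a candidate for checking against the original records, not a value to be changed or removed automatically.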

Handling outliers

It is important not to remove an individual from an analysis simply because his/her values are higher or lower than might be expected. However, the inclusion of outliers may affect the results when some statistical techniques are used. A simple approach is to repeat the analysis both including and excluding the value – this is a type of sensitivity analysis (Chapter 35). If the results are similar, then the outlier does not have a great influence on the result. However, if the results change drastically, it is important to use appropriate methods that are not affected by outliers to analyse the data. These include the use of transformations (Chapter 9) and non-parametric tests (Chapter 17).
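The idea of repeating an analysis with and without a suspect value can be illustrated with a very simple summary, the mean. In the Python sketch below the heights are invented for the example, with 213 cm standing for a woman who is about 7 feet tall.

```python
from statistics import mean

# Illustrative heights in cm; the last value is the suspected outlier.
heights_cm = [158, 160, 162, 165, 167, 170, 213]
suspect_value = 213

mean_including = mean(heights_cm)
mean_excluding = mean([h for h in heights_cm if h != suspect_value])
difference = mean_including - mean_excluding

print(f"mean including the value: {mean_including:.1f} cm")
print(f"mean excluding the value: {mean_excluding:.1f} cm")
print(f"difference:               {difference:.1f} cm")
```

Whether a difference of this size matters depends on the purpose of the analysis; the comparison only shows how much influence the single value has on the result.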

Example

Figure 3.1 Portion of a spreadsheet showing how to check for errors in a data set.

After entering the data described in Chapter 2, the data set is checked for errors (Fig. 3.1). Some of the inconsistencies highlighted are simple data entry errors. For example, the code of ‘41’ in the ‘Sex of baby’ column is incorrect as a result of the sex information being missing for patient 20; the rest of the data for patient 20 had been entered in the incorrect columns. Others (e.g. unusual values in the gestational age and weight columns) are likely to be errors, but the notes should be checked before any decision is made, as these may reflect genuine outliers. In this case, the gestational age of patient number 27 was 41 weeks, and it was decided that a weight of 11.19 kg was incorrect. As it was not possible to find the correct weight for this baby, the value was entered as missing.

References

  1. Bland, M. (2015) An Introduction to Medical Statistics. 4th edition. Oxford University Press.
  2. Horton, N.J. and Kleinman, K.P. (2007) Much ado about nothing: a comparison of missing data methods and software to fit incomplete data regression models. American Statistician, 61(1), 71–90.