
Illuminating Statistical Analysis Using Scenarios and Simulations

Jeffrey E Kottemann Ph.D.


Preface

The goal of this book is to help people develop an assortment of key intuitions about statistics and inference and to use those intuitions to make sense of statistical analysis methods in a conceptual as well as a practical way. Moreover, I hope to engender good ways of thinking about uncertainty. The book comprises a series of short, concise chapters that build upon each other and are best read in order. The chapters cover a wide range of concepts and methods of classical (frequentist) statistics and inference. (There are also appendices on Bayesian statistics and on data mining techniques.)

Examining computer simulation results is central to our investigation. Simulating pollsters who survey random people for responses to an agree/disagree opinion question, for example, not only mimics reality but also has the added advantage of letting us employ 1000 independent pollsters simultaneously. The results produced by such simulations provide an eye-opening way to (re)discover the properties of sample statistics and the role of chance, and to (re)invent corresponding principles of statistical inference. The simulation results also foreshadow the various mathematical formulas that underlie statistical analysis.

Mathematics used in the book involves basic algebra. Of particular relevance is interpreting the relationships found in formulas. Take, for example, A = B/C. As B increases, A increases because B is the numerator of the fraction. And as C increases, A decreases because C is the denominator. Going one step further, we could have A = B/(C/D). Here, as D increases, C/D decreases, so B/(C/D) increases, so A increases. These functional forms mirror most of the statistical formulas we will encounter.
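To make these directional relationships concrete, here is a small numerical illustration in Python; the specific values and the variable names B, C, and D are arbitrary and chosen only for this example (the book itself works with spreadsheets rather than programming code).

    # Arbitrary example values illustrating how the ratio A responds to B, C, and D.
    B, C, D = 12.0, 4.0, 2.0
    print(B / C)               # 3.0
    print((B + 6) / C)         # 4.5  -> increasing the numerator B increases A
    print(B / (C + 4))         # 1.5  -> increasing the denominator C decreases A
    print(B / (C / D))         # 6.0
    print(B / (C / (D * 2)))   # 12.0 -> increasing D shrinks C/D, so A increases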

As we will see for a wide range of scenarios, simulation results clearly illustrate the terms and relationships found in the various formulas that underlie statistical analysis methods. They also bring to light the underlying assumptions that those formulas and methods rely upon. Last, but not least, we will see that simulation can serve as a robust statistical analysis method in its own right.

Bon voyage
Jeffrey E. Kottemann

Acknowledgements

My thanks go to Dan Dolk, Gene Hahn, Fati Salimian, and Kathie Wright for their feedback and encouragement. At John Wiley & Sons, thanks go to Susanne Steitz-Filler, Kathleen Pagliaro, Vishnu Narayanan, and Shikha Pahuja.

Part I
Sample Proportions and the Normal Distribution

Part I opener figure: proportion of heads for 10 flips of a fair coin, or 10 random people surveyed in an evenly split community, repeated 1000 times (x-axis: proportion of heads, 0–1; y-axis: frequency out of 1000, 0–300).

1
Evidence and Verdicts

Before we focus on using statistics as evidence in making judgments, let's take a look at a widely used “verdict outcomes framework.” This general framework is useful for framing judgments in a wide range of situations, including those encountered in statistical analysis.

Anytime we use evidence to arrive at a judgment, there are four generic outcomes possible, as shown in Table 1.1. Two outcomes correspond to correct judgments and two correspond to incorrect judgments, although we rarely know whether our judgments are correct or incorrect. Consider a jury trial in U.S. criminal court. Ideally, the jury is always correct, judging innocent defendants not guilty and judging guilty defendants guilty. Evidence is never perfect, though, and so juries will make erroneous judgments, judging innocent defendants guilty or guilty defendants not guilty.

Table 1.1 Verdict outcomes framework.

                                 Verdict: not guilty                  Verdict: guilty
Defendant is actually innocent   Correct judgment                     Incorrect judgment (type I error)
Defendant is actually guilty     Incorrect judgment (type II error)   Correct judgment

In U.S. criminal court, the presumption is that a defendant is innocent until “proven” guilty. Further, convention has it that we are more afraid of punishing an innocent person (type I error) than of letting a guilty person go unpunished (type II error). Because of this fear, the threshold for a guilty verdict is set high: “beyond a reasonable doubt.” So, convicting an innocent person should be a relatively unlikely outcome; we are willing to accept a greater chance of letting a guilty person go unpunished than of punishing an innocent one. In short, we need to be very sure before we reject the presumption of innocence and render a verdict of guilty in U.S. criminal court.

We can change the relative chances of the two types of error by changing the threshold. Say we change from “beyond a reasonable doubt” to “a preponderance of evidence.” The former is the threshold used in U.S. criminal court, and the latter is the threshold used in U.S. civil court. Let's say that the former corresponds to being 95% sure before judging a defendant guilty and that the latter corresponds to being 51% sure before judging a defendant guilty. You can imagine cases where the same evidence results in different verdicts in criminal and civil court, which indeed does happen. For example, say that the evidence leads to the jury being 60% sure of the defendant's guilt. The jury verdict in criminal court would be not guilty (60% < 95%) but the jury verdict in civil court would be guilty (60% > 51%). Compared to criminal court, civil court is more likely to declare an innocent person guilty (type I error), but is also less likely to declare a guilty person not guilty (type II error).

Statistical analysis is conducted as if in criminal court. A number of jury guidelines have parallels in statistical analysis, as we'll see repeatedly.

Statistical analysis formally evaluates evidence in order to determine whether to reject or not reject a stated presumption, and it is primarily concerned with limiting the chances of type I error. Further, the amount of evidence and the variance of the evidence are key characteristics that are formally incorporated into the evaluation process. In what follows, we'll see how this is accomplished.

2
Judging Coins I

Let's start with the simplest statistical situation: that of judging whether a coin is fair or not fair. Later we'll see that this situation is statistically equivalent to agree or disagree opinion polling. A coin is fair if it has a 50% chance of coming up heads, and a 50% chance of coming up tails when you flip it. Adjusting the verdict table to the coin-flipping context gives us Table 2.1.

Table 2.1 Coin flipping outcomes.

                            Verdict: coin is fair                Verdict: coin is not fair
Coin is actually fair       Correct judgment                     Incorrect judgment (type I error)
Coin is actually not fair   Incorrect judgment (type II error)   Correct judgment

Intuitively, it seems extremely unlikely for a fair coin to come up heads only 0 or 1 times out of 10, and most people would arrive at the verdict that the coin is not fair. Likewise, it seems extremely unlikely for a fair coin to come up heads 9 or 10 times out of 10, and most people would arrive at the verdict that the coin is not fair. On the other hand, it seems fairly likely for a fair coin to come up heads 4, 5, or 6 times out of 10, and so most people would say that the coin seems fair. But what about 2, 3, 7, or 8 heads? Let's experiment.

Shown in Figure 2.1 is a histogram of what actually happened (in simulation) when 1000 people each flipped a fair coin 10 times. This shows us how fair coins tend to behave. The horizontal axis is the number of heads that came up out of 10. The vertical axis shows the number of people out of the 1000 who came up with the various numbers of heads.

Figure 2.1  Number of heads when 1000 people each flip a fair coin 10 times (x-axis: number of heads, 0–10; y-axis: frequency out of 1000, 0–300).

Appendix B gives step-by-step instructions for constructing this simulation using common spreadsheet software; guidelines are also given for constructing additional simulations found in the book.
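For readers who prefer a programming language to a spreadsheet, the following short Python sketch produces the same kind of simulation; the variable names and the use of Python's random module are my own and are not part of the book's spreadsheet instructions.

    import random
    from collections import Counter

    random.seed(1)                      # fix the seed only so results can be reproduced
    n_people, n_flips = 1000, 10

    # Each simulated "person" flips a fair coin 10 times; record the number of heads.
    heads_counts = [sum(random.random() < 0.5 for _ in range(n_flips))
                    for _ in range(n_people)]

    # Tally how many of the 1000 people got each possible number of heads (0 through 10).
    frequency = Counter(heads_counts)
    for h in range(n_flips + 1):
        print(h, frequency.get(h, 0))

Plotting these frequencies as a bar chart yields a histogram like Figure 2.1.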

Sure enough, fair coins very rarely came up heads 0, 1, 9, or 10 times. And, sure enough, they very often came up heads 4, 5, or 6 times. What about 2, 3, 7, or 8 heads?

Notice that 2 heads came up a little less than 50 times out of 1000, or near 5% of the time. Same with 8 heads. And, 3 heads came up well over 100 times out of 1000, or over 10% of the time. Same with 7 heads.

3
Brief on Bell Shapes

Before expanding the previous Statistical Scenario, let's briefly explore why the histogram, reproduced in Figure 3.1, is shaped the way it is: bell-shaped, tapering off symmetrically on each side from a single peak in the middle.

Figure 3.1  Number of heads when 1000 people each flip a fair coin 10 times, reproduced from Figure 2.1 (x-axis: number of heads, 0–10; y-axis: frequency out of 1000, 0–300).

Since each coin flip has two possible outcomes and we are considering ten separate outcomes together, there are a total of 2^10 = 1024 unique possible patterns (permutations) of heads and tails with 10 flips of a coin. Of these, there is only one with 0 heads and only one with 10 heads. These are the least likely outcomes.

TTTTTTTTTT HHHHHHHHHH

There are ten with 1 head, and ten with 9 heads:

HTTTTTTTTT THHHHHHHHH
THTTTTTTTT HTHHHHHHHH
TTHTTTTTTT HHTHHHHHHH
TTTHTTTTTT HHHTHHHHHH
TTTTHTTTTT HHHHTHHHHH
TTTTTHTTTT HHHHHTHHHH
TTTTTTHTTT HHHHHHTHHH
TTTTTTTHTT HHHHHHHTHH
TTTTTTTTHT HHHHHHHHTH
TTTTTTTTTH HHHHHHHHHT

Since there are 10 times more ways to get 1 or 9 heads than 0 or 10 heads, we expect to flip 1 or 9 heads 10 times more often than 0 or 10 heads.

Further, there are 45 ways to get 2 or 8 heads, 120 ways to get 3 or 7 heads, and 210 ways to get 4 or 6 heads. Finally, there are 252 ways to get 5 heads, which is the most likely outcome and therefore the most frequently expected outcome. Notice how the shape of the histogram of simulation outcomes we saw in Figure 3.1 closely mirrors the number of ways (#Ways) chart that is shown in Figure 3.2.

Figure 3.2  Number of ways (#Ways) to get each number of heads, out of a total of 1024 (x-axis: number of heads, 0–10; y-axis: ways out of 1024, 0–300).

You don't need to worry about calculating #ways. Soon we won't need such calculations. Just for the record, the formula for #ways is n!/(h!(n − h)!), where n is the number of flips, h is the number of heads you are interested in, and ! is the factorial operation (example: 4! = 4 × 3 × 2 × 1 = 24). In official terms, #ways is the number of combinations of n things taken h at a time.
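As a quick check of the formula (a sketch, not part of the book), Python's standard library can compute both the factorial form and the combinations directly:

    from math import factorial, comb

    n = 10
    for h in range(n + 1):
        ways = factorial(n) // (factorial(h) * factorial(n - h))
        assert ways == comb(n, h)       # the built-in combinations function agrees
        print(h, ways)
    # Prints 1, 10, 45, 120, 210, 252, 210, 120, 45, 10, 1 -- which total 1024.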

4
Judging Coins II

Let's revisit Statistical Scenario–Coins #1, now with additional information on each of the possible outcomes. Table 4.1 summarizes this additional information. As noted, there are a total of 2^10 = 1024 unique patterns of heads and tails possible when we flip a coin 10 times. For any given number of heads, as we have just seen, there are one or more ways to get that number of heads.

Table 4.1 Coin flipping details.

#Heads   #Ways   Expected relative frequency   Probability   As percent   Rounded
0        1       1/1024                        0.00098       0.098%       0.1%
1        10      10/1024                       0.00977       0.977%       1.0%
2        45      45/1024                       0.04395       4.395%       4.4%
3        120     120/1024                      0.11719       11.719%      11.7%
4        210     210/1024                      0.20508       20.508%      20.5%
5        252     252/1024                      0.24609       24.609%      24.6%
6        210     210/1024                      0.20508       20.508%      20.5%
7        120     120/1024                      0.11719       11.719%      11.7%
8        45      45/1024                       0.04395       4.395%       4.4%
9        10      10/1024                       0.00977       0.977%       1.0%
10       1       1/1024                        0.00098       0.098%       0.1%
Totals:  1024    1024/1024                     1.0           100%         100%

The #ways divided by 1024 gives us the expected relative frequency for that number of heads expressed as a fraction. For example, we expect to get 5 heads 252/1024ths of the time. The fraction can also be expressed as a decimal value. This decimal value can be viewed as the probability that a certain number of heads will come up in 10 flips. For example, the probability of getting 5 heads is approximately 0.246. We can also express this as a percentage, 24.6%.

A probability of 1 (100%) means something will always happen and a probability of 0 (0%) means something will never happen. A probability of 0.5 (50%) means something will happen half the time. The probabilities of the entire set of possible outcomes always sum to 1 (100%). The probability of a subset of possible outcomes can be calculated by summing the probabilities of each of the outcomes in the subset. For example, using the rounded percentages from the table, the probability of 2 or fewer heads is 0.1% + 1.0% + 4.4% = 5.5%.
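If you want to verify these figures yourself, a brief Python sketch (assuming the same fair-coin setup as above) reproduces the probabilities in Table 4.1 and the subset sum just computed:

    from math import comb

    n = 10
    probs = [comb(n, h) / 2**n for h in range(n + 1)]   # #ways divided by 1024

    print(round(probs[5], 5))   # 0.24609    -> about 24.6% for exactly 5 heads
    print(sum(probs))           # 1.0        -> the probabilities sum to 1
    print(sum(probs[:3]))       # 0.0546875  -> P(2 or fewer heads), about 5.5%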

Notice how the bars of our simulation histogram, reproduced in Figure 4.1, reflect the corresponding probabilities in Table 4.1.

Figure 4.1  Number of heads when 1000 people each flip a fair coin 10 times, reproduced (x-axis: number of heads, 0–10; y-axis: frequency out of 1000, 0–300).

Say someone gives you a coin to test. When you flip the coin 10 times, you are sampling the coin's behavior 10 times. The number of heads you toss is your evidence. Based on this evidence you must decide whether to reject your presumption of fairness and judge the coin as not fair.

What happens if you make your “verdict rule” the following:

Verdict: the coin is “not fair” if the number of heads falls outside the interval #heads ≥ 1 and #heads ≤ 9 (that is, 0 or 10 heads), as shown in Table 4.2 and the accompanying Figure 4.2?

From the Statistical Scenario Table 4.1, we can see that a fair coin will come up 0 heads about 0.1% of the time, and 10 heads about 0.1% of the time. The sum is about 0.2% of the time, or about 2 out of 1000. So, it will be extremely rare for us to make a type I error and erroneously call a fair coin unfair because fair coins will almost never come up with 0 or 10 heads. However, what about 1 head or 9 heads? Our rule says not to call those coins unfair. But a fair coin will only come up 1 head or 9 heads about 2% of the time (1.0% + 1.0%). Therefore, we may end up misjudging many unfair coins that come up heads one or nine times because we'll declare them to be fair coins. That is type II error.

Table 4.2 First verdict rule scenario.

#Heads   Verdict      Probability for a fair coin
0        Not fair     0.1%
1–9      Fair         99.8%
10       Not fair     0.1%

Figure 4.2  Fair-coin histogram with the verdict rule interval #heads ≥ 1 and #heads ≤ 9 marked (x-axis: number of heads, 0–10; y-axis: frequency out of 1000, 0–300).

Determining the chance of type II error is too involved for discussion now (that is Chapter 17), but recall from Chapter 1 that increasing the chance of type I error decreases the chance of type II error, and vice versa.

To lower the chances of type II error, we can narrow our “verdict rule” interval to #heads ≥ 2 and #heads ≤ 8, as shown in Table 4.3 and Figure 4.3. Now the probability of making a type I error is about 2.2% (0.1% + 1.0% + 1.0% + 0.1%). This rule will decrease the chances of type II error, while increasing the chances of type I error from 0.2% to 2.2%.

If we narrow our “verdict rule” interval even more to #heads ≥ 3 and ≤ 7, we get Table 4.4 and Figure 4.4.

Table 4.3 Second verdict rule scenario.

#Heads   Verdict      Probability for a fair coin
0–1      Not fair     1.1%
2–8      Fair         97.8%
9–10     Not fair     1.1%

Figure 4.3  Fair-coin histogram with the verdict rule interval #heads ≥ 2 and #heads ≤ 8 marked (x-axis: number of heads, 0–10; y-axis: frequency out of 1000, 0–300).

Now the probability of making a type I error is about 11% because a fair coin will come up 0, 1, 2, 8, 9, or 10 heads about 11% of the time (0.1% + 1.0% + 4.4% + 4.4% + 1.0% + 0.1%). We can express this uncertainty by saying either that there will be an 11% chance of a type I error, or that we are 89% confident that there will not be a type I error. Notice that this is what we came up with earlier by simply eyeballing the histogram of actual simulation outcomes in Chapter 2.

Table 4.4 Third verdict rule scenario.

#Heads   Verdict      Probability for a fair coin
0–2      Not fair     5.5%
3–7      Fair         89.0%
8–10     Not fair     5.5%

Figure 4.4  Fair-coin histogram with the verdict rule interval #heads ≥ 3 and #heads ≤ 7 marked (x-axis: number of heads, 0–10; y-axis: frequency out of 1000, 0–300).

As noted earlier, type I error is feared most. And an 11% chance of type I error is usually seen as excessive. So, we can adopt this rule:

  1. Verdict rule: if the number of heads falls outside the interval #heads ≥ 2 and #heads ≤ 8, judge the coin to be not fair.

This gives us about a 2% chance of type I error.
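For completeness, here is a short Python sketch (not from the book) that computes the chance of a type I error for each of the three candidate verdict rules using exact fair-coin probabilities; the slight differences from the chapter's 2.2% and 11% arise because the chapter sums the rounded percentages from Table 4.1.

    from math import comb

    n = 10
    def prob_heads(h):
        return comb(n, h) / 2**n          # P(exactly h heads) for a fair coin

    # Probability that a fair coin lands outside each "judge as fair" interval (type I error).
    for low, high in [(1, 9), (2, 8), (3, 7)]:
        p_type1 = sum(prob_heads(h) for h in range(n + 1) if h < low or h > high)
        print(f"interval {low}-{high}: {p_type1:.2%}")
    # interval 1-9: 0.20%,  interval 2-8: 2.15%,  interval 3-7: 10.94%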

From now on, we'll typically use the following threshold levels for type I error: 10% (0.10), 5% (0.05), and 1% (0.01). We'll see the effects of using various thresholds as we go along. Also as we go along, we'll need to replace some common words with statistical terminology: the presumption (for example, that the coin is fair) is called the null hypothesis, the evidence (the number or proportion of heads in our sample) is called the sample statistic, and the histogram of sample statistic values expected under the presumption is called the sampling distribution.

It is important to emphasize that simulation histograms represent sampling distributions that tell us what to expect when the null hypothesis is true. We'll look at many, many sampling distribution histograms in this book. For the remainder of Part I, we'll switch from using counts as our sample statistic to using proportions as our sample statistic. The sampling distribution histogram in Figure 4.5 shows the use of proportions rather than counts on the horizontal axis.

Figure 4.5  Proportion of heads when 1000 people each flip a fair coin 10 times (x-axis: proportion of heads, 0–1; y-axis: frequency, 0–300).
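In the simulation, switching from counts to proportions only requires dividing each person's number of heads by the number of flips. Here is a minimal Python sketch continuing the earlier example (variable names are again my own):

    import random
    from collections import Counter

    random.seed(1)
    n_people, n_flips = 1000, 10

    # Record each person's proportion of heads (number of heads divided by the 10 flips).
    proportions = [sum(random.random() < 0.5 for _ in range(n_flips)) / n_flips
                   for _ in range(n_people)]

    frequency = Counter(proportions)
    for p in [h / n_flips for h in range(n_flips + 1)]:
        print(p, frequency.get(p, 0))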