Title page image

Statistical Analysis with R For Dummies^®

Published by: John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030-5774, www.wiley.com

Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the Publisher. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Trademarks: Wiley, For Dummies, the Dummies Man logo, Dummies.com, Making Everything Easier, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: THE PUBLISHER AND THE AUTHOR MAKE NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF THE CONTENTS OF THIS WORK AND SPECIFICALLY DISCLAIM ALL WARRANTIES, INCLUDING WITHOUT LIMITATION WARRANTIES OF FITNESS FOR A PARTICULAR PURPOSE. NO WARRANTY MAY BE CREATED OR EXTENDED BY SALES OR PROMOTIONAL MATERIALS. THE ADVICE AND STRATEGIES CONTAINED HEREIN MAY NOT BE SUITABLE FOR EVERY SITUATION. THIS WORK IS SOLD WITH THE UNDERSTANDING THAT THE PUBLISHER IS NOT ENGAGED IN RENDERING LEGAL, ACCOUNTING, OR OTHER PROFESSIONAL SERVICES. IF PROFESSIONAL ASSISTANCE IS REQUIRED, THE SERVICES OF A COMPETENT PROFESSIONAL PERSON SHOULD BE SOUGHT. NEITHER THE PUBLISHER NOR THE AUTHOR SHALL BE LIABLE FOR DAMAGES ARISING HEREFROM. THE FACT THAT AN ORGANIZATION OR WEBSITE IS REFERRED TO IN THIS WORK AS A CITATION AND/OR A POTENTIAL SOURCE OF FURTHER INFORMATION DOES NOT MEAN THAT THE AUTHOR OR THE PUBLISHER ENDORSES THE INFORMATION THE ORGANIZATION OR WEBSITE MAY PROVIDE OR RECOMMENDATIONS IT MAY MAKE. FURTHER, READERS SHOULD BE AWARE THAT INTERNET WEBSITES LISTED IN THIS WORK MAY HAVE CHANGED OR DISAPPEARED BETWEEN WHEN THIS WORK WAS WRITTEN AND WHEN IT IS READ.

For general information on our other products and services, please contact our Customer Care Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993, or fax 317-572-4002. For technical support, please visit https://hub.wiley.com/community/support/dummies.

Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.

Library of Congress Control Number: 2017932881

ISBN: 978-1-119-33706-5; 978-1-119-33726-3 (ebk); 978-1-119-33709-6 (ebk)

Statistical Analysis with R For Dummies®

To view this book's Cheat Sheet, simply go to www.dummies.com and search for “Statistical Analysis with R For Dummies Cheat Sheet” in the Search box.

Table of Contents
1. Cover
2. Introduction
  1. About This Book
  2. Similarity with This Other For Dummies Book
  3. What You Can Safely Skip
  4. Foolish Assumptions
  5. How This Book Is Organized
  6. Icons Used in This Book
  7. Where to Go from Here
3. Part 1: Getting Started with Statistical Analysis with R
  1. Chapter 1: Data, Statistics, and Decisions
    1. The Statistical (and Related) Notions You Just Have to Know
    2. Inferential Statistics: Testing Hypotheses
  2. Chapter 2: R: What It Does and How It Does It
    1. Downloading R and RStudio
    2. A Session with R
    3. R Functions
    4. User-Defined Functions
    5. Comments
    6. R Structures
    7. Packages
    8. More Packages
    9. R Formulas
    10. Reading and Writing
4. Part 2: Describing Data
  1. Chapter 3: Getting Graphic
    1. Finding Patterns
    2. Base R Graphics
    3. Graduating to ggplot2
    4. Wrapping Up
  2. Chapter 4: Finding Your Center
    1. Means: The Lure of Averages
    2. The Average in R: mean()
    3. Medians: Caught in the Middle
    4. The Median in R: median()
    5. Statistics à la Mode
    6. The Mode in R
  3. Chapter 5: Deviating from the Average
    1. Measuring Variation
    2. Back to the Roots: Standard Deviation
    3. Standard Deviation in R
    4. Conditions, Conditions, Conditions …
  4. Chapter 6: Meeting Standards and Standings
    1. Catching Some Z’s
    2. Standard Scores in R
    3. Where Do You Stand?
    4. Summarizing
  5. Chapter 7: Summarizing It All
    1. How Many?
    2. The High and the Low
    3. Living in the Moments
    4. Tuning in the Frequency
    5. Summarizing a Data Frame
  6. Chapter 8: What’s Normal?
    1. Hitting the Curve
    2. Working with Normal Distributions
    3. A Distinguished Member of the Family
5. Part 3: Drawing Conclusions from Data
  1. Chapter 9: The Confidence Game: Estimation
    1. Understanding Sampling Distributions
    2. An EXTREMELY Important Idea: The Central Limit Theorem
    3. Confidence: It Has Its Limits!
    4. Fit to a t
  2. Chapter 10: One-Sample Hypothesis Testing
    1. Hypotheses, Tests, and Errors
    2. Hypothesis Tests and Sampling Distributions
    3. Catching Some Z’s Again
    4. Z Testing in R
    5. t for One
    6. t Testing in R
    7. Working with t-Distributions
    8. Visualizing t-Distributions
    9. Testing a Variance
    10. Working with Chi-Square Distributions
    11. Visualizing Chi-Square Distributions
  3. Chapter 11: Two-Sample Hypothesis Testing
    1. Hypotheses Built for Two
    2. Sampling Distributions Revisited
    3. t for Two
    4. Like Peas in a Pod: Equal Variances
    5. t-Testing in R
    6. A Matched Set: Hypothesis Testing for Paired Samples
    7. Paired Sample t-testing in R
    8. Testing Two Variances
    9. Working with F-Distributions
    10. Visualizing F-Distributions
  4. Chapter 12: Testing More than Two Samples
    1. Testing More Than Two
    2. ANOVA in R
    3. Another Kind of Hypothesis, Another Kind of Test
    4. Getting Trendy
    5. Trend Analysis in R
  5. Chapter 13: More Complicated Testing
    1. Cracking the Combinations
    2. Two-Way ANOVA in R
    3. Two Kinds of Variables … at Once
    4. After the Analysis
    5. Multivariate Analysis of Variance
  6. Chapter 14: Regression: Linear, Multiple, and the General Linear Model
    1. The Plot of Scatter
    2. Graphing Lines
    3. Regression: What a Line!
    4. Linear Regression in R
    5. Juggling Many Relationships at Once: Multiple Regression
    6. ANOVA: Another Look
    7. Analysis of Covariance: The Final Component of the GLM
  7. Chapter 15: Correlation: The Rise and Fall of Relationships
    1. Scatter plots Again
    2. Understanding Correlation
    3. Correlation and Regression
    4. Testing Hypotheses About Correlation
    5. Correlation in R
    6. Multiple Correlation
    7. Partial Correlation
    8. Partial Correlation in R
    9. Semipartial Correlation
    10. Semipartial Correlation in R
  8. Chapter 16: Curvilinear Regression: When Relationships Get Complicated
    1. What Is a Logarithm?
    2. What Is e?
    3. Power Regression
    4. Exponential Regression
    5. Logarithmic Regression
    6. Polynomial Regression: A Higher Power
    7. Which Model Should You Use?
6. Part 4: Working with Probability
  1. Chapter 17: Introducing Probability
    1. What Is Probability?
    2. Compound Events
    3. Conditional Probability
    4. Large Sample Spaces
    5. R Functions for Counting Rules
    6. Random Variables: Discrete and Continuous
    7. Probability Distributions and Density Functions
    8. The Binomial Distribution
    9. The Binomial and Negative Binomial in R
    10. Hypothesis Testing with the Binomial Distribution
    11. More on Hypothesis Testing: R versus Tradition
  2. Chapter 18: Introducing Modeling
    1. Modeling a Distribution
    2. A Simulating Discussion
7. Part 5: The Part of Tens
  1. Chapter 19: Ten Tips for Excel Emigrés
    1. Defining a Vector in R Is Like Naming a Range in Excel
    2. Operating on Vectors Is Like Operating on Named Ranges
    3. Sometimes Statistical Functions Work the Same Way …
    4. … And Sometimes They Don’t
    5. Contrast: Excel and R Work with Different Data Formats
    6. Distribution Functions Are (Somewhat) Similar
    7. A Data Frame Is (Something) Like a Multicolumn Named Range
    8. The sapply() Function Is Like Dragging
    9. Using edit() Is (Almost) Like Editing a Spreadsheet
    10. Use the Clipboard to Import a Table from Excel into R
  2. Chapter 20: Ten Valuable Online R Resources
    1. Websites for R Users
    2. Online Books and Documentation
8. About the Author
9. Connect with Dummies
10. End User License Agreement

Introduction

So you’re holding a statistics book. In my humble (and absolutely biased) opinion, it's not just another statistics book. It’s also not just another R book. I say this for two reasons.

First, many statistics books teach you the concepts but don't give you an easy way to apply them. That often leads to a lack of understanding. Because R is ready-made for statistics, it’s a tool for applying (and learning) statistics concepts.

Second, let’s look at it from the opposite direction: Before I tell you about one of R’s features, I give you the statistical foundation it's based on. That way, you understand that feature when you use it — and you use it more effectively.

I didn’t want to write a book that only covers the details of R and introduces some clever coding techniques. Some of that is necessary, of course, in any book that shows you how to use a software tool like R. My goal was to go way beyond that.

Neither did I want to write a statistics “cookbook”: when-faced-with-problem-category-#152-use-statistical-procedure-#346. My goal was to go way beyond that, too.

Bottom line: This book isn't just about statistics or just about R — it’s firmly at the intersection of the two. In the proper context, R can be a great tool for teaching and learning statistics, and I’ve tried to supply the proper context.

About This Book

Although the field of statistics proceeds in a logical way, I’ve organized this book so that you can open it up in any chapter and start reading. The idea is for you to find the information you're looking for in a hurry and use it immediately — whether it's a statistical concept or an R-related one.

On the other hand, reading from cover to cover is okay if you're so inclined. If you're a statistics newbie and you have to use R to analyze data, I recommend that you begin at the beginning.

Similarity with This Other For Dummies Book

You might be aware that I’ve written another book: Statistical Analysis with Excel For Dummies (Wiley). This is not a shameless plug for that book. (I do that elsewhere.)

I’m just letting you know that the sections in this book that explain statistical concepts are much like the corresponding sections in that one. I use (mostly) the same examples and, in many cases, the same words. I’ve developed that material during decades of teaching statistics and found it to be very effective. (Reviewers seem to like it, too.) Also, if you happen to have read the other book and you’re transitioning to R, the common material might just help you make the switch.

And, you know: If it ain’t broke… .

What You Can Safely Skip

Any reference book throws a lot of information at you, and this one is no exception. I intended for it all to be useful, but I didn't aim it all at the same level. So if you're not deeply into the subject matter, you can avoid paragraphs marked with the Technical Stuff icon.

As you read, you'll run into sidebars. They provide information that elaborates on a topic, but they're not part of the main path. If you're in a hurry, you can breeze past them.

Foolish Assumptions

I'm assuming this much about you:

You know how to work with Windows or the Mac. I don’t describe the details of pointing, clicking, selecting, and other actions.
You’re able to install R and RStudio (I show you how in Chapter 2) and follow along with the examples. I use the Windows version of RStudio, but you should have no problem if you’re working on a Mac.

How This Book Is Organized

I’ve organized this book into five parts and three appendixes (which you can find on this book’s companion website at www.dummies.com/go/statisticalanalysiswithr).

Part 1: Getting Started with Statistical Analysis with R

In Part 1, I provide a general introduction to statistics and to R. I discuss important statistical concepts and describe useful R techniques. If it’s been a long time since your last course in statistics or if you’ve never even had a statistics course, start with Part 1. If you have never worked with R, definitely start with Part 1.

Part 2: Describing Data

Part of working with statistics is to summarize data in meaningful ways. In Part 2, you find out how to do that. Most people know about averages and how to compute them. But that's not the whole story. In Part 2, I tell you about additional statistics that fill in the gaps, and I show you how to use R to work with those statistics. I also introduce R graphics in this part.

Part 3: Drawing Conclusions from Data

Part 3 addresses the fundamental aim of statistical analysis: to go beyond the data and help you make decisions. Usually, the data are measurements of a sample taken from a large population. The goal is to use these data to figure out what's going on in the population.

This opens a wide range of questions: What does an average mean? What does the difference between two averages mean? Are two things associated? These are only a few of the questions I address in Part 3, and I discuss the R functions that help you answer them.

Part 4: Working with Probability

Probability is the basis for statistical analysis and decision-making. In Part 4, I tell you all about it. I show you how to apply probability, particularly in the area of modeling. R provides a rich set of capabilities that deal with probability. Here's where you find them.

Part 5: The Part of Tens

Part V has two chapters. In the first, I give Excel users ten tips for moving to R. In the second, I cover ten statistical- and R-related topics that wouldn't fit in any other chapter.

Online Appendix A: More on Probability

This online appendix continues what I start in Part 4. The material is a bit on the esoteric side, so I’ve stashed it in an appendix.

Online Appendix B: Non-Parametric Statistics

Non-parametric statistics are based on concepts that differ somewhat from most of the rest of the book. In this appendix, you learn these concepts and see how to use R to apply them.

Online Appendix C: Ten Topics That Just Didn't Fit in Any Other Chapter

This is the Grab Bag appendix, where I cover ten statistical- and R-related topics that wouldn't fit in any other chapter.

Icons Used in This Book

Icons appear all over For Dummies books, and this one is no exception. Each one is a little picture in the margin that lets you know something special about the paragraph it sits next to.

This icon points out a hint or a shortcut that can help you in your work (and perhaps make you a finer, kinder, and more insightful human being).

This one points out timeless wisdom to take with you on your continuing quest for statistics knowledge.

Pay attention to the information accompanied by this icon. It's a reminder to avoid something that might gum up the works for you.

As I mention in the earlier section “What You Can Safely Skip,” this icon indicates material you can blow past if it’s just too technical. (I’ve kept this to a minimum.)

Where to Go from Here

You can start reading this book anywhere, but here are a couple of hints. Want to learn the foundations of statistics? Turn the page. Introduce yourself to R? That's Chapter 2. Want to start with graphics? Hit Chapter 3. For anything else, find it in the table of contents or the index and go for it.

In addition to what you’re reading right now, this product comes with a free access-anywhere Cheat Sheet that presents a selected list of R functions and describes what they do. To get this Cheat Sheet, visit www.dummies.com and type Statistical Analysis with R For Dummies Cheat Sheet in the search box.

Part 1

Getting Started with Statistical Analysis with R

IN THIS PART …

Find out about R’s statistical capabilities

Explore how to work with populations and samples

Test your hypotheses

Understand errors in decision-making

Determine independent and dependent variables

Chapter 1

Data, Statistics, and Decisions

IN THIS CHAPTER

check Introducing statistical concepts

check Generalizing from samples to populations

check Getting into probability

check Testing hypotheses

check Two types of error

Statistics? That’s all about crunching numbers into arcane-looking formulas, right? Not really. Statistics, first and foremost, is about decision-making. Some number-crunching is involved, of course, but the primary goal is to use numbers to make decisions. Statisticians look at data and wonder what the numbers are saying. What kinds of trends are in the data? What kinds of predictions are possible? What conclusions can we make?

To make sense of data and answer these questions, statisticians have developed a wide variety of analytical tools.

About the number-crunching part: If you had to do it via pencil-and-paper (or with the aid of a pocket calculator), you’d soon get discouraged with the amount of computation involved and the errors that might creep in. Software like R helps you crunch the data and compute the numbers. As a bonus, R can also help you comprehend statistical concepts.

Developed specifically for statistical analysis, R is a computer language that implements many of the analytical tools statisticians have developed for decision-making. I wrote this book to show how to use these tools in your work.

The Statistical (and Related) Notions You Just Have to Know

The analytical tools that that R provides are based on statistical concepts I help you explore in the remainder of this chapter. As you’ll see, these concepts are based on common sense.

Samples and populations

If you watch TV on election night, you know that one of the main events is the prediction of the outcome immediately after the polls close (and before all the votes are counted). How is it that pundits almost always get it right?

The idea is to talk to a sample of voters right after they vote. If they’re truthful about how they marked their ballots, and if the sample is representative of the population of voters, analysts can use the sample data to draw conclusions about the population.

That, in a nutshell, is what statistics is all about — using the data from samples to draw conclusions about populations.

Here’s another example. Imagine that your job is to find the average height of 10-year-old children in the United States. Because you probably wouldn’t have the time or the resources to measure every child, you’d measure the heights of a representative sample. Then you’d average those heights and use that average as the estimate of the population average.

Estimating the population average is one kind of inference that statisticians make from sample data. I discuss inference in more detail in the upcoming section “Inferential Statistics: Testing Hypotheses.”

Here’s some important terminology: Properties of a population (like the population average) are called parameters, and properties of a sample (like the sample average) are called statistics. If your only concern is the sample properties (like the heights of the children in your sample), the statistics you calculate are descriptive. If you’re concerned about estimating the population properties, your statistics are inferential.

Now for an important convention about notation: Statisticians use Greek letters (μ, σ, ρ) to stand for parameters, and English letters (, s, r) to stand for statistics. Figure 1-1 summarizes the relationship between populations and samples, and between parameters and statistics.

FIGURE 1-1: The relationship between populations, samples, parameters, and statistics.

Variables: Dependent and independent

A variable is something that can take on more than one value — like your age, the value of the dollar against other currencies, or the number of games your favorite sports team wins. Something that can have only one value is a constant. Scientists tell us that the speed of light is a constant, and we use the constant π to calculate the area of a circle.

Statisticians work with independent variables and dependent variables. In any study or experiment, you’ll find both kinds. Statisticians assess the relationship between them.

For example, imagine a computerized training method designed to increase a person’s IQ. How would a researcher find out if this method does what it’s supposed to do? First, he would randomly assign a sample of people to one of two groups. One group would receive the training method, and the other would complete another kind of computer-based activity — like reading text on a website. Before and after each group completes its activities, the researcher measures each person’s IQ. What happens next? I discuss that topic in the upcoming section “Inferential Statistics: Testing Hypotheses.”

For now, understand that the independent variable here is Type of Activity. The two possible values of this variable are IQ Training and Reading Text. The dependent variable is the change in IQ from Before to After.

A dependent variable is what a researcher measures. In an experiment, an independent variable is what a researcher manipulates. In other contexts, a researcher can’t manipulate an independent variable. Instead, he notes naturally occurring values of the independent variable and how they affect a dependent variable.

In general, the objective is to find out whether changes in an independent variable are associated with changes in a dependent variable.

In the examples that appear throughout this book, I show you how to use R to calculate characteristics of groups of scores, or to compare groups of scores. Whenever I show you a group of scores, I'm talking about the values of a dependent variable.

Types of data

When you do statistical work, you can run into four kinds of data. And when you work with a variable, the way you work with it depends on what kind of data it is. The first kind is nominal data. If a set of numbers happens to be nominal data, the numbers are labels – their values don’t signify anything. On a sports team, the jersey numbers are nominal. They just identify the players.

The next kind is ordinal data. In this data-type, the numbers are more than just labels. As the name “ordinal” might tell you, the order of the numbers is important. If I ask you to rank ten foods from the one you like best (one), to the one you like least (ten), we’d have a set of ordinal data.

But the difference between your third-favorite food and your fourth-favorite food might not be the same as the difference between your ninth-favorite and your tenth-favorite. So this type of data lacks equal intervals and equal differences.

Interval data gives us equal differences. The Fahrenheit scale of temperature is a good example. The difference between 30^o and 40^o is the same as the difference between 90^o and 100^o. So each degree is an interval.

People are sometimes surprised to find out that on the Fahrenheit scale, a temperature of 80^o is not twice as hot as 40^o. For ratio statements (“twice as much as”, “half as much as”) to make sense, “zero” has to mean the complete absence of the thing you’re measuring. A temperature of 0^o F doesn’t mean the complete absence of heat – it’s just an arbitrary point on the Fahrenheit scale. (The same holds true for Celsius.)

The fourth kind of data, ratio, provides a meaningful zero point. On the Kelvin Scale of temperature, zero means “absolute zero,” where all molecular motion (the basis of heat) stops. So 200^o Kelvin is twice as hot as 100^o Kelvin. Another example is length. Eight inches is twice as long as four inches. “Zero inches” means “a complete absence of length.”

An independent variable or a dependent variable can be either nominal, ordinal, interval, or ratio. The analytical tools you use depend on the type of data you work with.

A little probability

When statisticians make decisions, they use probability to express their confidence about those decisions. They can never be absolutely certain about what they decide. They can only tell you how probable their conclusions are.

What do we mean by probability? Mathematicians and philosophers might give you complex definitions. In my experience, however, the best way to understand probability is in terms of examples.

Here’s a simple example: If you toss a coin, what’s the probability that it turns up heads? If the coin is fair, you might figure that you have a 50-50 chance of heads and a 50-50 chance of tails. And you’d be right. In terms of the kinds of numbers associated with probability, that’s images .

Think about rolling a fair die (one member of a pair of dice). What’s the probability that you roll a 4? Well, a die has six faces and one of them is 4, so that’s images . Still another example: Select one card at random from a standard deck of 52 cards. What’s the probability that it’s a diamond? A deck of cards has four suits, so that’s images .

These examples tell you that if you want to know the probability that an event occurs, count how many ways that event can happen and divide by the total number of events that can happen. In the first two examples (heads, 4), the event you’re interested in happens only one way. For the coin, we divide one by two. For the die, we divide one by six. In the third example (diamond), the event can happen 13 ways (Ace through King), so we divide 13 by 52 (to get images ).

Now for a slightly more complicated example. Toss a coin and roll a die at the same time. What’s the probability of tails and a 4? Think about all the possible events that can happen when you toss a coin and roll a die at the same time. You could have tails and 1 through 6, or heads and 1 through 6. That adds up to 12 possibilities. The tails-and-4 combination can happen only one way. So the probability is images .

In general, the formula for the probability that a particular event occurs is

At the beginning of this section, I say that statisticians express their confidence about their conclusions in terms of probability, which is why I brought all this up in the first place. This line of thinking leads to conditional probability — the probability that an event occurs given that some other event occurs. Suppose that I roll a die, look at it (so that you don’t see it), and tell you that I rolled an odd number. What’s the probability that I’ve rolled a 5? Ordinarily, the probability of a 5 is images , but “I rolled an odd number” narrows it down. That piece of information eliminates the three even numbers (2, 4, 6) as possibilities. Only the three odd numbers (1,3, 5) are possible, so the probability is images .

What’s the big deal about conditional probability? What role does it play in statistical analysis? Read on.

Inferential Statistics: Testing Hypotheses

Before a statistician does a study, he draws up a tentative explanation — a hypothesis that tells why the data might come out a certain way. After gathering all the data, the statistician has to decide whether or not to reject the hypothesis.

That decision is the answer to a conditional probability question — what’s the probability of obtaining the data, given that this hypothesis is correct? Statisticians have tools that calculate the probability. If the probability turns out to be low, the statistician rejects the hypothesis.

Back to coin-tossing for an example: Imagine that you’re interested in whether a particular coin is fair — whether it has an equal chance of heads or tails on any toss. Let’s start with “The coin is fair” as the hypothesis.

To test the hypothesis, you’d toss the coin a number of times — let’s say, a hundred. These 100 tosses are the sample data. If the coin is fair (as per the hypothesis), you’d expect 50 heads and 50 tails.

If it’s 99 heads and 1 tail, you’d surely reject the fair-coin hypothesis: The conditional probability of 99 heads and 1 tail given a fair coin is very low. Of course, the coin could still be fair and you could, quite by chance, get a 99-1 split, right? Sure. You never really know. You have to gather the sample data (the 100 toss-results) and then decide. Your decision might be right, or it might not.

Juries make these types of decisions. In the United States, the starting hypothesis is that the defendant is not guilty (“innocent until proven guilty”). Think of the evidence as “data.” Jury-members consider the evidence and answer a conditional probability question: What’s the probability of the evidence, given that the defendant is not guilty? Their answer determines the verdict.

Null and alternative hypotheses

Think again about that coin-tossing study I just mentioned. The sample data are the results from the 100 tosses. I said that we can start with the hypothesis that the coin is fair. This starting point is called the null hypothesis. The statistical notation for the null hypothesis is H₀. According to this hypothesis, any heads-tails split in the data is consistent with a fair coin. Think of it as the idea that nothing in the sample data is out of the ordinary.

An alternative hypothesis is possible — that the coin isn't a fair one and it's loaded to produce an unequal number of heads and tails. This hypothesis says that any heads-tails split is consistent with an unfair coin. This alternative hypothesis is called, believe it or not, the alternative hypothesis. The statistical notation for the alternative hypothesis is H₁.

Now toss the coin 100 times and note the number of heads and tails. If the results are something like 90 heads and 10 tails, it's a good idea to reject H₀. If the results are around 50 heads and 50 tails, don't reject H₀.

Similar ideas apply to the IQ example I gave earlier. One sample receives the computer-based IQ training method, and the other participates in a different computer-based activity — like reading text on a website. Before and after each group completes its activities, the researcher measures each person’s IQ. The null hypothesis, H₀, is that one group’s improvement isn't different from the other. If the improvements are greater with the IQ training than with the other activity — so much greater that it's unlikely that the two aren't different from one another — reject H₀. If they're not, don't reject H₀.

Notice that I did not say “accept H₀.” The way the logic works, you never accept a hypothesis. You either reject H₀ or don't reject H₀. In a jury trial, the verdict is either “guilty” (reject the null hypothesis of “not guilty”) or “not guilty” (don’t reject H₀). “Innocent” (acceptance of the null hypothesis) is not a possible verdict.

Notice also that in the coin-tossing example I said “around 50 heads and 50 tails.” What does around mean? Also, I said that if it's 90-10, reject H₀. What about 85-15? 80-20? 70-30? Exactly how much different from 50-50 does the split have to be for you to reject H₀? In the IQ training example, how much greater does the IQ improvement have to be to reject H₀?

I won't answer these questions now. Statisticians have formulated decision rules for situations like this, and we'll explore those rules throughout the book.

Two types of error

Whenever you evaluate data and decide to reject H₀ or to not reject H₀, you can never be absolutely sure. You never really know the “true” state of the world. In the coin-tossing example, that means you can’t be certain if the coin is fair or not. All you can do is make a decision based on the sample data. If you want to know for sure about the coin, you have to have the data for the entire population of tosses — which means you have to keep tossing the coin until the end of time.

Because you're never certain about your decisions, you can make an error either way you decide. As I mention earlier, the coin could be fair and you just happen to get 99 heads in 100 tosses. That's not likely, and that's why you reject H₀ if that happens. It's also possible that the coin is biased, yet you just happen to toss 50 heads in 100 tosses. Again, that’s not likely and you don’t reject H₀ in that case.

Although those errors are not likely, they are possible. They lurk in every study that involves inferential statistics. Statisticians have named them Type I errors and Type II errors.

If you reject H₀ and you shouldn't, that's a Type I error. In the coin example, that's rejecting the hypothesis that the coin is fair, when in reality it is a fair coin.

If you don't reject H₀ and you should have, that's a Type II error. It happens if you don't reject the hypothesis that the coin is fair, and in reality it's biased.

How do you know if you've made either type of error? You don’t — at least not right after you make the decision to reject or not reject H₀. (If it's possible to know, you wouldn't make the error in the first place!) All you can do is gather more data and see if the additional data is consistent with your decision.

If you think of H₀ as a tendency to maintain the status quo and not interpret anything as being out of the ordinary (no matter how it looks), a Type II error means you’ve missed out on something big. In fact, some iconic mistakes are Type II errors.

Here’s what I mean. On New Year’s day in 1962, a rock group consisting of three guitarists and a drummer auditioned in the London studio of a major recording company. Legend has it that the recording executives didn’t like what they heard, didn’t like what they saw, and believed that guitar groups were on the way out. Although the musicians played their hearts out, the group failed the audition.

Who was that group? The Beatles!

And that’s a Type II error.

Chapter 2

R: What It Does and How It Does It

IN THIS CHAPTER

check Getting R and RStudio

check Working with RStudio

check Learning R functions

check Learning R structures

check Working with packages

check Forming R formulas

check Reading and writing files

R is a computer language. It’s a tool for doing the computation and number-crunching that set the stage for statistical analysis and decision-making. An important aspect of statistical analysis is to present the results in a comprehensible way. For this reason, graphics is a major component of R.

Ross Ihaka and Robert Gentleman developed R in the 1990s at the University of Auckland, New Zealand. Supported by the Foundation for Statistical Computing, R is getting more and more popular by the day.

RStudio is an open source integrated development environment (IDE) for creating and running R code. It’s available in versions for Windows, Mac, and Linux. Although you don’t need an IDE in order to work with R, RStudio makes life a lot easier.

Downloading R and RStudio

First things first. Download R from the Comprehensive R Archive Network (CRAN). In your browser, type this address if you work in Windows:

cran.r-project.org/bin/windows/base/

Type this one if you work on the Mac:

cran.r-project.org/bin/macosx/

Click the link to download R. This puts the win.exe file in your Windows computer, or the .pkg file in your Mac. In either case, follow the usual installation procedures. When installation is complete, Windows users see an R icon on their desktop, Mac users see it in their Application folder.

Both URLs provides helpful links to FAQs. The Windows-related URL also links to “Installation and other instructions.”

Now for RStudio. Here’s the URL:

www.rstudio.com/products/rstudio/download

Click the link for the installer for your computer, and again follow the usual installation procedures.

After the RStudio installation is finished, click the RStudio icon to open the window shown in Figure 2-1.

FIGURE 2-1: RStudio, immediately after you install it and click on its icon.

If you already have an older version of RStudio and you go through this installation procedure, the install updates to the latest version (and you don’t have to uninstall the older version).

The large Console pane on the left runs R code. One way to run R code is to type it directly into the Console pane. I show you another way in a moment.

The other two panes provide helpful information as you work with R. The Environment and History pane is in the upper right. The Environment tab keeps track of the things you create (which R calls objects) as you work with R. The History tab tracks R code that you enter.

Get used to the word object. Everything in R is an object.

The Files, Plots, Packages, and Help tabs are in the pane in the lower right. The Files tab shows files you create. The Plots tab holds graphs you create from your data. The Packages tab shows add-ons (called packages) you downloaded as part of the R installation. Bear in mind that “downloaded” doesn’t mean “ready to use.” To use a package’s capabilities, one more step is necessary – and believe me – you’ll want to use packages.

Figure 2-2 shows the Packages tab. The packages are in either the user library (which you can see in the figure) or the system library (which you have to scroll down to). I discuss packages later in this chapter.

The Help tab, shown in Figure 2-3, provides links to a wealth of information about R and RStudio.

To tap into the full power of RStudio as an IDE, click the larger of the two icons in the upper right corner of the Console pane. That changes the appearance of RStudio so that it looks like Figure 2-4.

FIGURE 2-4: RStudio, after you click the larger icon in the upper right corner of the Console pane.

The top of the Console pane relocates to the lower left. The new pane in the upper left is the Scripts pane. You type and edit code in the Scripts pane and press Ctrl+R (Command+Enter on the Mac), and then the code executes in the Console pane.

Ctrl+Enter works just like Ctrl+R. You can also select

Code ⇒ Run Selected Line(s)

A Session with R

Before you start working, select

File ⇒ Save As …

and then save as My First R Session. This relabels the tab in the Scripts pane with the name of the file and adds the .R extension. This also causes the filename (along with the .R extension) to appear on the Files tab.

The working directory

What exactly does R save, and where does R save it? What R saves is called the workspace, which is the environment you’re working in. R saves the workspace in the working directory. In Windows, the default working directory is

C:\Users\<User Name>\Documents

If you ever forget the path to your working directory, type

> getwd()

in the Console pane, and R returns the path onscreen.

In the Console pane, you don’t type the right-pointing arrowhead at the beginning of the line. That’s a prompt.

My working directory looks like this:

> getwd()
[1] “C:/Users/Joseph Schmuller/Documents

Note which way the slashes are slanted. They’re opposite to what you typically see in Windows file paths. This is because R uses \ as an escape character, meaning that whatever follows the \ means something different from what it usually means. For example, \t in R means Tab key.

You can also write a Windows file path in R as

C:\\Users\\<User Name>\\Documents

If you like, you can change the working directory:

> setwd(<file path>)

Another way to change the working directory is to select

Session ⇒ Set Working Directory ⇒ Choose Directory

So let’s get started, already

And now for some R! In the Script window, type

x <- c(3,4,5)

and then Ctrl+R.

That puts the following line into the Console pane:

> x <- c(3,4,5)

As I mention in an earlier Tip, the right-pointing arrowhead (the greater-than sign) is a prompt that R supplies in the Console pane. You don’t see it in the Scripts pane.

What did R just do? The arrow sign says that x gets assigned whatever is to the right of the arrow sign. So the arrow-sign is R’s assignment operator.

To the right of the arrow sign, the c stands for concatenate, a fancy way of saying “Take whatever items are in the parentheses and put them together.” So the set of numbers 3, 4, 5 is now assigned to x.

R refers to a set of numbers like this as a vector. (I tell you more on this in the later “R Structures” section.)

You can read that line of R code as “x gets the vector 3, 4, 5.”

Type x into the Scripts pane and press Ctrl+R, and here’s what you see in the Console pane:

> x
[1] 3 4 5

The 1 in square brackets is the label for the first value in the line of output. Here you have only one value, of course. What happens when R outputs many values over many lines? Each line gets a bracketed numeric label, and the number corresponds to the first value in the line. For example, if the output consists of 21 values and the 18th value is the first one on the second line, the second line begins with [18].

Creating the vector x causes the Environment tab to look like Figure 2-5.

FIGURE 2-5: The RStudio Environment tab, after creating the vector x.

Another way to see the objects in the environment is to type

> ls()

Now you can work with x. First, add all numbers in the vector. Typing

sum(x)

in the Scripts pane (remember to follow with Ctrl+R) executes the following line in the Console pane:

> sum(x)
[1] 12

How about the average of the numbers in the vector x?

That’s

mean(x)

in the Scripts pane, which (when followed by Ctrl+R) executes to

> mean(x)
[1] 4

in the Console pane.

As you type in the Scripts pane or in the Console pane, you’ll notice that helpful information pops up. As you gain experience with RStudio, you’ll learn how to use that information.

As I show you in Chapter 5, variance is a measure of how much a set of numbers differs from their mean. What exactly is variance, and how do you calculate it? I’ll leave that for Chapter 5. For now, here’s how you use R to calculate variance:

> var(x)
[1] 1

In each case, you type a command and R evaluates it and displays the result.

Figure 2-6 shows what RStudio looks like after all these commands.

FIGURE 2-6: RStudio after creating and working with a vector.

To end a session, select File ⇒ Quit Session or press Ctrl+Q. As Figure 2-7 shows, a dialog box opens and asks what you want to save from the session. Saving the selections enables you to reopen the session where you left off the next time you open RStudio (although the Console pane doesn’t save your work).

FIGURE 2-7: The Quit R Session dialog box.

Pretty helpful, this RStudio.

Moving forward, most of the time I don’t say “Type this R code into the Scripts pane and press Ctrl+Enter” whenever I take you through an example. I just show you the code and its output, as in the var() example.

Also, sometimes I show code with the > prompt, and sometimes without. Generally, I show the prompt when I want you to see R code and its results. I don’t show the prompt when I just want you to see R code that I create in the Scripts pane.

Missing data

In the statistical analysis examples I provide, I typically deal with best-case scenarios in which the data sets are in good shape and have all the data they’re supposed to have.

In the real world, however, things don’t always go so smoothly. Oftentimes, you encounter data sets that have values missing for one reason or another. R denotes a missing value as NA (for Not Available).

For example, here is some data (from a much larger data set) on the luggage capacity, in cubic feet, of nine vehicles:

capacity <- c(14,13,14,13,16,NA,NA,20,NA)

Three of the vehicles are vans, and the term luggage capacity doesn’t apply to them — hence, the three instances of NA. Here’s what happens when you try to find the average of this group:

> mean(capacity)
[1] NA

To find the mean, you have to remove the NAs before you calculate:

> mean(capacity, na.rm=TRUE)
[1] 15

So the rm in na.rm means “remove” and =TRUE means “get it done.”

Just in case you ever have to check a set of scores for missing data, the is,na() function does that for you:

> is.na(capacity)
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE

R Functions

In the preceding section, I use c(), sum(), mean(), and var(). These are examples of functions built into R. Each one consists of a function name immediately followed by parentheses. Inside the parentheses are the arguments. In this context, “argument” doesn’t mean “disagreement,” “confrontation,” or anything like that. It’s just the math term for whatever a function operates on.

Even if a function takes no arguments, you still include the parentheses.

The four R functions I’ve shown you are pretty simple in terms of their arguments and their output. As you work with R, however, you encounter functions that take more than one argument.

R provides a couple of ways for you to deal with multiargument functions. One way is to list the arguments in the order in which they appear in the function’s definition. R calls this positional matching.

Here’s what I mean. The function substr() takes three arguments. The first is a string of characters like “abcdefg”, which R refers to as a character vector. The second argument is a start position within the string (1 is the first position, 2 is the second position, and so on). The third is a stop position within the string (a number greater than or equal to the start position). In fact, if you type substr into the Scripts pane, you see a helpful pop-up message that looks like this:

substr(x, start, stop)
Extract or replace substrings in a character vector


where x stands for the character vector.
This function returns the substring, which consists of the characters between the start and stop positions.
Here’s an example: 

> substr("abcdefg",2,4)
[1] "bcd"
What happens if you interchange the 2 and the 4? 

> substr(“abcdefg”,4,2)
[1] “”
This result is completely understandable: No substring can start at the fourth position and stop at the second position.
But if you name the arguments, it doesn’t matter how you order them: 

> substr(“abcdefg”,stop=4,start=2)
[1] “bcd”
Even this works: 

> substr(stop=4, start=2,“abcdefg”)
[1] “bcd”
So when you use a function, you can place its arguments out of order, if you name them. R calls this keyword matching, which comes in handy when you use an R function that has many arguments. If you can’t remember their order, just use their names and the function works.
 If you ever need help for a particular function — substr(), for example — type ?substr and watch helpful information appear on the Help tab.


 User-Defined Functions 
Strictly speaking, this is not a book on R programming. For completeness, though, I thought I’d at least let you know that you can create your own functions in R, and show you the fundamentals of creating one.
The form of an R function is 

myfunction <- function(argument1, argument2, … ){
             statements
             return(object)
             }
Here’s a simple function for computing the sum of the squares of three numbers: 

sumofsquares <- function(x,y,z){
  sumsq <- sum(c(x^2,y^2,z^2))
  return(sumsq)
}
Type that snippet into the Scripts pane and highlight it. Then press Ctrl+R. The following snippet appears in the Console pane: 

> sumofsquares <- function(x,y,z ){
+   sumsq <- sum(c(x^2,y^2,z^2))
+   return(sumsq)
+ }
Each plus-sign is a continuation prompt. It just indicates that a line continues from the preceding line.
And here’s how to use the function: 

> sumofsquares(3,4,5)
[1] 50 

Comments 
A comment is a way of annotating code. Begin a comment with the # symbol, which of course is an octothorpe. (What’s that you say? “Hashtag”? Surely you jest.) This symbol tells R to ignore everything to the right of it.
Comments are very helpful for someone who has to read code that you’ve written. For example: 

sumofsquares <- function(x,y,z){ # list the arguments
  sumsq <- sum(c(x^2,y^2,z^2))  # perform the operations
  return(sumsq)    # return the value
}

 As you might imagine, writing R functions can encompass WAY more than I’ve laid out here. To learn more, check out  by Andrie de Vries and Joris Meys (John Wiley & Sons).