R Projects For Dummies®
Published by: John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030-5774, www.wiley.com
Copyright © 2018 by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the Publisher. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Trademarks: Wiley, For Dummies, the Dummies Man logo, Dummies.com, Making Everything Easier, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: THE PUBLISHER AND THE AUTHOR MAKE NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF THE CONTENTS OF THIS WORK AND SPECIFICALLY DISCLAIM ALL WARRANTIES, INCLUDING WITHOUT LIMITATION WARRANTIES OF FITNESS FOR A PARTICULAR PURPOSE. NO WARRANTY MAY BE CREATED OR EXTENDED BY SALES OR PROMOTIONAL MATERIALS. THE ADVICE AND STRATEGIES CONTAINED HEREIN MAY NOT BE SUITABLE FOR EVERY SITUATION. THIS WORK IS SOLD WITH THE UNDERSTANDING THAT THE PUBLISHER IS NOT ENGAGED IN RENDERING LEGAL, ACCOUNTING, OR OTHER PROFESSIONAL SERVICES. IF PROFESSIONAL ASSISTANCE IS REQUIRED, THE SERVICES OF A COMPETENT PROFESSIONAL PERSON SHOULD BE SOUGHT. NEITHER THE PUBLISHER NOR THE AUTHOR SHALL BE LIABLE FOR DAMAGES ARISING HEREFROM. THE FACT THAT AN ORGANIZATION OR WEBSITE IS REFERRED TO IN THIS WORK AS A CITATION AND/OR A POTENTIAL SOURCE OF FURTHER INFORMATION DOES NOT MEAN THAT THE AUTHOR OR THE PUBLISHER ENDORSES THE INFORMATION THE ORGANIZATION OR WEBSITE MAY PROVIDE OR RECOMMENDATIONS IT MAY MAKE. FURTHER, READERS SHOULD BE AWARE THAT INTERNET WEBSITES LISTED IN THIS WORK MAY HAVE CHANGED OR DISAPPEARED BETWEEN WHEN THIS WORK WAS WRITTEN AND WHEN IT IS READ.
For general information on our other products and services, please contact our Customer Care Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993, or fax 317-572-4002. For technical support, please visit https://hub.wiley.com/community/support/dummies.
Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.
Library of Congress Control Number: 2017964027
ISBN: 978-1-119-44618-7; 978-1-119-44617-0 (ebk); 978-1-119-44616-3 (ebk)
If you’re like me, you think the best way to learn is by doing. Don’t just read about something — practice it! If you want to be a builder, then build. If you want to be a writer, then write. If you want to be a carpenter, then carpenter. (Yes, that noun and verb are the same. Carpent is not a word.)
I based this book on that learning-by-doing philosophy. My objective is for you to expand your R skill set by using R to complete projects in a variety of areas, and to learn something about those areas, too.
Even with those noble intentions, a book like this one can fall into a trap. It can quickly become a cookbook: Use this package, use these functions, create a graphic — and presto, you’ve finished a project and it’s time to move on.
I didn’t want to write that book. Instead, beginning in Part 2 (which is where the projects start), each chapter does more than just walk you through a project. First, I show you some background material about the subject area, and then (in most chapters) you work through a scaled-down project in that area to get your feet wet, and then you complete a larger project.
But a chapter doesn’t end there. At the end of each chapter, you’ll find a Suggested Project that challenges you to apply your newly minted skills. For each of those, I supply just enough information to get you started. (Wherever necessary, I include tips about potential pitfalls.)
Along the way, you’ll also encounter Quick Suggested Projects. These are based on tweaks to projects you’ve already completed, and they present additional challenges to your growing skill set.
One more thing: Every subject area could be the basis for an entire book, so I can only scratch the surface of each one. Chapter 17 directs you toward resources that provide more information.
I’ve organized this book into six parts.
Part 1 is all about R and RStudio. I discuss R functions, structures, and packages, and I show you how to create a variety of graphics.
The projects begin in Part 2, where you learn to create applications that respond to users. I discuss the shiny package for working with web browsers, and the shinydashboard package for creating dashboards.
Part 3 is the longest part of the book. I begin by telling you about the University of California–Irvine Machine Learning Repository, which provides the data sets for the projects. I also discuss the rattle package for creating machine learning applications. The projects cover decision trees, random forests, support vector machines, k-means clustering, and neural networks.
The two projects in Part 4 deal with larger data sets than you encounter earlier in the book. The first project is a customer segmentation analysis of over 300,000 customers of an online retailer. A follow-up analysis applies machine learning. The second project analyzes a data set of more than 500,000 airline flights.
Two projects are in Part 5. The first is to plot the location (along with other information) of airports in one of the US states. The second shows you how to combine an animated image with a stationary one.
The first chapter in Part 6 provides information about useful packages that can help you with future projects. The second tells you where to learn more about the subject areas of this book.
Any reference book throws a lot of information at you, and this one is no exception. I intended it all to be useful, but I didn't aim it all at the same level. So if you're not deeply into the subject matter, you can avoid paragraphs marked with the Technical Stuff icon, and you can also skip the sidebars.
I'm assuming that you
You’ll find icons in all For Dummies books, and this one is no exception. Each one is a little picture in the margin that lets you know something special about the paragraph it sits next to.
In addition to what you’re reading right now, this product comes with a free access-anywhere Cheat Sheet that presents a selected list of R functions and describes what they do. To get this Cheat Sheet, visit www.dummies.com and type R Projects For Dummies Cheat Sheet in the Search box.
You can start the book anywhere, but here are a couple of hints. Want to introduce yourself to R and packages? You’ll find the info in Chapters 1 and 2. Want to start with graphics? Hit Chapter 3. For anything else, find it in the table of contents or in the index and go for it.
If you’re a cover-to-cover reader, turn the page …
Part 1
IN THIS PART …
Learn about R and RStudio
Understand R Functions and Structures
Create your own R Functions
Examine data
Use base R graphics
Graduate to ggplot2 graphics
Chapter 1
IN THIS CHAPTER
Getting R and RStudio on your computer
Plunging into a session with R
Working with R functions
Working with R structures
So you’re ready to journey into the wonderful world of R! Designed by and for statisticians and data scientists, R has a short but illustrious history.
In the 1990s, Ross Ihaka and Robert Gentleman developed R at the University of Auckland, New Zealand. The R Foundation for Statistical Computing supports R, which is growing more popular by the day.
If you don’t already have R on your computer, the first thing to do is to download R and install it.
You’ll find the appropriate software on the website of the Comprehensive R Archive Network (CRAN). In your browser, type this web address if you work in Windows:
cran.r-project.org/bin/windows/base
Type this one if you work on the Mac:
cran.r-project.org/bin/macosx
Click the link to download R. This puts a win.exe file in your Windows computer or a pkg file in your Mac. In either case, follow the usual installation procedures. When installation is complete, Windows users see two R icons on their desktop, one for 32-bit processors and one for 64-bit processors (pick the one that's right for you). Mac users see an R icon in their Applications folder.
Working with R is a lot easier if you do it through an application called RStudio. Computer honchos refer to RStudio as an IDE (Integrated Development Environment). Think of it as a tool that helps you write, edit, run, and keep track of your R code, and as an environment that connects you to a world of helpful hints about R.
Here’s the web address for this terrific tool:
www.rstudio.com/products/rstudio/download
Click the link for the installer for your computer’s operating system — Windows, Mac, or a flavor of Linux — and again follow the usual installation procedures.
After you finish installing R and RStudio, click on your brand-new RStudio icon to open the window shown in Figure 1-1.
The large Console pane on the left runs R code. One way to run R code is to type it directly into the Console pane. I show you another in a moment.
The other two panes provide helpful information as you work with R. The Environment/History pane is in the upper right. The Environment tab keeps track of the things you create (which R calls objects) as you work with R. The History tab tracks R code that you enter.
Figure 1-2 shows the Packages tab. I discuss packages later in this chapter.
The Help tab, shown in Figure 1-3, links you to a wealth of information about R and RStudio.
To tap into the full power of RStudio as an IDE, click the icon in the upper right corner of the Console pane. That changes the appearance of RStudio so that it looks like Figure 1-4.
The Console pane relocates to the lower left. The new pane in the upper left is the Scripts pane. You type and edit code in the Scripts pane, press Ctrl+R (Command+Enter on the Mac), and then the code executes in the Console pane.
Before you start working, select File ⇒ Save As from the main menu and then save your work file as My First R Session. This relabels the tab in the Scripts pane with the name of the file and adds the .R extension. This also causes the filename (along with the .R extension) to appear on the Files tab.
What exactly does R save, and where does R save it? What R saves is called the workspace, which is the environment you're working in. R saves the workspace in the working directory. In Windows, the default working directory is
C:\Users\<User Name>\Documents
On a Mac, it’s
/Users/<User Name>
If you ever forget the path to your working directory, type
> getwd()
in the Console pane, and R returns the path onscreen.
My working directory looks like this:
> getwd()
[1] "C:/Users/Joseph Schmuller/Documents"
Note the direction the slashes are slanted. They’re opposite to what you typically see in Windows file paths. This is because R uses \ as an escape character: whatever follows the \ means something different from what it usually means. For example, \t in R means a tab character.
You can also write a Windows file path in R with doubled backslashes:
C:\\Users\\<User Name>\\Documents
If you like, you can change the working directory:
> setwd(<file path>)
Another way to change the working directory is to select Session ⇒ Set Working Directory ⇒ Choose Directory from the main menu.
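As a minimal sketch of the working-directory functions just described, the lines below save the current directory, switch to a temporary one, and then switch back. The use of tempdir() as the destination and the names old_wd, new_wd, and example_file are assumptions made for illustration only.

```r
# Remember where the session currently points
old_wd <- getwd()

# setwd() accepts forward slashes on every platform, including Windows.
# tempdir() is used here only because it is a folder that always exists.
setwd(tempdir())
new_wd <- getwd()

# file.path() joins folder and file names with forward slashes
example_file <- file.path(new_wd, "My First R Session.R")

# Go back to the original working directory
setwd(old_wd)
```

Saving the old path first is a small habit that makes it easy to return to where you started.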
Let's get down to business and start writing R code. In the Scripts pane, type
x <- c(5,10,15,20,25,30,35,40)
and then press Ctrl+R.
That puts this line into the Console pane:
> x <- c(5,10,15,20,25,30,35,40)
As I say in an earlier Tip paragraph, the right-pointing arrowhead (the greater-than sign) is a prompt that R puts in the Console pane. You don’t see it in the Scripts pane.
Here’s what R just did: The arrow-sign says that x gets assigned whatever is to the right of the arrow-sign. Think of the arrow-sign as R’s assignment operator. So the set of numbers 5, 10, 15, 20 … 40 is now assigned to x.
You can read that line of code as “x gets the vector 5, 10, 15, 20 … 40.”
Type x into the Scripts pane and press Ctrl+R, and here’s what you see in the Console pane:
> x
[1] 5 10 15 20 25 30 35 40
The 1 in square brackets is the label for the first line of output. So this signifies that 5 is the first value.
Here you have only one line, of course. What happens when R outputs many values over many lines? Each line gets a bracketed numeric label, and the number corresponds to the first value in the line. For example, if the output consists of 23 values and the eighteenth value is the first one on the second line, the second line begins with [18].
Creating the vector x causes the Environment tab to look like Figure 1-5.
Another way to see the objects in the environment is to type
> ls()
[1] "x"
Now you can work with x. First, add all numbers in the vector. Typing sum(x) in the Scripts pane (be sure to follow with Ctrl+R) executes the following line in the Console pane:
> sum(x)
[1] 180
How about the average of the numbers in vector x?
That would involve typing mean(x) in the Scripts pane, which (when followed by Ctrl+R) executes
> mean(x)
[1] 22.5
in the Console pane.
Variance is a measure of how much a set of numbers differ from their mean. Here's how to use R to calculate variance:
> var(x)
[1] 150
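To see where that 150 comes from, here is a small sketch that builds the sample variance by hand: subtract the mean from each number, square the differences, add them up, and divide by one less than the number of values. The names deviations and manual_var are made up for this illustration.

```r
x <- c(5, 10, 15, 20, 25, 30, 35, 40)

# How far each number sits from the mean (22.5)
deviations <- x - mean(x)

# Sample variance: sum of squared deviations divided by (n - 1)
manual_var <- sum(deviations^2) / (length(x) - 1)

manual_var                      # 150
all.equal(manual_var, var(x))   # TRUE
```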
What, exactly, is variance and what does it mean? (Shameless plug alert.) For the answers to these and numerous other questions about statistics and analysis, read one of the most classic works in the English language: Statistical Analysis with R For Dummies (written by yours truly and published by Wiley).
After R executes all these commands, the History tab looks like Figure 1-6.
To end a session, select File ⇒ Quit Session from the main menu or press Ctrl+Q. As Figure 1-7 shows, a dialog box opens and asks what you want to save from the session. Saving the selections enables you to reopen the session where you left off the next time you open RStudio (although the Console pane doesn’t save your work).
The examples in the preceding section use c(), sum(), and var(). These are three functions built into R. Each one consists of a function name immediately followed by parentheses. Inside the parentheses are arguments. In the context of a function, argument doesn't mean “debate” or “disagreement” or anything like that. It’s the math name for whatever a function operates on.
The functions in the examples I showed you are pretty simple: Supply an argument, and each one gives you a result. Some R functions, however, take more than one argument.
R has a couple of ways for you to deal with multi-argument functions. One way is to list the arguments in the order that they appear in the function’s definition. R calls this positional matching.
Here’s an example. Remember when I created the vector x?
x <- c(5,10,15,20,25,30,35,40)
Another way to create a vector of those numbers is with the function seq():
> y <- seq(5,40,5)
> y
[1] 5 10 15 20 25 30 35 40
Think of seq() as creating a “sequence.” The first argument to seq() is the number to start the sequence from (5). The second argument is the number that ends the sequence, the number the sequence goes to (40). The third argument is the increment of the sequence, the amount the sequence increases by (5).
If you name the arguments, it doesn't matter how you order them:
> z <- seq(to=40,by=5,from=5)
> z
[1] 5 10 15 20 25 30 35 40
So when you use a function, you can place its arguments out of order, if you name them. R calls this keyword matching. This comes in handy when you use an R function that has many arguments. If you can’t remember their order, use their names, and the function works.
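The two styles can also be mixed in one call. The sketch below compares them side by side; length.out is another documented seq() argument (the number of values you want), and the variable names are just for illustration.

```r
# Positional: from, to, by in the order seq() defines them
by_position <- seq(5, 40, 5)

# Named: any order works
by_name <- seq(by = 5, to = 40, from = 5)

# Mixed: the unnamed values fill from and to, the named one is matched by name
mixed <- seq(5, 40, length.out = 8)

identical(by_position, by_name)   # TRUE
all.equal(by_position, mixed)     # TRUE
```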
R enables you to create your own functions, and here are the fundamentals on how to do it.
The form of an R function is
myfunction <- function(argument1, argument2, …){
  statements
  return(object)
}
Here’s a function for dealing with right triangles. Remember them? A right triangle has two sides that form a right angle, and a third side called a hypotenuse. You might also remember that a guy named Pythagoras showed that if one side has length a and the other side has length b, the length of the hypotenuse, c, is
$$c = \sqrt{a^2 + b^2}$$
So here’s a simple function called hypotenuse() that takes two numbers a and b (the lengths of the two sides of a right triangle) and returns c, the length of the hypotenuse:
hypotenuse <- function(a,b){
  hyp <- sqrt(a^2+b^2)
  return(hyp)
}
Type that code snippet into the Scripts pane and highlight it. Then press Ctrl+Enter. Here's what appears in the Console pane:
> hypotenuse <- function(a,b){
+ hyp <- sqrt(a^2+b^2)
+ return(hyp)
+ }
Each plus sign is a continuation prompt. It just indicates that a line continues from the preceding line.
And here’s how to use the function:
> hypotenuse(3,4)
[1] 5
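Because sqrt() and ^ work on whole vectors, the same function can handle several triangles in one call. This is a sketch that repeats the definition so that it runs on its own; the side lengths are arbitrary examples.

```r
hypotenuse <- function(a, b) {
  hyp <- sqrt(a^2 + b^2)
  return(hyp)
}

# One triangle, with the arguments matched by name
one_triangle <- hypotenuse(b = 4, a = 3)
one_triangle                                   # 5

# Three triangles at once: sides 3-4, 5-12, and 8-15
many_triangles <- hypotenuse(c(3, 5, 8), c(4, 12, 15))
many_triangles                                 # 5 13 17
```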
A comment is a way of annotating code. Begin a comment with the # symbol, which, as everyone knows, is called an octothorpe. (Wait. What? “Hashtag?” Get outta here!) This symbol tells R to ignore everything to the right of it.
Comments help someone who has to read the code you’ve written. For example:
hypotenuse <- function(a,b){ # list the arguments
  hyp <- sqrt(a^2+b^2) # perform the computation
  return(hyp) # return the value
}
Here’s a heads-up: I don’t typically add comments to lines of code in this book. Instead, I provide detailed descriptions. In a book like this, I feel it’s the best way to get the message across.
As I mention in the “R Functions” section, earlier in this chapter, an R function can have many arguments. An R function can also have many outputs. To understand the possible inputs and outputs, you must understand the structures that R works with.
The vector is the fundamental structure in R. I show it to you in earlier examples. It’s an array of elements of the same type. The data elements in a vector are called components.
To create a vector, use the function c(), as I do in the earlier example:
x <- c(5,10,15,20,25,30,35,40)
In the vector x, of course, the components are numbers.
In a character vector, the components are quoted text strings:
> beatles <- c("john","paul","george","ringo")
It's also possible to have a logical vector, whose components are TRUE and FALSE, or the abbreviations T and F:
> w <- c(T,F,F,T,T,F)
To refer to a specific component of a vector, follow the vector name with a bracketed number:
> beatles[2]
[1] "paul"
Within the brackets, you can use a colon (:) to refer to a range of consecutive components:
> beatles[2:3]
[1] "paul" "george"
Want to refer to nonconsecutive components? That's a bit more complicated, but doable via c():
> beatles[c(2,4)]
[1] "paul" "ringo"
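Brackets can do a little more than pick components by position. The following sketch shows three other standard forms of indexing; the names dropped_first, big_values, and x_changed are made up for the example.

```r
beatles <- c("john", "paul", "george", "ringo")
x <- c(5, 10, 15, 20, 25, 30, 35, 40)

# A negative index drops a component instead of selecting it
dropped_first <- beatles[-1]
dropped_first            # "paul" "george" "ringo"

# A logical test inside the brackets keeps the components that pass it
big_values <- x[x > 25]
big_values               # 30 35 40

# An index on the left of <- changes a component in place
x_changed <- x
x_changed[1] <- 0
x_changed                # 0 10 15 20 25 30 35 40
```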
In addition to c(), R provides two shortcut functions for creating numerical vectors. One, seq(), I showed you earlier:
> y <- seq(5,40,5)
> y
[1] 5 10 15 20 25 30 35 40
Without the third argument, the sequence increases by 1:
> y <- seq(5,40)
> y
[1] 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[20] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
The colon operator produces the same sequence:
> y <- 5:40
> y
[1] 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[20] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
Another function, rep(), creates a vector of repeating values:
> quadrifecta <- c(7,8,4,3)
> repeated_quadrifecta <- rep(quadrifecta,3)
> repeated_quadrifecta
[1] 7 8 4 3 7 8 4 3 7 8 4 3
You can also supply a vector as the second argument:
> rep_vector <-c(1,2,3,4)
> repeated_quadrifecta <- rep(quadrifecta,rep_vector)
The vector specifies the number of repetitions for each element. So here's what happens:
> repeated_quadrifecta
[1] 7 8 8 4 4 4 3 3 3 3
The first element repeats once; the second, twice; the third, three times; and the fourth, four times.
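The second argument of rep() is named times, and rep() also has an each argument. Here is a short sketch of the difference; the result names are just for illustration.

```r
quadrifecta <- c(7, 8, 4, 3)

# times: repeat the whole vector (same as the positional second argument)
whole_twice <- rep(quadrifecta, times = 2)
whole_twice      # 7 8 4 3 7 8 4 3

# each: repeat every element before moving on to the next one
each_twice <- rep(quadrifecta, each = 2)
each_twice       # 7 7 8 8 4 4 3 3
```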
You can use append() to add an item at the end of a vector:
> xx <- c(3,4,5)
> xx
[1] 3 4 5
> xx <- append(xx,6)
> xx
[1] 3 4 5 6
The function prepend() isn't part of base R (it comes from the purrr package), so to add an item at the beginning of a vector, set append()'s after argument to 0:
> xx <- append(xx,2,after=0)
> xx
[1] 2 3 4 5 6
How many items are in a vector? That's
> length(xx)
[1] 5
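Base R offers a couple of other ways to grow a vector. In this sketch, c() glues a value onto the front, and the after argument of append() places a value in the middle. The values and the names front and middle are arbitrary choices for the example.

```r
xx <- c(3, 4, 5, 6)

# c() combines a new value with an existing vector
front <- c(2, xx)
front                # 2 3 4 5 6

# after = 2 inserts the new value between the second and third components
middle <- append(xx, 4.5, after = 2)
middle               # 3 4 4.5 5 6

length(middle)       # 5
```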
A matrix is a 2-dimensional array of data elements of the same type. You can have a matrix of numbers:
5 | 30 | 55 | 80
10 | 35 | 60 | 85
15 | 40 | 65 | 90
20 | 45 | 70 | 95
25 | 50 | 75 | 100
or a matrix of character strings:
“john” | “paul” | “george” | “ringo”
“groucho” | “harpo” | “chico” | “zeppo”
“levi” | “duke” | “larry” | “obie”
The numbers are a 5 (rows) X 4 (columns) matrix. The character strings matrix is 3 X 4.
To create this particular 5 X 4 numerical matrix, first create the vector of numbers from 5 to 100 in steps of 5:
> num_matrix <- seq(5,100,5)
Then you use R’s dim() function to turn the vector into a 2-dimensional matrix:
> dim(num_matrix) <- c(5,4)
> num_matrix
     [,1] [,2] [,3] [,4]
[1,]    5   30   55   80
[2,]   10   35   60   85
[3,]   15   40   65   90
[4,]   20   45   70   95
[5,]   25   50   75  100
Note how R displays the bracketed row numbers along the side, and the bracketed column numbers along the top.
Transposing a matrix interchanges the rows with the columns. The t() function takes care of that:
> t(num_matrix)
     [,1] [,2] [,3] [,4] [,5]
[1,]    5   10   15   20   25
[2,]   30   35   40   45   50
[3,]   55   60   65   70   75
[4,]   80   85   90   95  100
The function matrix() gives you another way to create matrices:
> num_matrix <- matrix(seq(5,100,5),nrow=5)
> num_matrix
     [,1] [,2] [,3] [,4]
[1,]    5   30   55   80
[2,]   10   35   60   85
[3,]   15   40   65   90
[4,]   20   45   70   95
[5,]   25   50   75  100
If you add the argument byrow=T, R fills the matrix by rows, like this:
> num_matrix <- matrix(seq(5,100,5),nrow=5,byrow=T)
> num_matrix
     [,1] [,2] [,3] [,4]
[1,]    5   10   15   20
[2,]   25   30   35   40
[3,]   45   50   55   60
[4,]   65   70   75   80
[5,]   85   90   95  100
How do you refer to a specific matrix component? You type the matrix name and then, in brackets, the row number, a comma, and the column number:
> num_matrix[5,4]
[1] 100
To refer to a whole row (like the third one):
> num_matrix[3,]
[1] 45 50 55 60
and to a whole column (like the second one):
> num_matrix[,2]
[1] 10 30 50 70 90
Although it's a column, R displays it as a row in the Console pane.
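If you would rather keep that column as a column, the bracket operator accepts a drop argument. The sketch below rebuilds the byrow matrix and compares the two behaviors; col_vector, col_matrix, and corner are names invented for the example.

```r
num_matrix <- matrix(seq(5, 100, 5), nrow = 5, byrow = TRUE)

# Default behavior: the column comes back as a plain vector
col_vector <- num_matrix[, 2]
dim(col_vector)        # NULL, because a vector has no dimensions

# drop = FALSE keeps the result as a 5 x 1 matrix
col_matrix <- num_matrix[, 2, drop = FALSE]
dim(col_matrix)        # 5 1

# Several rows and columns at once: rows 1-2, columns 3-4
corner <- num_matrix[1:2, 3:4]
corner
```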
In R, a list is a collection of objects that aren’t necessarily the same type. Suppose you’re putting together some information on the Beatles:
> beatles <- c("john","paul","george","ringo")
One piece of important information might be each Beatle’s age when he joined the group. John and Paul started singing together when they were 17 and 15, respectively, and 14-year-old George joined them soon after. Ringo, a late arrival, became a Beatle when he was 22. So
> ages <- c(17,15,14,22)
To combine the information into a list, you use the list() function:
> beatles_info <- list(names=beatles,age_joined=ages)
Naming each argument (names, age_joined) causes R to use those names as the names of the list components.
And here's what the list looks like:
> beatles_info
$names
[1] "john" "paul" "george" "ringo"
$age_joined
[1] 17 15 14 22
R uses the dollar sign ($) to indicate each component of the list. If you want to refer to a list component, you type the name of the list, the dollar sign, and the component name:
> beatles_info$names
[1] "john" "paul" "george" "ringo"
And to zero in on a particular Beatle, like the fourth one? You can probably figure out that it’s
> beatles_info$names[4]
[1] "ringo"
R also allows you to use criteria inside the brackets. For example, to refer to members of the Fab Four who were older than 16 when they joined:
> beatles_info$names[beatles_info$age_joined > 16]
[1] "john" "ringo"
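The dollar sign isn't the only way into a list. This sketch shows names(), double brackets, and adding a component. The instrument component is an addition made up for this illustration and isn't part of the example above.

```r
beatles_info <- list(
  names = c("john", "paul", "george", "ringo"),
  age_joined = c(17, 15, 14, 22)
)

# The component names of the list
names(beatles_info)               # "names" "age_joined"

# Double brackets pull out a component, just as $ does
ages_again <- beatles_info[["age_joined"]]
ages_again                        # 17 15 14 22

# Assigning to a new name adds a component (hypothetical extra data)
beatles_info$instrument <- c("guitar", "bass", "guitar", "drums")
length(beatles_info)              # 3
```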
A list is a good way to collect data. A data frame is even better. Why? When you think about data for a group of individuals, you typically think in terms of rows that represent the individuals and columns that represent the data variables. And that’s a data frame. If the terms data set or data matrix come to mind, you’ve got the right idea.
Here’s an example. Suppose I have a set of six people:
> name <- c("al","barbara","charles","donna","ellen","fred")
and that I have each person’s height (in inches) and weight (in pounds):
> height <- c(72,64,73,65,66,71)
> weight <- c(195,117,205,122,125,199)
I also tabulate each person’s gender:
> gender <- c("M","F","M","F","F","M")
Before I show you how to combine all these vectors into a data frame, I have to show you one more thing. The components of the gender vector are character strings. For purposes of data summary and analysis, it's a good idea to turn them into categories: the Male category and the Female category. To do this, I use the factor() function:
> factor_gender <-factor(gender)
> factor_gender
[1] M F M F F M
Levels: F M
In the last line of output, Levels is the term that R uses for “categories.”
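A few base R functions let you look inside a factor. This is a quick sketch using the same gender vector; the names gender_counts and gender_codes are made up for the example.

```r
gender <- c("M", "F", "M", "F", "F", "M")
factor_gender <- factor(gender)

# The categories, in alphabetical order by default
levels(factor_gender)                       # "F" "M"

# How many cases fall into each category
gender_counts <- table(factor_gender)
gender_counts                               # F: 3, M: 3

# The integer codes R stores behind the labels (F is 1, M is 2)
gender_codes <- as.integer(factor_gender)
gender_codes                                # 2 1 2 1 1 2
```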
The function data.frame() works with the vectors to create a data frame:
> d <- data.frame(name,factor_gender,height,weight)
> d
     name factor_gender height weight
1      al             M     72    195
2 barbara             F     64    117
3 charles             M     73    205
4   donna             F     65    122
5   ellen             F     66    125
6    fred             M     71    199
Want to know the height of the third person?
> d[3,3]
[1] 73
How about all the information for the fifth person?
> d[5,]
   name factor_gender height weight
5 ellen             F     66    125
Like lists, data frames use the dollar sign. In this context, the dollar sign identifies a column:
> d$height
[1] 72 64 73 65 66 71
You can calculate statistics, like the average height:
> mean(d$height)
[1] 68.5
As is the case with lists, you can put criteria inside the brackets. This is often done with data frames, to summarize and analyze data within categories. To find the average height of the females:
> mean(d$height[d$factor_gender == "F"])
[1] 65
The double equal sign (==) in the brackets is a logical operator. Think of it as “if d$factor_gender is equal to "F".”
The function with() lets you refer to a data frame's columns without the dollar sign. For example,
> with(d,mean(height[factor_gender == "F"]))
is equivalent to
> mean(d$height[d$factor_gender == "F"])
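Criteria in the brackets can also select whole rows, and tapply() can compute a statistic within every category in one call. The sketch below rebuilds the data frame so that it runs on its own; tall and mean_heights are names invented for the example, and tapply() is offered as one possible approach rather than the only one.

```r
d <- data.frame(
  name = c("al", "barbara", "charles", "donna", "ellen", "fred"),
  factor_gender = factor(c("M", "F", "M", "F", "F", "M")),
  height = c(72, 64, 73, 65, 66, 71),
  weight = c(195, 117, 205, 122, 125, 199)
)

# Rows that meet a criterion: everyone taller than 70 inches
tall <- d[d$height > 70, ]
as.character(tall$name)          # "al" "charles" "fred"

# Mean height within each level of factor_gender
mean_heights <- tapply(d$height, d$factor_gender, mean)
mean_heights                     # F: 65, M: 72
```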
How many rows are in a data frame?
> nrow(d)
[1] 6
And how many columns?
> ncol(d)
[1] 4
To add a column to a data frame, I use cbind(). Begin with a vector of scores:
> aptitude <- c(35,20,32,22,18,15)
Then add that vector as a column:
> d.apt <- cbind(d,aptitude)
> d.apt
     name factor_gender height weight aptitude
1      al             M     72    195       35
2 barbara             F     64    117       20
3 charles             M     73    205       32
4   donna             F     65    122       22
5   ellen             F     66    125       18
6    fred             M     71    199       15
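Another common way to add a column is to assign to a new column name with the dollar sign. The sketch below uses a cut-down, three-row data frame (an assumption made to keep the example short) and compares that approach with cbind().

```r
d <- data.frame(
  name = c("al", "barbara", "charles"),
  height = c(72, 64, 73)
)
aptitude <- c(35, 20, 32)

# cbind() returns a new data frame and leaves d untouched
d.apt <- cbind(d, aptitude)

# Assigning to a new column name adds the column to a copy of d
d2 <- d
d2$aptitude <- aptitude

ncol(d)          # 2
ncol(d.apt)      # 3
names(d2)        # "name" "height" "aptitude"
```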
Like many programming languages, R provides a way to iterate through its structures to get things done. R’s way is called the for loop. And, like many languages, R gives you a way to test against a criterion: the if statement.
The general format of a for loop is
for(counter in start:end){
  statement 1
  statement n
}
As you might imagine, counter tracks the iterations.
The simplest general format of an if statement is
if(test){
  statement to execute if test is TRUE
} else{
  statement to execute if test is FALSE
}
Here's an example that incorporates both. I have one vector xx:
> xx
[1] 2 3 4 5 6
And another vector yy with nothing in it at the moment:
> yy <-NULL
I want the components of yy to reflect the components of xx: If a number in xx is an odd number, I want the corresponding component of yy to be "ODD", and if the xx number is even, I want the yy component to be "EVEN".
How do I test a number to see whether it's odd or even? Mathematicians have developed modular arithmetic, which is concerned with the remainder of a division operation. If you divide a by b and the result has a remainder of r, mathematicians say that “a modulo b is r.” So 10 divided by 3 leaves a remainder of 1, and 10 modulo 3 is 1. Typically, modulo gets shortened to mod, so that would be “10 mod 3 = 1.”
Some languages and tools write 10 mod 3 as mod(10,3). (Excel does that, in fact.) R does it differently: R uses the double percent sign (%%) as its mod operator:
> 10 %% 3
[1] 1
> 5 %% 2
[1] 1
> 4 %% 2
[1] 0
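The mod operator has a companion, %/%, that returns the whole-number part of a division. As a small sketch of how the two fit together (the variable names are arbitrary):

```r
a <- 10
b <- 3

quotient  <- a %/% b     # 3, the whole-number part of 10 / 3
remainder <- a %% b      # 1, what is left over

# Quotient and remainder together rebuild the original number
b * quotient + remainder == a      # TRUE

# %% works on a whole vector at once
remainders <- c(2, 3, 4, 5, 6) %% 2
remainders                         # 0 1 0 1 0
```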
I think you're getting the picture: if xx[i] %% 2 == 0, then xx[i] is even. Otherwise, it's odd.
Here, then, is the for loop and the if statement:
for(i in 1:length(xx)){
  if(xx[i] %% 2 == 0){yy[i] <- "EVEN"}
  else{yy[i] <- "ODD"}
}
> yy
[1] "EVEN" "ODD" "EVEN" "ODD" "EVEN"
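Because %% and == work on whole vectors, the same labeling can be done without a loop. The sketch below runs a loop version next to base R's ifelse(); creating yy at its final length with character() and using seq_along() are stylistic choices made here, not requirements.

```r
xx <- c(2, 3, 4, 5, 6)

# Loop version, with yy created at its final length up front
yy <- character(length(xx))
for (i in seq_along(xx)) {
  if (xx[i] %% 2 == 0) {
    yy[i] <- "EVEN"
  } else {
    yy[i] <- "ODD"
  }
}

# Vectorized version: ifelse() tests every component in one call
yy_vectorized <- ifelse(xx %% 2 == 0, "EVEN", "ODD")

identical(yy, yy_vectorized)   # TRUE
```

For short vectors either version is fine; the vectorized form is simply less to type.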
Chapter 2
IN THIS CHAPTER
Installing packages
Examining data
Exploring a tidy little universe
A package is a collection of functions and data that augments R. If you’re looking for data to work with, you’ll find many data frames in R packages. If you’re looking for a specialized function that’s not in the basic R installation, you can probably find it in a package.