Amelia McNamara
June 8, 2017
There are many ways to make graphics in R.
ggplot2 is an R package by Hadley Wickham that lets you make beautiful R graphics (relatively) easily.
It’s part of the tidyverse, which I recommend everyone get to know (dplyr, stringr, lubridate, broom… and many more).
The name ggplot2 refers to The Grammar of Graphics, a book by Leland Wilkinson, and the package is an implementation of Wilkinson’s ideas in R.
Let’s start by going through the intro to R and RStudio lab. You’re going to learn lots more about R as the weeks progress, but we need you to have a few basic skills for this seminar.
R has many “packages,” which are add-ons to the basic functionality of the language. To use a package, you need to install it (once) and load it (every time you want to use it).
install.packages("ggplot2")
library(ggplot2)
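Since installing only has to happen once, a common pattern is to check whether the package is already there before installing it. This is just a minimal sketch using base R's requireNamespace(); running the two lines above by hand works equally well.

```r
# Install ggplot2 only if it is not already installed, then load it.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
library(ggplot2)
```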
To start, I’m going to load some data and take a quick look at it with glimpse(), which comes from the dplyr package,
library(dplyr)
arbuthnot <- read.csv("http://www.openintro.org/stat/data/arbuthnot.csv", header=TRUE)
glimpse(arbuthnot)
## Observations: 82
## Variables: 3
## $ year <int> 1629, 1630, 1631, 1632, 1633, 1634, 1635, 1636, 1637, 16...
## $ boys <int> 5218, 4858, 4422, 4994, 5158, 5035, 5106, 4917, 4703, 53...
## $ girls <int> 4683, 4457, 4102, 4590, 4839, 4820, 4928, 4605, 4457, 49...
qplot(x = year, y = girls, data = arbuthnot)
In order to get qplot() to work, you need to list the variable(s) you want to plot, and then tell R where to “look” for those variables with “data=”.
Since it’s a quick plot, R will guess what kind of mapping you want for your variables.
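To see the guessing in action, here is a small sketch. It uses a hand-typed data frame, arb_small (a name made up for this example), holding the first nine rows visible in the glimpse() output, so it runs without a download. Be aware that recent ggplot2 releases (3.4.0 and later) mark qplot() as deprecated, so you may see a warning when you run it.

```r
library(ggplot2)

# First nine rows of the Arbuthnot data, typed in by hand (illustrative subset)
arb_small <- data.frame(
  year  = 1629:1637,
  boys  = c(5218, 4858, 4422, 4994, 5158, 5035, 5106, 4917, 4703),
  girls = c(4683, 4457, 4102, 4590, 4839, 4820, 4928, 4605, 4457)
)

# Two variables: qplot() guesses a scatterplot
q_two <- qplot(x = year, y = girls, data = arb_small)

# One numeric variable: qplot() guesses a histogram
q_one <- qplot(x = girls, data = arb_small)

# Peek at which geom each plot ended up with
class(q_two$layers[[1]]$geom)[1]
class(q_one$layers[[1]]$geom)[1]
```

The guess is convenient for a first look, but you have no say in it unless you pass extra arguments, which is one reason to move on to ggplot().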
But, in order to really harness the power of ggplot2 you need to use the more general ggplot() command. The idea of the package is you can “layer” pieces on top of a plot to build it up over time.
You always need to use a ggplot() call to initialize the plot. I usually put my dataset in here, and at least some of my “aesthetics.” But, one of the things that can make ggplot2 tough to understand is that there are no hard and fast rules.
p1 <- ggplot(aes(x=year, y=girls), data=arbuthnot)
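If you try to show p1 at this point, you will not see any data. Older versions of ggplot2 stop with “Error: No layers in plot,” and newer versions draw an empty panel with just the axes. This is because we haven’t given it any geometric objects yet.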
In order to get a plot to work, you need to use “geoms” (geometric objects) to specify the way you want your variables mapped to graphical parameters.
p1 + geom_point()
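The layering idea can be checked directly by counting the layers a plot object carries. The sketch below is only illustrative: arb_small is a hand-typed subset of the Arbuthnot data (the first nine rows), and base, with_points, and with_both are names made up for this example.

```r
library(ggplot2)

# First nine rows of the Arbuthnot data, typed in by hand (illustrative subset)
arb_small <- data.frame(
  year  = 1629:1637,
  girls = c(4683, 4457, 4102, 4590, 4839, 4820, 4928, 4605, 4457)
)

# Initialize the plot: data and aesthetics, but no geoms yet
base <- ggplot(aes(x = year, y = girls), data = arb_small)
length(base$layers)

# Each geom added with + becomes one more layer
with_points <- base + geom_point()
with_both   <- with_points + geom_line()
length(with_both$layers)

# Draw the points with a line on top
with_both
```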
There is more than one way to write the same plot. Spelled out in full, without p1:
ggplot(aes(x=year, y=girls), data=arbuthnot) + geom_point()
Or
ggplot() + geom_point(aes(x=year, y=girls), data=arbuthnot)
Or
ggplot(arbuthnot) + geom_point(aes(x=year, y=girls))
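These forms draw the same picture, but they differ in what later layers can inherit. Data and aesthetics given to ggplot() are shared by every layer, while those given inside a geom belong to that geom alone. The sketch below checks both points on a hand-typed subset of the data; arb_small, p_a, p_b, p_c, and the helper xy() are names made up for this example.

```r
library(ggplot2)

# First nine rows of the Arbuthnot data, typed in by hand (illustrative subset)
arb_small <- data.frame(
  year  = 1629:1637,
  girls = c(4683, 4457, 4102, 4590, 4839, 4820, 4928, 4605, 4457)
)

# Three ways of writing the same scatterplot
p_a <- ggplot(aes(x = year, y = girls), data = arb_small) + geom_point()
p_b <- ggplot() + geom_point(aes(x = year, y = girls), data = arb_small)
p_c <- ggplot(arb_small) + geom_point(aes(x = year, y = girls))

# Helper: the x/y positions ggplot2 computed for the first layer
xy <- function(p) layer_data(p)[, c("x", "y")]

# The plotted positions agree across all three forms
all.equal(xy(p_a), xy(p_b))
all.equal(xy(p_a), xy(p_c))

# A second layer inherits data and aes only when they live in ggplot()
p_a + geom_line()
```

In the second form, a bare geom_line() would have nothing to inherit, so you would have to repeat the data and aesthetics inside it.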
p1 + geom_bin2d()
Notice that I haven’t been saving these geom layers; I’m just running p1 + [something] to see what happens. But, I can save the new version to start building up my plot,
p2 <- p1 + geom_point()
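# x is year and y is girls, so the axis labels need to match
p2 <- p2 + xlab("Year") + ylab("Number of girls born") +
  # guides(fill = ...) only retitles a fill legend, which a point layer does not have
  guides(fill=guide_legend(title="Number of births from Arbuthnot data"))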
p2
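If you would rather set all the labels in one place, labs() takes the axis labels and a title in a single call. This is a sketch of an alternative, not a replacement for the code above: arb_small is a hand-typed subset of the data and p_lab is a name made up for this example, with x holding the year and y the number of girls as in p1.

```r
library(ggplot2)

# First nine rows of the Arbuthnot data, typed in by hand (illustrative subset)
arb_small <- data.frame(
  year  = 1629:1637,
  girls = c(4683, 4457, 4102, 4590, 4839, 4820, 4928, 4605, 4457)
)

# Set the axis labels and the plot title together
p_lab <- ggplot(aes(x = year, y = girls), data = arb_small) +
  geom_point() +
  labs(
    x = "Year",
    y = "Number of girls born",
    title = "Number of births from Arbuthnot data"
  )
p_lab
```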
For your lab, you are going to play with the American Time Use Survey data.
atus <- read.csv("https://raw.githubusercontent.com/AmeliaMN/SummerDataViz/master/IntroToViz/atus.csv", header=T)
The ATUS is a product of the Bureau of Labor Statistics. Each row is a person, and each variable is some information about that person. The first few variables are demographic, and the rest are the number of minutes per day (on average) the person spends on a variety of activities.
What is the strongest relationship between two variables you can find in the data?
Make a scatterplot of atus data, but remove the grey-and-white background.
Make a plot (or plots) to help you determine which state has the most veterans.
Are most veterans married or not?