In this tutorial, we will focus on three R packages from Hadley Wickham that are typically used together: dplyr (split, apply, combine), reshape2 (reshape), and ggplot2 (plot). There are several good tutorials for these packages (see Hadley Wickham’s Introduction to dplyr, Sean Anderson’s Introduction to reshape2, and Hadley Wickham’s website for reshape). Wickham’s website for ggplot2 describes all of the options and gives many examples. Since there are so many existing resources for dplyr, reshape2, and ggplot2, I thought it would be most helpful to give a brief overview of each, and then jump into some examples where we use them together.
dplyr
Many common data processing tasks fit into the split, apply, and combine framework, which was popularized in the R community by Hadley Wickham. In this framework, we
- split the data into groups,
- apply a function to each group, and
- combine the results.
For more background, see Wickham’s 2011 paper, The Split-Apply-Combine Strategy for Data Analysis. It is easy to implement the split, apply, combine approach with Wickham’s dplyr package, which replaces the plyr package, and is much faster. In dplyr, we
- split the data with group_by,
- apply a function, e.g. mean, and
- combine the results with summarize (one row per group) or mutate/transmute (one row per original observation).
You can chain these functions together using pipes, written as %>%. For example, suppose our data is stored in myData. Then
classSummary <- myData %>%
  group_by(class) %>%
  summarize(numberOfDogs = sum(dogs),
            timesDogAteHw = sum(DogAteHw),
            meanGPA = mean(GPA)
            )
does the following:
- myData is passed to group_by
- group_by subsets myData by class
- summarize calculates numberOfDogs, timesDogAteHw, and meanGPA for each class
- the result is stored in classSummary
The result, classSummary, is in wide format, i.e. it has one row per class, and separate columns for each summarized variable (numberOfDogs, timesDogAteHw, meanGPA). Please see below for an example.
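To make this concrete, here is a minimal sketch with a hypothetical myData (the columns and values are made up for illustration):

library(dplyr)

# Hypothetical toy data: one row per student
myData <- data.frame(
  class    = c("art", "art", "math"),
  dogs     = c(1, 1, 1),
  DogAteHw = c(1, 0, 0),
  GPA      = c(3.2, 3.8, 3.5)
)

classSummary <- myData %>%
  group_by(class) %>%
  summarize(numberOfDogs = sum(dogs),
            timesDogAteHw = sum(DogAteHw),
            meanGPA = mean(GPA))

classSummary
## wide format, one row per class:
##   class numberOfDogs timesDogAteHw meanGPA
##     art            2             1     3.5
##    math            1             0     3.5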
reshape2
You will usually need to reshape the dataset from wide to long format before plotting with ggplot2. With reshape2, this is accomplished with the melt function. You can specify the grouping variable with id.vars, and the variables you want to plot with measure.vars. For example, if we wanted to plot the number of dogs by the number of times dogs ate homework, aggregated at the class level, we would first need to melt classSummary.
classSummaryMelt <- melt(classSummary, id.vars = "class", measure.vars = c("numberOfDogs", "timesDogAteHw"))
classSummaryMelt has three columns:
- class, which identifies the class
- variable, which identifies the measured variables numberOfDogs and timesDogAteHw
- value, which contains the value for each class/variable combination
Please see below for an example.
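Continuing the hypothetical classSummary from the sketch above:

library(reshape2)

classSummaryMelt <- melt(classSummary,
                         id.vars = "class",
                         measure.vars = c("numberOfDogs", "timesDogAteHw"))
classSummaryMelt
## long format, one row per class/variable combination:
##   class      variable value
##     art  numberOfDogs     2
##    math  numberOfDogs     1
##     art timesDogAteHw     1
##    math timesDogAteHw     0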
If you don’t specify measure.vars, melt assumes that all variables except for the id.vars are measured variables. Frequently, you don’t need to specify the id.vars either.
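For example, on the hypothetical classSummary, both of the following work (note that meanGPA then gets melted too, since every non-id column is treated as a measured variable):

# All columns except the id.vars become measured variables:
melt(classSummary, id.vars = "class")
# Character/factor columns are used as id.vars by default, so this is equivalent here:
melt(classSummary)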
You can also use reshape2 to go from long to wide format with dcast for data frames and acast for arrays, though we won’t cover that here. When going from long to wide format, you can aggregate by specifying a fun.aggregate function.
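For reference, here is a minimal dcast sketch on the hypothetical classSummaryMelt; the formula is rows ~ columns, and value.var names the column holding the values:

# Back to wide format: one row per class, one column per melted variable
dcast(classSummaryMelt, class ~ variable, value.var = "value")
# If several rows map to the same class/variable cell, aggregate them:
dcast(classSummaryMelt, class ~ variable, value.var = "value", fun.aggregate = mean)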
ggplot2
ggplot2 is based on Leland Wilkinson’s Grammar of Graphics, which is also the basis for the Bokeh library for Python.
In ggplot2, plots are composed of layers, and each layer must contain at least the following components:
- an aesthetic mapping (aes), e.g. aes(x = xInData, y = yInData)
- data, e.g. data = myData
- a geometric object, e.g. geom_point, geom_line
There are several other options, some of which we’ll use in the examples below. You also don’t have to re-specify all components for each layer, unless they change.
There are two ways to create a plot:
- qplot (q for quick)
- ggplot
The following are equivalent:
qplot(x = age, y = greyHair, data = myData, geom = "line")
ggplot(aes(x = age, y = greyHair), data = myData)+
geom_line()
We’ll use both qplot and ggplot below. While qplot is more concise, it isn’t as flexible as ggplot.
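For example, since layers inherit the components they don’t re-specify, adding a second geom to the hypothetical plot above is just one more line:

ggplot(aes(x = age, y = greyHair), data = myData)+
geom_line()+
geom_point()  # inherits data and aes from above; only the geom changes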
In these examples, we’ll use several aspects of ggplot2 that I didn’t mention above. I’ll explain these in the workshop as we go.
Let’s take a look at the electricity consumption data in kilowatt-hours (KWH). We’ll also use the sampling weights, NWEIGHT, and the U.S. Census regions, REGIONC. There are a few parsing errors, which I deal with in a separate tutorial. For now, we can ignore the errors.
library(readr)
data <- read_csv("http://www.eia.gov/consumption/residential/data/2009/csv/recs2009_public.csv",
                 progress = FALSE)
## Warning: 30 parsing failures.
## row col expected actual
## 1538 GALLONFOOTH no trailing characters .724
## 1538 BTUFOOTH no trailing characters .732
## 1538 DOLFOOTH no trailing characters .373
## 3628 NUMCORDS no trailing characters .5
## 3981 GALLONFOOTH no trailing characters .681
## .... ........... ...................... ......
## See problems(...) for more details.
Before jumping into the Hadleyverse, let’s use R base functions to plot regional means.
data$REGIONC <- factor(data$REGIONC, labels=c("Northeast","Midwest","South","West"))
kwhMean <- tapply(data$KWH, data$REGIONC, mean)
barplot(kwhMean, ylab="kWh", xlab="Region")
Now let’s use Wickham’s packages.
library(reshape2)
library(dplyr)
library(ggplot2)
# Use dplyr to get both the weighted and unweighted means
KwhByRegion <- data %>%
  group_by(REGIONC) %>%
  summarize(
    weighted = weighted.mean(x = KWH, w = NWEIGHT),
    unweighted = mean(KWH)
  )
KwhByRegion
## Source: local data frame [4 x 3]
##
## REGIONC weighted unweighted
## (fctr) (dbl) (dbl)
## 1 Northeast 8081.680 7933.815
## 2 Midwest 10583.248 11292.783
## 3 South 14560.929 14762.980
## 4 West 9305.907 8991.274
# Use reshape2 to prep for ggplot2
KwhByRegionMelt <- melt(KwhByRegion, id.vars = "REGIONC")
KwhByRegionMelt
## REGIONC variable value
## 1 Northeast weighted 8081.680
## 2 Midwest weighted 10583.248
## 3 South weighted 14560.929
## 4 West weighted 9305.907
## 5 Northeast unweighted 7933.815
## 6 Midwest unweighted 11292.783
## 7 South unweighted 14762.980
## 8 West unweighted 8991.274
# plot with ggplot
ggplot(aes(x = factor(REGIONC), y = value, fill = variable), data = KwhByRegionMelt)+
geom_bar(stat = "identity", position = "dodge")+
theme_bw(16)+
labs(x = "Region", y = "kWh", title = "Regional Kilowatt-Hour Usage 2009 \n Weighted vs Unweighted Mean")+
scale_fill_discrete("")
It looks like the South uses a lot of electricity. Is it because they have a lot of cooling degree days (CDD65)?
ggplot(aes(x = CDD65, y = KWH^(1/4)), data = data)+
geom_point(alpha = 0.03)+
geom_smooth()+
facet_wrap(~REGIONC)+
theme_bw(18)+
geom_hline(yintercept = mean(data$KWH^(1/4)), color = "red", linetype = "dashed")+
labs(y = expression("kWh"^"1/4"), x = "Cooling days (base temp 65 F)")
I used a 1/4 power transformation to induce approximate normality for KWH, which makes the standard errors more reliable for the smoothing estimates. We’ll talk more about this and geom_smooth below. We’ll also plot KWH on its original scale.
For a more direct visual comparison, instead of faceting by region, let’s use color to group by region.
ggplot(aes(x = CDD65, y = KWH^(1/4), color = REGIONC), data = data)+
geom_point(alpha = 0.075)+
geom_smooth()+
theme_bw(18)+
labs(y = expression("kWh"^"1/4"), x = "Cooling days (base temp 65 F)")+
scale_color_discrete("Region")
Even after adjusting for cooling days, the South seems to use more electricity.
Now for that 1/4 power transformation. Let’s look at a histogram of KWH.
qplot(x = data$KWH, geom = "histogram", xlab = "kWh")+
theme_classic(18)
That’s definitely not a normal distribution. Maybe it’s exponential? What happens if we take a log transform?
qplot(x = log(data$KWH), geom = "histogram", xlab = "log(kWh)")+
theme_classic(18)
That’s still skewed, but raising KWH to the 1/4 power seems to induce normality.
qplot(x = data$KWH^(1/4), geom = "histogram", xlab = expression("kWh"^"1/4"))+
theme_classic(18)
Let’s spiff up the plot by using a kernel density estimator and overlaying a normal distribution in red. geom_density calls the density function from the stats package.
ggplot(aes(x = KWH^(1/4)), data = data)+
geom_density()+
stat_function(fun = dnorm, args = list(mean = mean(data$KWH^(1/4)), sd = sd(data$KWH^(1/4))),
              color = "red")+
theme_classic(18)+
labs(x = expression("kWh"^"1/4"), y = "Density")
We could also make the lines thicker and fill in the kernel density estimator.
ggplot(aes(x = KWH^(1/4)), data = data)+
geom_density(color = "grey", fill = "grey", size = 1)+
stat_function(fun = dnorm,
              args = list(mean = mean(data$KWH^(1/4)), sd = sd(data$KWH^(1/4))),
              color = "red", size = 1)+
theme_classic(18)+
labs(x = expression("kWh"^"1/4"), y = "Density")
We can also look at a q-q plot.
ggplot(aes(sample = KWH^(1/4)), data = data)+
stat_qq()+
theme_classic(18)
Transforming the outcome is convenient for mathematical reasons, but it makes interpretation difficult. The gamma distribution is a flexible choice for this type of data, and it seems to describe the unconditional distribution of KWH well. We should also check the mean-variance relationship and the conditional distribution of KWH given CDD65, but we’ll assume they’re OK for now.
# Estimate parameters of the gamma with the method of moments.
# For a gamma with shape alpha and scale beta: mean = alpha*beta and
# var = alpha*beta^2, so alpha = mean^2/var and beta = var/mean.
mu <- mean(data$KWH)
sigma2 <- var(data$KWH)
alpha <- mu^2/sigma2
beta <- sigma2/mu
ggplot(aes(x = KWH), data = data)+
geom_density(color = "grey", fill = "grey", size = 1)+
stat_function(fun = dgamma, args = list(shape = alpha, scale = beta),
              color = "red", size = 1)+
theme_classic(18)+
labs(x = "kWh", y = "Density")
Then we can make the same plots as above, but on the original scale, and ask the gam function to use the gamma distribution to fit the smooth.
ggplot(aes(x = CDD65, y = KWH, color = REGIONC), data = data)+
geom_point(alpha = 0.075)+
geom_smooth(method = "gam", method.args = list(family = Gamma))+
theme_bw(18)+
labs(y = "kWh", x = "Cooling days (base temp 65 F)")+
scale_color_discrete("Region")+
scale_y_continuous(limits = c(0, 50000))
## Warning: Removed 29 rows containing non-finite values (stat_smooth).
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : Algorithm did not converge
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : Algorithm did not converge
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : Algorithm did not converge
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : Algorithm did not converge
## Warning: Removed 29 rows containing missing values (geom_point).
The gam algorithm didn’t converge, so these smooths might not be quite right, but they seem reasonable for exploratory purposes. We also didn’t use the sampling weights in the smooth.
For more than 1,000 observations, geom_smooth calls the gam function from Simon Wood’s mgcv package. gam stands for generalized additive model, but in this case, the smooth is a function of only one variable, cooling days. geom_smooth’s default gam formula uses penalized cubic regression splines, so the smooths in the plots above are penalized cubic regression splines within each region. As an aside, I highly recommend Simon Wood’s textbook, Generalized Additive Models: An Introduction with R.
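To make that concrete, here is a hypothetical sketch of roughly the fit geom_smooth computes within one region; the bs = "cs" basis mirrors geom_smooth’s default gam formula (y ~ s(x, bs = "cs")), and the subset and variable choices are mine:

library(mgcv)

# A sketch of roughly what geom_smooth fits within one facet/group:
south <- subset(data, REGIONC == "South")
fit <- gam(KWH^(1/4) ~ s(CDD65, bs = "cs"), data = south)
summary(fit)  # inspect the smooth term for cooling days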
color = "green"
inside aes()
. This does not make your points green.dplyr
dplyr
call above to group by both REGIONC
and DIVISON
.n=n()
in summarize
to get the number of households in each region/division.mutate
and transmute
.reshape2
REGIONC
and DIVISON
as id.vars
.ggplot2
KWH
across DIVISION
instead of REGIONC
.REGIONC
and add titles.KWH
is normally distributed.REGIONC
. See the ggplot2
website for different choices.More recently, Hadley Wickham came out with the tidyr
package. In introducing tidyr
Wickham says:
Just as reshape2 did less than reshape, tidyr does less than reshape2. It’s designed specifically for tidying data, not general reshaping. In particular, existing methods only work for data frames, and tidyr never aggregates.
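For example, here is a sketch of the melt step above in tidyr’s (pre-pivot) interface:

library(tidyr)

# Gather every column except REGIONC into variable/value pairs,
# mirroring melt(KwhByRegion, id.vars = "REGIONC"):
KwhByRegionLong <- gather(KwhByRegion, variable, value, -REGIONC)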