1 Introduction

In this tutorial, we will focus on three R packages from Hadley Wickham that are typically used together: dplyr (split, apply, combine), reshape2 (reshape), and ggplot2 (plot). There are several good tutorials for these packages (see Hadley Wickham’s Introduction to dplyr, Sean Anderson’s Introduction to reshape2, and Hadley Wickham’s website for the reshape. Wickham’s website for ggplot2 describes all of the options and gives many examples. Since there are so many existing resources for dplyr, reshape2, and ggplot2, I thought it would be most helpful to give a brief overview of each, and then jump into some examples where we use them together.

2 Brief Overview

2.1 Split, apply, combine with dplyr

Many common data processing tasks fit into the split, apply, and combine framework, which was popularized in the R community by Hadley Wickham. In this framework, we

  1. Split the dataset into subsets
  2. Apply a function to each subset
  3. Combine the results into either a new:
    • Dataset
    • Column

For more background, see Wickham’s 2011 paper, The Split-Apply-Combine Strategy for Data Analysis. It is easy to implement the split, apply, combine approach with Wickham’s dplyr package, which replaces the plyr package, and is much faster. In dplyr, we

  1. Split using group_by
  2. Apply a function, e.g. mean
  3. Combine into a:
    • New dataset with summarize
    • New column, keeping old columns, with mutate
    • New column, dropping old columns, with transmute

You can chain these functions together using pipes, written as %>%. For example, suppose our data is stored in myData. Then

classSummary <- myData %>% 
  group_by(class) %>%
  summarize(numberOfDogs <- sum(dogs),
            timesDogAteHw <- sum(DogAteHw),
            meanGPA <- mean(GPA)
  )    

does the following:

  1. myData is passed to group_by
  2. group_by subsets myData by class
  3. For each subset,
    • summarize calculates numberOfDogs, timesDogAteHw, and meanGPA
  4. All summaries are combined into a data frame and assigned to classSummary

The result, classSummary, is in wide format, i.e. it has one row per class, and separate columns for each summarized variable (numberOfDogs, timesDogAteHw, meanGPA). Please see below for an example.

2.2 Reshape with reshape2

You will usually need to reshape the dataset from wide to long format before plotting with ggplot2. With reshape2, this is accomplished with the melt function. You can specify the grouping variable with id.vars, and the variables you want to plot with measure.vars. For example, if we wanted to plot the number of dogs by the number of times dogs ate homework, aggregated at the class level, we would first need to melt classSummary.

classSummaryMelt <- melt(classSummary, id.vars = "class", measure.vars = c("numberOfDogs", "timesDogAteHw"))

classSummaryMelt has three columns:

  • class, which identifies the class
  • variable, which identifies the measured variables numberOfDogs and timesDogAteHw
  • value, which contains the value for each class/variable combination

Please see below for an example.

If you don’t specify measure.vars, melt assumes that all variables except for the id.vars are measured variables. Frequently, you don’t need to specify the id.vars either.

You can also use reshape2 to go from long to wide format with dcast for data frames and acast for arrays, though we won’t cover that here. When going from long to wide format, you can aggregate by specifying a fun.aggregate function.

2.3 Plot with ggplot2

ggplot2 is based on Leland Wilkinson’s Grammar of Graphics, which is also the basis for the Bokeh library for Python.

In ggplot2, plots are composed of layers, and each layer must contain at least the following components:

  • Aesthetic (aes), e.g. aes(x = xInData, y = yInData)
  • Data, data = myData
  • Geometric object, e.g. geom_point, geom_line

There are several other options, some of which we’ll use in the examples below. You also don’t have to re-specify all components for each layer, unless they change.

There are two ways to create a plot:

  1. Short syntax with qplot (q for quick)
  2. Full syntax with ggplot

The following are equivalent:

qplot(x = age, y = greyHair, data = myData, geom = "line")

ggplot(aes(x = age, y = greyHair), data = myData)+
  geom_line()

We’ll use both qplot and ggplot below. While qplot is more concise, it isn’t as flexible as ggplot.

3 Examples

In these example, we’ll use several aspects of ggplot2 that I didn’t mention above. I’ll explain these in the workshop as we go.

Let’s take a look at the electricity consumption data in killowat-hours (KWH). We’ll also use the the sampling weights, NWEIGHT, and U.S. Census regions REGIONC. There are a few parsing errors, which I deal with in a separate tutorial. For now, we can ignore the errors.

library(readr)
data <- read_csv("http://www.eia.gov/consumption/residential/data/2009/csv/recs2009_public.csv",
  progress=FALSE)
## Warning: 30 parsing failures.
##  row         col               expected actual
## 1538 GALLONFOOTH no trailing characters   .724
## 1538 BTUFOOTH    no trailing characters   .732
## 1538 DOLFOOTH    no trailing characters   .373
## 3628 NUMCORDS    no trailing characters   .5  
## 3981 GALLONFOOTH no trailing characters   .681
## .... ........... ...................... ......
## .See problems(...) for more details.

Before jumping into the Hadleyverse, let’s use R base functions to plot regional means.

data$REGIONC <- factor(data$REGIONC, labels=c("Northeast","Midwest","South","West"))

kwhMean <- tapply(data$KWH, data$REGIONC, mean)
barplot(kwhMean, ylab="kWh", xlab="Region")

Now let’s use Wickham’s packages.

library(reshape2)
library(dplyr)
library(ggplot2)

# Use dplyr to get both the weighted and unweighted means
KwhByRegion <- data %>%
  group_by(REGIONC) %>%
  summarize(
    weighted = weighted.mean(x = KWH, w = NWEIGHT),
    unweighted = mean(KWH)
  )
KwhByRegion
## Source: local data frame [4 x 3]
## 
##     REGIONC  weighted unweighted
##      (fctr)     (dbl)      (dbl)
## 1 Northeast  8081.680   7933.815
## 2   Midwest 10583.248  11292.783
## 3     South 14560.929  14762.980
## 4      West  9305.907   8991.274
# Use reshape2 to prep for gpplot2
KwhByRegionMelt <- melt(KwhByRegion, id.vars = "REGIONC")
KwhByRegionMelt
##     REGIONC   variable     value
## 1 Northeast   weighted  8081.680
## 2   Midwest   weighted 10583.248
## 3     South   weighted 14560.929
## 4      West   weighted  9305.907
## 5 Northeast unweighted  7933.815
## 6   Midwest unweighted 11292.783
## 7     South unweighted 14762.980
## 8      West unweighted  8991.274
# plot with ggplot
ggplot(aes(x = factor(REGIONC), y = value, fill = variable), data = KwhByRegionMelt)+
  geom_bar(stat = "identity", position = "dodge")+
  theme_bw(16)+
  labs(x = "Region", y = "kWh", title = "Regional Kilowatt-Hour Usage 2009 \n Weighted vs Unweighted Mean")+
  scale_fill_discrete("")

It looks like the South uses a lot of electricity. Is it because they have a lot of cooling degree days CDD65?

ggplot(aes(x = CDD65, y = KWH^(1/4)), data = data)+
  geom_point(alpha = 0.03)+
  geom_smooth()+
  facet_wrap(~REGIONC)+
  theme_bw(18)+
  geom_hline(yintercept = mean(data$KWH^(1/4)), color = "red", linetype = "dashed")+
  labs(y = expression("kWh"^"1/4"), x = "Cooling days (base temp 65 F)")

I used a 1/4 scaling factor to induce normality for KWH, which makes the standard errors more reliable for the smoothing estimates. We’ll talk more about this and geom_smooth below. We’ll also plot KWH on it’s original scale.

For a more direct visual comparison, instead of faceting by region, let’s use color to group by region.

ggplot(aes(x = CDD65, y = KWH^(1/4), color = REGIONC), data = data)+
  geom_point(alpha = 0.075)+
  geom_smooth()+
  theme_bw(18)+
  labs(y = expression("kWh"^"1/4"), x = "Cooling days (base temp 65 F)")+
  scale_color_discrete("Region")

Even after adjusting for cooling days, the south seems to use more electricity.

Now for that 1/4 scaling factor. Let’s look at a histogram of KWH.

qplot(x = data$KWH, geom = "histogram", xlab = "kWh")+
  theme_classic(18)

That’s definitely not a normal distribuiton. Maybe it’s exponential? What happens if we take a log transform?

qplot(x = log(data$KWH), geom = "histogram", xlab = "log(kWh)")+
  theme_classic(18)

That’s still skewed, but raising KWH to a 1/4 seems to induce normality.

qplot(x = data$KWH^(1/4), geom = "histogram", xlab = expression("kWh"^"1/4"))+
  theme_classic(18)

Let’s spiff up the plot by using a kernel density estimator and overlaying a normal distribution in red. geom_density calls the density function from the stats package.

ggplot(aes(x = KWH^(1/4)), data = data)+
  geom_density()+
  stat_function(fun = dnorm, arg = list(mean = mean(data$KWH^(1/4)), sd = sd(data$KWH^(1/4))), 
    color="red")+
  theme_classic(18)+
  labs(x = expression("kWh"^"1/4"), y = "Density")

We could also make the lines thicker and fill in the kernel density estimator.

ggplot(aes(x = KWH^(1/4)), data = data)+
  geom_density(color = "grey", fill = "grey", size = 1)+
  stat_function(fun = dnorm,
    arg = list(mean = mean(data$KWH^(1/4)), sd = sd(data$KWH^(1/4))), 
    color = "red", size = 1)+
  theme_classic(18)+
  labs(x = expression("kWh"^"1/4"), y = "Density")

We can also look at a q-q plot.

ggplot(aes(sample = KWH^(1/4)), data = data)+
  stat_qq()+
  theme_classic(18)

Scaling the outcome is convenient for mathematical reasons, but it makes interpretation difficult. The gamma distribution is a flexible choice for this type of data, and seems to describe the unconditional distribution of KWH well. We should also check the mean-variance relationship and the conditional distribution of KWH given CDD65, but we’ll assume they’re ok for now.

# estimate parameters of gamma with method of moments
mu <- mean(data$KWH)
sigma2 <- var(data$KWH)
alpha <- mu^2/sigma2
beta <- sigma2/mu
    
ggplot(aes(x = KWH), data = data)+
  geom_density(color = "grey", fill = "grey", size = 1)+
  stat_function(fun = dgamma, arg = list(shape = alpha, scale = beta), 
                color = "red", size = 1)+
  theme_classic(18)+
  labs(x = "kWh", y = "Density")

Then we can make the same plots as above, but on the original scale, and ask the gam function to use the gamma distribution to fit the smooth.

ggplot(aes(x = CDD65, y = KWH, color = REGIONC), data = data)+
  geom_point(alpha = 0.075)+
  geom_smooth(method = "gam", method.args = list(family = Gamma))+
  theme_bw(18)+
  labs(y = "kWh", x = "Cooling days (base temp 65 F)")+
  scale_color_discrete("Region")+
  scale_y_continuous(lim = c(0,50000))
## Warning: Removed 29 rows containing non-finite values (stat_smooth).
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : Algorithm did not converge
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : Algorithm did not converge
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : Algorithm did not converge
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : Algorithm did not converge
## Warning: Removed 29 rows containing missing values (geom_point).

The gam algorithm didn’t converge, so these smooths might not be quite right, but they seem reasonable for exploratory purposes. We also didn’t use the sampling weights in the smooth.

For more than 1,000 observations, geom_smooth calls the gam function from Simon Wood’s mgcv package. gam stands for generalized additive model, but in this case, the smooth is a function of only one variable – cooling days. gam uses cubic regression B-splines by default, so the smooths in the plots above are cubic regression B-splines within each region. As an aside, I highly recommend Simon Wood’s textbook, Generalized Additive Models: An Introduction with R.

4 Common mistakes

5 Exercises

dplyr

  1. Alter the dplyr call above to group by both REGIONC and DIVISON.
  2. Use n=n() in summarize to get the number of households in each region/division.
  3. Instead of summarizing the data, add columns with mutate and transmute.

reshape2

  1. Melt the data frame from exercise 2, but use both REGIONC and DIVISON as id.vars.

ggplot2

  1. Make plots that compare KWH across DIVISION instead of REGIONC.
  2. Facet the plots from exercise 5 by REGIONC and add titles.
  3. Smooth on the original scale but assuming that KWH is normally distributed.
  4. Make a few other plots, e.g. bloxplots faceted by REGIONC. See the ggplot2 website for different choices.

6 Final note

More recently, Hadley Wickham came out with the tidyr package. In introducing tidyr Wickham says:

Just as reshape2 did less than reshape, tidyr does less than reshape2. It’s designed specifically for tidying data, not general reshaping. In particular, existing methods only work for data frames, and tidyr never aggregates.


Computing workshop homepage