1 Introduction

2 Overview

2.1 `dplyr`

To do: Introduce dplyr (focus is on readable syntax and organizing the analyst’s tasks).

2.2 `data.table`

To do: Introduce data.table (focus is on speed, memory, and concise syntax).

3 Example

We’ll bootstrap the mean kilowatt-hours (KWH), grouped by census region, in the 2009 RECS data. For comparisons on larger datasets, please see Matt Dowle’s benchmarks, which also compare against the pandas library for Python.

parent <- dirname(getwd())
dataPath <- file.path(parent, "data")

# set the number of bootstrap resamples
B <- 10000

3.1 Base R

data <- read.csv(file.path(dataPath, "recs2009.csv"))
n <- nrow(data)

bootBase <- array(NA, dim=c(B,4))

t_base <- system.time(for (b in 1:B){
  temp <- data[sample(n, replace=TRUE), c("KWH", "REGIONC")]
  bootBase[b,] <- tapply(temp$KWH, temp$REGIONC, mean)
})[3]
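
To try the loop above without the RECS file, here is a minimal sketch on simulated data. The data frame `sim`, its size, and the gamma-distributed `KWH` values are made up for illustration; only the column names `KWH` and `REGIONC` mirror the real file.

```r
set.seed(1)

# Hypothetical stand-in for the RECS data: 4 regions, made-up KWH values
nSim <- 500
sim <- data.frame(
  KWH = rgamma(nSim, shape = 2, scale = 5000),
  REGIONC = sample(1:4, nSim, replace = TRUE)
)

Bsim <- 200  # small number of resamples, for a quick test
bootSim <- array(NA, dim = c(Bsim, 4))

for (b in 1:Bsim) {
  # resample rows with replacement, then take the mean within each region
  temp <- sim[sample(nSim, replace = TRUE), c("KWH", "REGIONC")]
  bootSim[b, ] <- tapply(temp$KWH, temp$REGIONC, mean)
}

# bootstrap standard error of the mean, one value per region
seSim <- apply(bootSim, 2, sd)
```

Each row of `bootSim` holds the four regional means from one resample, so the column-wise spread approximates the sampling variability of each regional mean.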

3.2 `dplyr`

Note: I don’t think this is the best way to bootstrap with dplyr. See the broom vignette for more information.

library(readr)
library(dplyr)

data <- read_csv(file.path(dataPath, "recs2009.csv"), progress=FALSE)
n <- nrow(data)

bootDplyr <- array(NA, dim=c(B,4))

t_dplyr <- system.time(for (b in 1:B){
  temp <- data %>% select(KWH, REGIONC) %>%
    sample_n(n, replace=TRUE) %>%
    group_by(REGIONC) %>%
    summarise(mean(KWH))
  bootDplyr[b,] <- unlist(temp[,2])
})[3]

3.3 `data.table`

library(data.table)

data <- fread(file.path(dataPath, "recs2009.csv"))
n <- nrow(data)

bootData <- array(NA, dim=c(B,4))

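t_data <- system.time(for (b in 1:B){
  # `by` returns groups in order of first appearance, so sort by REGIONC
  # to keep the columns aligned with the base R and dplyr results
  bootData[b,] <- data[sample(n, replace=TRUE), mean(KWH), by=REGIONC][order(REGIONC),V1]
})[3]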

3.4 Comparison

Run time

compare <- c(t_base, t_dplyr, t_data)
names(compare) <- c("base", "dplyr", "data.table")
barplot(compare, ylab = "time (sec)")

Bootstrap distributions

library(reshape2)
library(ggplot2)

# put all bootstrap results in a list for easier processing
bootResults <- list(bootBase, bootDplyr, bootData)
names(bootResults) <- c("base", "dplyr", "data.table")
for (i in seq_along(bootResults)){
  colnames(bootResults[[i]]) <- c("Northeast", "Midwest", "South", "West")
}

# melt to get ready for ggplot2
bootMelt <- melt(bootResults)[,-1]
colnames(bootMelt) <- c("region", "mean", "method")

qplot(x=mean, data = as.data.frame(bootMelt), geom = "histogram")+
  facet_grid(method ~ region, scales = "free")+
  theme_bw(16)+
  scale_x_continuous(breaks = seq(6,15,.5)*1000)

Note: For a more comprehensive comparison, you could repeat the above many times to get a distribution of run times.
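
One way to do this is to wrap a timed expression in `replicate()`. The sketch below times a deliberately small base R bootstrap on simulated data; `timeOnce`, `sim`, and the repetition counts are hypothetical names and values chosen for illustration, not part of the workshop code.

```r
set.seed(1)

# Hypothetical stand-in data (made-up values)
nSim <- 500
sim <- data.frame(
  KWH = rgamma(nSim, shape = 2, scale = 5000),
  REGIONC = sample(1:4, nSim, replace = TRUE)
)

# time one complete bootstrap run of Bsim resamples; returns elapsed seconds
timeOnce <- function(Bsim = 50) {
  unname(system.time(for (b in 1:Bsim) {
    temp <- sim[sample(nSim, replace = TRUE), ]
    tapply(temp$KWH, temp$REGIONC, mean)
  })["elapsed"])
}

# repeat the timing to get a distribution of run times
runTimes <- replicate(20, timeOnce())
summary(runTimes)
```

The same wrapper could be written around the dplyr and data.table loops, and the three vectors of run times compared with `boxplot()`.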

4 Exercises

With dplyr and data.table, bootstrap both the mean and the median, grouped by region. Calculate both quantities with a single call to summarise() and data.table[], respectively. Use B=100 iterations while you are testing your code.
With dplyr and data.table, bootstrap the mean for all numeric variables, grouped by region. Do this with a single call to dplyr and data.table. The examples in this Stack Overflow discussion have useful tips, as well as commentary about dplyr vs data.table.
Watch Grace Hopper explain nanoseconds.

Note: I ran all time comparisons with R version 3.2.3 (2015-12-10) on an Intel i7 running at 2.67 GHz with 8 GB RAM.


Manipulating data quickly with `dplyr` and `data.table` (draft)

Feb 12, 2016
