dplyr and data.table (draft)
dplyr
To do: Introduce dplyr (focus is on readable syntax and organizing the analyst’s tasks).
data.table
To do: Introduce data.table (focus is on speed, memory, and concise syntax).
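We’ll bootstrap the mean kilowatt-hours (KWH) within each Census region (REGIONC) in the 2009 RECS data, first in base R, then with dplyr, then with data.table. For comparisons on larger datasets, please see Matt Dowle’s benchmarks (these also include a comparison with the pandas library for Python).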
parent <- dirname(getwd())
dataPath <- file.path(parent, "data")
# set the number of Bootstrap resamples
B <- 10000
data <- read.csv(file.path(dataPath, "recs2009.csv"))
n <- nrow(data)
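# one row per resample, one column per region
bootBase <- array(NA, dim = c(B, 4))
t_base <- system.time(for (b in 1:B) {
  # resample rows with replacement, then average KWH within each region
  temp <- data[sample(n, replace = TRUE), c("KWH", "REGIONC")]
  bootBase[b, ] <- tapply(temp$KWH, temp$REGIONC, mean)
})[3]  # keep the elapsed time only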
dplyr
Note: I don’t think this is the best way to bootstrap with dplyr. See the broom vignette for more information.
library(readr)
library(dplyr)
data <- read_csv(file.path(dataPath, "recs2009.csv"), progress=FALSE)
bootDplyr <- array(NA, dim=c(B,4))
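t_dplyr <- system.time(for (b in 1:B) {
  # resample rows with replacement, then average KWH within each region
  temp <- data %>%
    select(KWH, REGIONC) %>%
    sample_n(n, replace = TRUE) %>%
    group_by(REGIONC) %>%
    summarise(mean(KWH))
  bootDplyr[b, ] <- unlist(temp[, 2])
})[3]  # keep the elapsed time only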
data.table
library(data.table)
data <- fread(file.path(dataPath, "recs2009.csv"))
bootData <- array(NA, dim=c(B,4))
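t_data <- system.time(for (b in 1:B) {
  # resample rows, average KWH by region, then sort by region so the columns line up
  bootData[b, ] <- data[sample(n, replace = TRUE), mean(KWH), by = REGIONC][order(REGIONC), V1]
})[3]  # keep the elapsed time only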
Run time
compare <- c(t_base, t_dplyr, t_data)
names(compare) <- c("base", "dplyr", "data.table")
barplot(compare, ylab = "time (sec)")
Bootstrap distributions
library(reshape2)
library(ggplot2)
# put all bootstrap results in a list for easier processing
bootResults <- list(bootBase, bootDplyr, bootData)
names(bootResults) <- c("base", "dplyr", "data.table")
for (i in seq_along(bootResults)) {
  colnames(bootResults[[i]]) <- c("Northeast", "Midwest", "South", "West")
}
# melt to get ready for ggplot2
bootMelt <- melt(bootResults)[,-1]
colnames(bootMelt) <- c("region", "mean", "method")
ggplot(as.data.frame(bootMelt), aes(x = mean)) +
  geom_histogram() +
  facet_grid(method ~ region, scales = "free") +
  theme_bw(16) +
  scale_x_continuous(breaks = seq(6, 15, .5) * 1000)
Note: For a more comprehensive comparison, you could repeat the above many times to get a distribution of run times.
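One way to do this, sketched here on a small synthetic data set rather than the RECS file, is to wrap a bootstrap in a function and use replicate() to collect several elapsed times. The names toy, boot_base, and time_once are hypothetical and used only for illustration; the same wrapper could be applied to the dplyr and data.table versions.

```r
# Synthetic stand-in for the RECS data (assumption: 4 regions, numeric KWH)
set.seed(1)
toy <- data.frame(KWH = rexp(2000, rate = 1 / 11000),
                  REGIONC = sample(1:4, 2000, replace = TRUE))

# Base R bootstrap of the regional means, wrapped in a function
boot_base <- function(d, B = 50) {
  n <- nrow(d)
  out <- matrix(NA_real_, nrow = B, ncol = 4)
  for (b in seq_len(B)) {
    temp <- d[sample(n, replace = TRUE), ]
    out[b, ] <- tapply(temp$KWH, temp$REGIONC, mean)
  }
  out
}

# Elapsed time of a single call to f
time_once <- function(f, ...) unname(system.time(f(...))["elapsed"])

# Repeat the timing to get a distribution of run times
times_base <- replicate(10, time_once(boot_base, toy))
summary(times_base)
```

A boxplot of the timings for each method would then replace the single bar per method in the run time plot.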
With dplyr and data.table, bootstrap both the mean and the median, grouped by region. Calculate both quantities with a single call to summarise and data.table[], respectively. Use B=100 iterations while you are testing your code.
With dplyr and data.table, bootstrap the mean for all numeric variables, grouped by region. Do this with a single call to dplyr and data.table. The examples in this Stack Overflow discussion have useful tips, as well as commentary about dplyr vs data.table.
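As a starting point for both exercises, here is a minimal sketch of the relevant syntax on a toy data set. It is not a solution on the RECS data. The data frame toy and its columns g, x, and y are made up for illustration, and across() with where() assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(data.table)

# Toy data (hypothetical): one grouping column and two numeric columns
toy <- data.frame(g = rep(c("a", "b"), each = 3),
                  x = c(1, 2, 6, 10, 20, 60),
                  y = c(0, 1, 2, 3, 4, 5))
toyDT <- as.data.table(toy)

# Several summaries of one column in a single call
s_dplyr <- toy %>%
  group_by(g) %>%
  summarise(mean = mean(x), median = median(x))
s_dt <- toyDT[, .(mean = mean(x), median = median(x)), by = g]

# The mean of every numeric column in a single call
m_dplyr <- toy %>%
  group_by(g) %>%
  summarise(across(where(is.numeric), mean))
m_dt <- toyDT[, lapply(.SD, mean), by = g, .SDcols = c("x", "y")]
```

Inside a bootstrap loop, the same calls would be applied to the resampled rows instead of the full table.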
Note: I ran all time comparisons with R version 3.2.3 (2015-12-10) on an Intel i7 running at 2.67 GHz with 8 GB RAM.