To follow along, please download the datasets from the previous tutorial and place all downloaded files in a folder called data at the same level as the working directory. Alternatively, you could change dataPath in the first code block to point to the location of the data. You could also clone the repository and run the download_files.R script in the data folder.
Managing files from within R can be useful in some situations, like making reproducible reports. Let’s see what’s in our data folder.
parent <- dirname(getwd())
dataPath <- file.path(parent, "data")
list.files(dataPath)
## [1] "download_files.R" "public_layout.csv" "recs2009.csv"
## [4] "train.csv" "train.zip"
We’re only interested in the csv files, so let’s just look at those. We’ll save the file names so we can use them later.
csvFiles <- list.files(dataPath, pattern = "\\.csv$")
csvFiles
## [1] "public_layout.csv" "recs2009.csv" "train.csv"
Let’s see how large the files are. We’re going to work with the recs2009.csv file, which contains data from the 2009 Residential Energy Consumption Survey (RECS).
setwd(dataPath)
info <- file.info(path = csvFiles)
info
## size isdir mode mtime
## public_layout.csv 59883 FALSE 666 2016-02-11 19:55:02
## recs2009.csv 27460827 FALSE 666 2016-02-12 10:31:46
## train.csv 1942848724 FALSE 666 2016-02-12 10:45:43
## ctime atime exe
## public_layout.csv 2016-02-06 18:45:11 2016-02-11 19:55:02 no
## recs2009.csv 2016-02-06 18:50:06 2016-02-06 18:50:06 no
## train.csv 2016-02-12 10:44:48 2016-02-12 10:44:48 no
size <- info["recs2009.csv", "size"]
size/1e6
## [1] 27.46083
So the recs2009.csv file is 27 MB – a very manageable size.
Base R functions and data structures, such as read.csv and data.frame, are fine for most datasets. However, for larger datasets, the readr and data.table packages are noticeably faster.
# install.packages(c("readr", "data.table"))
library(readr)
library(data.table)
At 27 MB, the recs2009.csv file isn’t large. However, if we were working with many files of this size, readr and data.table would save a lot of time.
# Compare read-in times
t_base <- system.time(data <- read.csv(file.path(dataPath, "recs2009.csv")))[3] # base R
t_fread <- system.time(data <- fread(file.path(dataPath, "recs2009.csv"), showProgress = FALSE))[3] # data.table
t_readr <- system.time(data <- read_csv(file.path(dataPath, "recs2009.csv"), progress = FALSE))[3] # readr
## Warning: 30 parsing failures.
## row col expected actual
## 1538 GALLONFOOTH no trailing characters .724
## 1538 BTUFOOTH no trailing characters .732
## 1538 DOLFOOTH no trailing characters .373
## 3628 NUMCORDS no trailing characters .5
## 3981 GALLONFOOTH no trailing characters .681
## .... ........... ...................... ......
## See problems(...) for more details.
compare <- c(t_base, t_readr, t_fread)
names(compare) <- c("read.csv", "read_csv", "fread")
# times relative to read.csv
signif(compare / compare[1], 2)
## read.csv read_csv fread
## 1.00 0.17 0.06
barplot(compare, ylab = "time (sec)",
main = paste("Time to read in ", round(size/1e6),
" MB csv file", sep = ""))
We downloaded the recs2009.csv file earlier to save time, but you can also pass the url directly to read.csv, read_csv, and fread.
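For example, something like this should work (the URL below is just a placeholder; swap in the actual web address of the csv file):
# url is a placeholder; replace it with the real address of the csv file
url <- "https://example.com/recs2009.csv"
# data <- read.csv(url)
# data <- read_csv(url)
# data <- fread(url)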
read_csv had trouble parsing some records. The tbl_df object records these problems as an attribute.
head(problems(data))
## Source: local data frame [6 x 4]
##
## row col expected actual
## (int) (chr) (chr) (chr)
## 1 1538 GALLONFOOTH no trailing characters .724
## 2 1538 BTUFOOTH no trailing characters .732
## 3 1538 DOLFOOTH no trailing characters .373
## 4 3628 NUMCORDS no trailing characters .5
## 5 3981 GALLONFOOTH no trailing characters .681
## 6 3981 BTUFOOTH no trailing characters .778
Before analyzing the data, we would want to look at the original file and figure out the problem.
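For example, here’s one way to pull up the raw line behind the first reported failure (assuming the row numbers from problems() count data rows, so the header shifts everything down by one line):
# data row 1538 sits on file line 1539 because of the header row
raw <- readLines(file.path(dataPath, "recs2009.csv"), n = 1539)
raw[1539]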
Note: read-in times can vary. For a more complete comparison, you could do a simulation study (read in the same data many times to get the means, quantiles, and standard deviations of the read-in times).
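A minimal sketch of that idea for read_csv alone might look like this (the number of repetitions is an arbitrary choice):
# read the same file several times and summarize the elapsed times
reps <- 5
times <- replicate(reps,
                   system.time(read_csv(file.path(dataPath, "recs2009.csv"),
                                        progress = FALSE))[3])
c(mean = mean(times), sd = sd(times))
quantile(times)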
Now let’s see what happens with a larger file.
size <- info["train.csv", "size"]
size/1e6
## [1] 1942.849
The train.csv file contains the trajectories of 442 taxis in the city of Porto, Portugal, for one complete year (from 7/1/2013 to 6/30/2014). The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD) hosted a competition through Kaggle with this dataset last summer. The data are publicly available on the UCI Machine Learning Repository.
At 1,943 MB, the train.csv file is too large for read.csv. The following code crashed my computer, which has 8 GB of memory:
# warning: the following might crash your computer
# read.csv(file.path(dataPath, 'train.csv'))
However, we can still read in the file with readr and data.table. We can use the rm and gc functions to remove large objects and return the memory to the operating system (gc stands for garbage collection). Typically, we only need to call rm (see Hadley Wickham’s Advanced R). However, when I monitor memory use with the Windows task manager, I don’t see a decrease until after I call gc. I don’t know if it helps, but it doesn’t hurt, so sometimes I call gc to be safe. system.time calls gc before timing by default (gcFirst = TRUE), so we don’t need gc after the first rm.
To keep the comparison below fair, I specified the column types. If you don’t specify column types with this example, read_csv does a better job of guessing which columns are integers. fread guessed that everything was a character string, and actually took longer to read in the data.
t_readr <- system.time(data <- read_csv(file.path(dataPath, 'train.csv'),
    col_types = cols(
      TRIP_ID = col_character(),
      CALL_TYPE = col_character(),
      ORIGIN_CALL = col_integer(),
      ORIGIN_STAND = col_integer(),
      TAXI_ID = col_integer(),
      TIMESTAMP = col_integer(),
      DAY_TYPE = col_character(),
      MISSING_DATA = col_character(),
      POLYLINE = col_character()),
    progress = FALSE)
)[3]
rm(data)
t_fread <- system.time(data <- fread(file.path(dataPath, 'train.csv'),
    colClasses = c("character", "character", rep("integer", 4), rep("character", 3)),
    showProgress = FALSE)
)[3]
rm(data)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 421542 22.6 5309345 283.6 5565496 297.3
## Vcells 4782281 36.5 290816354 2218.8 265207694 2023.4
compare <- c(t_readr, t_fread)
names(compare) <- c("read_csv", "fread")
barplot(compare, ylab = "time (sec)",
main = paste("Time to read in ", prettyNum(round(size / 1e6), big.mark = ","), " MB csv file", sep = ""))
fread from data.table is typically faster than read_csv from readr. fread creates a data.table instead of a data.frame, which uses different syntax and doesn’t work as well with dplyr. I think the syntax is fine, though, and operations with data.table are fast. If you need to convert a data.table to a data.frame, you can use the as.data.frame function. See Hadley Wickham’s github page on readr for more information on readr.
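As a toy sketch of the data.table syntax and of converting back to a data.frame (the example table below is made up):
# dt[i, j, by] syntax: summarize n by type
dt <- data.table(type = c("A", "A", "B"), n = 1:3)
dt[, .(total = sum(n)), by = type]
# convert to a plain data.frame when base R or dplyr syntax is preferred
df <- as.data.frame(dt)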
There are several options for working with datasets too large to fit into memory. As one quick approach, you can use the gawk command line tool from within R to read in a subset of the data. This can help you get a sense for the data before implementing one of the other computing solutions mentioned below. We’ll use the train.csv dataset, and treat it as if it were too big for memory.
setwd(dataPath)
# read in every 250th line
filePipe <- pipe("gawk 'BEGIN{i=0}{i++; if(i%250==0) print $1}' < train.csv")
system.time(train <- read.table(filePipe, sep = ","))
## user system elapsed
## 4.12 17.12 35.35
header <- read.csv("train.csv", nrows = 1)
colnames(train) <- colnames(header)
str(train)
## 'data.frame': 6842 obs. of 9 variables:
## $ TRIP_ID : num 1.37e+18 1.37e+18 1.37e+18 1.37e+18 1.37e+18 ...
## $ CALL_TYPE : Factor w/ 3 levels "A","B","C": 3 2 2 3 1 3 2 2 3 2 ...
## $ ORIGIN_CALL : int NA NA NA NA 14558 NA NA NA NA NA ...
## $ ORIGIN_STAND: int NA 52 61 NA NA NA 15 NA NA 23 ...
## $ TAXI_ID : int 20000653 20000076 20000341 20000320 20000432 20000598 20000268 20000451 20000455 20000463 ...
## $ TIMESTAMP : int 1372649868 1372662088 1372665287 1372664244 1372669608 1372673543 1372672664 1372678992 1372670258 1372678111 ...
## $ DAY_TYPE : Factor w/ 1 level "A": 1 1 1 1 1 1 1 1 1 1 ...
## $ MISSING_DATA: Factor w/ 1 level "False": 1 1 1 1 1 1 1 1 1 1 ...
## $ POLYLINE : Factor w/ 6820 levels "[[-7.787205,41.088267],[-7.783578,41.090211],[-7.782561,41.093874],[-7.781742,41.096961],[-7.779546,41.099697],[-7.776018,41.10"| __truncated__,..: 6084 3482 1224 2028 5796 822 707 4074 2990 3296 ...
Gawk is a GNU implementation of awk, a command line file processing tool for Unix. For more information, see the gawk home page. Kerby Shedden’s STAT 506 class notes also have some pointers for using awk.
This is a quick way to peek at large datasets with R. However, the subset might be a biased sample from the full dataset, especially if the variables (columns) have non-random trends across the observations (rows). In some cases, you might be able to repeat the same analysis with multiple samples, and then aggregate the results. Ariel Kleiner et al.’s Bag of Little Bootstraps is an example of this approach. I think it was developed with a distributed computing system in mind, but you could also implement it on a desktop.
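As a rough sketch of that idea with the gawk approach above (readSubset is a hypothetical helper, and the offsets and step size are arbitrary choices; this assumes the working directory is still the data folder):
# read every 250th line, starting at a different offset each time
readSubset <- function(offset, step = 250) {
  cmd <- sprintf("gawk 'BEGIN{i=0}{i++; if(i%%%d==%d) print $0}' < train.csv",
                 step, offset)
  read.table(pipe(cmd), sep = ",")
}
# estimate the share of trips with CALL_TYPE == "A" (column 2) in each subsample
props <- sapply(c(10, 20, 30), function(k) mean(readSubset(k)[[2]] == "A"))
c(mean = mean(props), sd = sd(props))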
Depending on your situation, you might also consider:
- The rhdf5 package for data stored in HDF5 format (see Kerby Shedden’s STAT 506 class notes).
- dplyr or another package for connecting to a database (see Hadley Wickham’s vignette for connecting dplyr to remote databases). ff is another package that works with data stored on disk. To fully take advantage of these options, it would help to have a solid state drive (SSD).
- The bigmemory package to more efficiently store matrices in memory.
- The POLYLINE variable in the train.csv dataset is in JSON format. You can use an R package, such as json, to parse this variable.
Here are a few more things to try:
- Try reading the train.zip file directly, without unzipping it first, using read_csv. Try to do the same with fread. I think you’ll need to insert a command line option into the call, e.g. 'zcat train.zip' on Linux. I’m not sure how to do this on Windows, but let me know if you find a way.
- Gzipped files can be read with gzfile, e.g. read_csv(gzfile("2010.csv.gz")).
- Try reading the recs2009.csv file directly from the web by passing the url to read.csv, read_csv, and fread.
Note: I ran all time comparisons with R version 3.2.3 (2015-12-10) on an Intel i7 running at 2.67 GHz with 8 GB RAM.