To follow along, please download the datasets from the previous tutorial and place all downloaded files in a folder called data at the same level as the working directory. Alternatively, you could change dataPath in the first code block to point to the location of the data. You could also clone the repository and run the download_files.R script in the data folder.
Managing files from within R can be useful in some situations, like making reproducible reports. Let’s see what’s in our data folder.
parent <- dirname(getwd())
dataPath <- file.path(parent, "data")
list.files(dataPath)
## [1] "download_files.R" "public_layout.csv" "recs2009.csv"
## [4] "train.csv" "train.zip"
We’re only interested in the csv files, so let’s just look at those. We’ll save the file names so we can use them later.
csvFiles <- list.files(dataPath, pattern = "\\.csv$")
csvFiles
## [1] "public_layout.csv" "recs2009.csv" "train.csv"
Let’s see how large the files are. We’re going to work with the recs2009.csv file, which contains data from the 2009 Residential Energy Consumption Survey (RECS).
setwd(dataPath)
info <- file.info(path = csvFiles)
info
## size isdir mode mtime
## public_layout.csv 59883 FALSE 666 2016-02-11 19:55:02
## recs2009.csv 27460827 FALSE 666 2016-02-12 10:31:46
## train.csv 1942848724 FALSE 666 2016-02-12 10:45:43
## ctime atime exe
## public_layout.csv 2016-02-06 18:45:11 2016-02-11 19:55:02 no
## recs2009.csv 2016-02-06 18:50:06 2016-02-06 18:50:06 no
## train.csv 2016-02-12 10:44:48 2016-02-12 10:44:48 no
size <- info["recs2009.csv", "size"]
size/1e6
## [1] 27.46083
So the recs2009.csv file is 27 MB – a very manageable size.
Base R functions and data structures, such as read.csv and data.frame, are fine for most datasets. However, for larger datasets, the readr and data.table packages are noticeably faster.
# install.packages(c("readr", "data.table"))
library(readr)
library(data.table)
At 27 MB, the recs2009.csv file isn’t large. However, if we were working with many files of this size, readr and data.table would save a lot of time.
# Compare read-in times
t_base <- system.time(data <- read.csv(file.path(dataPath, "recs2009.csv")))[3] # base R
t_fread <- system.time(data <- fread(file.path(dataPath, "recs2009.csv"), showProgress = FALSE))[3] # data.table
t_readr <- system.time(data <- read_csv(file.path(dataPath, "recs2009.csv"), progress = FALSE))[3] # readr
## Warning: 30 parsing failures.
## row col expected actual
## 1538 GALLONFOOTH no trailing characters .724
## 1538 BTUFOOTH no trailing characters .732
## 1538 DOLFOOTH no trailing characters .373
## 3628 NUMCORDS no trailing characters .5
## 3981 GALLONFOOTH no trailing characters .681
## .... ........... ...................... ......
## See problems(...) for more details.
compare <- c(t_base, t_readr, t_fread)
names(compare) <- c("read.csv", "read_csv", "fread")
# times relative to read.csv
signif(compare / compare[1], 2)
## read.csv read_csv fread
## 1.00 0.17 0.06
barplot(compare, ylab = "time (sec)",
main = paste("Time to read in ", round(size/1e6),
" MB csv file", sep = ""))
We downloaded the recs2009.csv file earlier to save time, but you can also pass the url directly to read.csv, read_csv, and fread.
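For example, something like this should work (the URL below is just a placeholder; swap in the actual web address of the csv file):
# url is a placeholder; replace it with the real address of the csv file
url <- "https://example.com/recs2009.csv"
# data <- read.csv(url)
# data <- read_csv(url)
# data <- fread(url)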
read_csv had trouble parsing some records. The tbl_df object records these problems as an attribute.
head(problems(data))
## Source: local data frame [6 x 4]
##
## row col expected actual
## (int) (chr) (chr) (chr)
## 1 1538 GALLONFOOTH no trailing characters .724
## 2 1538 BTUFOOTH no trailing characters .732
## 3 1538 DOLFOOTH no trailing characters .373
## 4 3628 NUMCORDS no trailing characters .5
## 5 3981 GALLONFOOTH no trailing characters .681
## 6 3981 BTUFOOTH no trailing characters .778
Before analyzing the data, we would want to look at the original file and figure out the problem.
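For example, here’s one way to pull up the raw line behind the first reported failure (assuming the row numbers from problems() count data rows, so the header shifts everything down by one line):
# data row 1538 sits on file line 1539 because of the header row
raw <- readLines(file.path(dataPath, "recs2009.csv"), n = 1539)
raw[1539]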
Note: read-in times can vary. For a more complete comparison, you could do a simulation study (read in the same data many times to get the means, quantiles, and standard deviations of the read-in times).
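A minimal sketch of that idea for read_csv alone might look like this (the number of repetitions is an arbitrary choice):
# read the same file several times and summarize the elapsed times
reps <- 5
times <- replicate(reps,
                   system.time(read_csv(file.path(dataPath, "recs2009.csv"),
                                        progress = FALSE))[3])
c(mean = mean(times), sd = sd(times))
quantile(times)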
Now let’s see what happens with a larger file.
size <- info["train.csv", "size"]
size/1e6
## [1] 1942.849
The train.csv file contains the trajectories of 442 taxis in the city of Porto, Portugal, for one complete year (from 7/1/2013 to 6/30/2014). The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD) hosted a competition through Kaggle with this dataset last summer. The data are publicly available on the UCI Machine Learning Repository.
At 1,943 MB, the train.csv file is too large for read.csv. The following code crashed my computer, which has 8 GB of memory:
# warning: the following might crash your computer
# read.csv(file.path(dataPath, 'train.csv'))
However, we can still read in the file with readr and data.table. We can use the rm and gc functions to remove large objects and return the memory to the operating system (gc stands for garbage collection). Typically, we only need to call rm (see Hadley Wickham’s Advanced R). However, when I monitor memory use with the Windows task manager, I don’t see a decrease until after I call gc. I don’t know if it helps, but it doesn’t hurt, so sometimes I call gc to be safe. system.time calls gc before timing by default (gcFirst = TRUE), so we don’t need gc after the first rm.
To keep the comparison below fair, I specified the column types. If you don’t specify column types with this example, read_csv does a better job of guessing which columns are integers. fread guessed that everything was a character string, and actually took longer to read in the data.
t_readr <- system.time(data <- read_csv(file.path(dataPath, 'train.csv'),
    col_types = cols(
      TRIP_ID = col_character(),
      CALL_TYPE = col_character(),
      ORIGIN_CALL = col_integer(),
      ORIGIN_STAND = col_integer(),
      TAXI_ID = col_integer(),
      TIMESTAMP = col_integer(),
      DAY_TYPE = col_character(),
      MISSING_DATA = col_character(),
      POLYLINE = col_character()),
    progress = FALSE)
)[3]
rm(data)
t_fread <- system.time(data <- fread(file.path(dataPath, 'train.csv'),
    colClasses = c("character", "character", rep("integer", 4), rep("character", 3)),
    showProgress = FALSE)
)[3]
rm(data)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 421542 22.6 5309345 283.6 5565496 297.3
## Vcells 4782281 36.5 290816354 2218.8 265207694 2023.4
compare <- c(t_readr, t_fread)
names(compare) <- c("read_csv", "fread")
barplot(compare, ylab = "time (sec)",
main = paste("Time to read in ", prettyNum(round(size / 1e6), big.mark = ","), " MB csv file", sep = ""))
fread from data.table is typically faster than read_csv from readr. fread creates a data.table instead of a data.frame, which uses different syntax and doesn’t work as well with dplyr. I think the syntax is fine, though, and operations with data.table are fast. If you need to convert a data.table to a data.frame, you can use the as.data.frame function. See Hadley Wickham’s github page on readr for more information on readr.
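As a toy sketch of the data.table syntax and of converting back to a data.frame (the example table below is made up):
# dt[i, j, by] syntax: summarize n by type
dt <- data.table(type = c("A", "A", "B"), n = 1:3)
dt[, .(total = sum(n)), by = type]
# convert to a plain data.frame when base R or dplyr syntax is preferred
df <- as.data.frame(dt)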
There are several options for working with datasets too large to fit into memory. As one quick approach, you can use the gawk command line tool from within R to read in a subset of the data. This can help you get a sense for the data before implementing one of the other computing solutions mentioned below. We’ll use the train.csv dataset, and treat it as if it were too big for memory.
setwd(dataPath)
# read in every 250th line
filePipe <- pipe("gawk 'BEGIN{i=0}{i++; if(i%250==0) print $1}' < train.csv")
system.time(train <- read.table(filePipe, sep = ","))
## user system elapsed
## 4.12 17.12 35.35
header <- read.csv("train.csv", nrows = 1)
colnames(train) <- colnames(header)
str(train)
## 'data.frame': 6842 obs. of 9 variables:
## $ TRIP_ID : num 1.37e+18 1.37e+18 1.37e+18 1.37e+18 1.37e+18 ...
## $ CALL_TYPE : Factor w/ 3 levels "A","B","C": 3 2 2 3 1 3 2 2 3 2 ...
## $ ORIGIN_CALL : int NA NA NA NA 14558 NA NA NA NA NA ...
## $ ORIGIN_STAND: int NA 52 61 NA NA NA 15 NA NA 23 ...
## $ TAXI_ID : int 20000653 20000076 20000341 20000320 20000432 20000598 20000268 20000451 20000455 20000463 ...
## $ TIMESTAMP : int 1372649868 1372662088 1372665287 1372664244 1372669608 1372673543 1372672664 1372678992 1372670258 1372678111 ...
## $ DAY_TYPE : Factor w/ 1 level "A": 1 1 1 1 1 1 1 1 1 1 ...
## $ MISSING_DATA: Factor w/ 1 level "False": 1 1 1 1 1 1 1 1 1 1 ...
## $ POLYLINE : Factor w/ 6820 levels "[[-7.787205,41.088267],[-7.783578,41.090211],[-7.782561,41.093874],[-7.781742,41.096961],[-7.779546,41.099697],[-7.776018,41.10"| __truncated__,..: 6084 3482 1224 2028 5796 822 707 4074 2990 3296 ...
Gawk is a GNU implementation of awk, a command line file processing tool for Unix. For more information, see the gawk home page. Kerby Shedden’s STAT 506 class notes also have some pointers for using awk.
This is a quick way to peek at large datasets with R. However, the subset might be a biased sample from the full dataset, especially if the variables (columns) have non-random trends across the observations (rows). In some cases, you might be able to repeat the same analysis with multiple samples, and then aggregate the results. Ariel Kleiner et al.’s Bag of Little Bootstraps is an example of this approach. I think it was developed with a distributed computing system in mind, but you could also implement it on a desktop.
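As a rough sketch of that idea with the gawk approach above (readSubset is a hypothetical helper, and the offsets and step size are arbitrary choices; this assumes the working directory is still the data folder):
# read every 250th line, starting at a different offset each time
readSubset <- function(offset, step = 250) {
  cmd <- sprintf("gawk 'BEGIN{i=0}{i++; if(i%%%d==%d) print $0}' < train.csv",
                 step, offset)
  read.table(pipe(cmd), sep = ",")
}
# estimate the share of trips with CALL_TYPE == "A" (column 2) in each subsample
props <- sapply(c(10, 20, 30), function(k) mean(readSubset(k)[[2]] == "A"))
c(mean = mean(props), sd = sd(props))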
Depending on your situation, you might also consider:
- The rhdf5 package for data stored in HDF5 format (see Kerby Shedden’s STAT 506 class notes).
- dplyr or another package for connecting to a database (see Hadley Wickham’s vignette for connecting dplyr to remote databases). ff is another package that works with data stored on disk. To fully take advantage of these options, it would help to have a solid state drive (SSD).
- The bigmemory package to more efficiently store matrices in memory.
- The POLYLINE variable in the train.csv dataset is in JSON format. You can use an R package, such as json, to parse this variable.
Here are a few more things to try:
- Try reading the train.zip file directly, without unzipping it first, using read_csv. Try to do the same with fread. I think you’ll need to insert a command line option into the call, e.g. 'zcat train.zip' on Linux. I’m not sure how to do this on Windows, but let me know if you find a way.
- Gzipped files can be read with gzfile, e.g. read_csv(gzfile("2010.csv.gz")).
- Try reading the recs2009.csv file directly from the web by passing the url to read.csv, read_csv, and fread.
Note: I ran all time comparisons with R version 3.2.3 (2015-12-10) on an Intel i7 running at 2.67 GHz with 8 GB RAM.