readr
Let’s load the data from the 2009 residential energy consumption survey (RECS). As described on the website, the 2009 RECS included 12,083 households selected at random using a complex multistage, area-probability sample design.
library(readr)
parent <- dirname(getwd())
dataPath <- file.path(parent, "data")
# download the data if it doesn't already exist
if(!file.exists(file.path(parent, "data", "recs2009.csv"))){
download.file(url = "http://www.eia.gov/consumption/residential/data/2009/csv/recs2009_public.csv",
mode = "wb",
destfile=file.path(dataPath, "recs2009.csv"))
}
data <- read_csv(file.path(dataPath, "recs2009.csv"), progress = FALSE)
prob <- problems(data)
badCol <- unique(prob$col)
badCol
## [1] "GALLONFOOTH" "BTUFOOTH" "DOLFOOTH" "NUMCORDS" "BTUKERSPH"
## [6] "DOLKERSPH" "BTUKEROTH" "DOLKEROTH"
There were problems reading in eight columns. There were only 30 observations with problems, but before proceeding, we should figure out if these problems indicate larger data quality issues, especially if we are interested in the problematic columns. To start, we can consult the RECS documentation, and if necessary, we could get in touch with the Energy Information Administration (EIA) staff.
In this case, EIA provided a layout file. We can use it to get more information about the problematic columns. The rows of the layout file are in the same order as columns of recs2009.csv
.
# get the layout file
layout <- read.csv("http://www.eia.gov/consumption/residential/data/2009/csv/public_layout.csv")
str(layout)
## 'data.frame': 931 obs. of 5 variables:
## $ Variable.Name : Factor w/ 931 levels "ACOTHERS","ACROOMS",..: 126 407 123 416 517 335 222 87 221 86 ...
## $ Variable.Label : Factor w/ 930 levels "2009 gross household income",..: 875 65 64 767 867 166 217 87 218 88 ...
## $ Variable.Order.in.File: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Variable.Type : Factor w/ 2 levels "Character","Numeric": 1 2 2 2 2 2 2 2 2 2 ...
## $ Length : int 5 8 8 8 8 8 8 8 8 8 ...
badColInfo <- layout[which(layout$Variable.Name %in% badCol),]
badColInfo
## Variable.Name
## 725 NUMCORDS
## 885 GALLONFOOTH
## 889 BTUFOOTH
## 893 DOLFOOTH
## 899 BTUKERSPH
## 901 BTUKEROTH
## 903 DOLKERSPH
## 905 DOLKEROTH
## Variable.Label
## 725 Cords of wood used in 2009 (if more than 5)
## 885 Fuel Oil usage for other purposes (all end-uses except SPH and WTH), in gallons, 2009
## 889 Fuel Oil usage for other purposes (all end-uses except SPH and WTH), in thousand BTU, 2009
## 893 Cost of Fuel Oil for other purposes (all end-uses except SPH and WTH), in whole dollars, 2009
## 899 Kerosene usage for space heating, in thousand BTU, 2009
## 901 Kerosene usage for other purposes (all end-uses except SPH and WTH), in thousand BTU, 2009
## 903 Cost of Kerosene for space heating, in whole dollars, 2009
## 905 Cost of Kerosene for other purposes (all end-uses except SPH and WTH), in whole dollars, 2009
## Variable.Order.in.File Variable.Type Length
## 725 725 Numeric 8
## 885 885 Numeric 8
## 889 889 Numeric 8
## 893 893 Numeric 8
## 899 899 Numeric 8
## 901 901 Numeric 8
## 903 903 Numeric 8
## 905 905 Numeric 8
I’ve never worked with this data before, but ‘Cords of wood used in 2009 (if more than 5)’ sounds like it should be restricted to be a non-negative integer. I suspect that most of the data are ok and that there are just a few problematic values, but I would look into it further before doing a serious analysis.
Assuming that we determined that the data were ok, let’s use the layout file to tell readr
how to read the data properly. See vignette("column-types")
for more information about column types (enter vignette("column-types")
into an R session after loading the readr
package).
colTypes <- as.character(layout$Variable.Type)
colTypes <- gsub("Numeric", "d", colTypes) # d for double
colTypes <- gsub("Character", "c", colTypes) # c for character
colTypes <- paste(colTypes, collapse = "") # concatenate into 1 string
data <- read_csv(file.path(dataPath, "recs2009.csv"),
col_types = colTypes,
progress = FALSE)
This is one way to solve the problem, though now every column is a double, so the data take up more space. This isn’t a big dataset, so it’s not a problem, but in other cases you might want to specify all the numeric columns as integers, except for the eight problematic columns.