Assessing parsing errors from readr

Let’s load the data from the 2009 residential energy consumption survey (RECS). As described on the website, the 2009 RECS included 12,083 households selected at random using a complex multistage, area-probability sample design.

library(readr)

parent <- dirname(getwd())
dataPath <- file.path(parent, "data")

# download the data if it doesn't already exist
if(!file.exists(file.path(parent, "data", "recs2009.csv"))){
  download.file(url = "http://www.eia.gov/consumption/residential/data/2009/csv/recs2009_public.csv",
    mode = "wb",
    destfile=file.path(dataPath, "recs2009.csv"))
}

data <- read_csv(file.path(dataPath, "recs2009.csv"), progress = FALSE)
prob <- problems(data)
badCol <- unique(prob$col)
badCol

## [1] "GALLONFOOTH" "BTUFOOTH"    "DOLFOOTH"    "NUMCORDS"    "BTUKERSPH"  
## [6] "DOLKERSPH"   "BTUKEROTH"   "DOLKEROTH"

There were problems reading in eight columns. There were only 30 observations with problems, but before proceeding, we should figure out if these problems indicate larger data quality issues, especially if we are interested in the problematic columns. To start, we can consult the RECS documentation, and if necessary, we could get in touch with the Energy Information Administration (EIA) staff.

In this case, EIA provided a layout file. We can use it to get more information about the problematic columns. The rows of the layout file are in the same order as columns of recs2009.csv.

# get the layout file
layout <- read.csv("http://www.eia.gov/consumption/residential/data/2009/csv/public_layout.csv")
str(layout)

## 'data.frame':    931 obs. of  5 variables:
##  $ Variable.Name         : Factor w/ 931 levels "ACOTHERS","ACROOMS",..: 126 407 123 416 517 335 222 87 221 86 ...
##  $ Variable.Label        : Factor w/ 930 levels "2009 gross household income",..: 875 65 64 767 867 166 217 87 218 88 ...
##  $ Variable.Order.in.File: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Variable.Type         : Factor w/ 2 levels "Character","Numeric": 1 2 2 2 2 2 2 2 2 2 ...
##  $ Length                : int  5 8 8 8 8 8 8 8 8 8 ...

badColInfo <- layout[which(layout$Variable.Name %in% badCol),]
badColInfo

##     Variable.Name
## 725      NUMCORDS
## 885   GALLONFOOTH
## 889      BTUFOOTH
## 893      DOLFOOTH
## 899     BTUKERSPH
## 901     BTUKEROTH
## 903     DOLKERSPH
## 905     DOLKEROTH
##                                                                                    Variable.Label
## 725                                                   Cords of wood used in 2009 (if more than 5)
## 885         Fuel Oil usage for other purposes (all end-uses except SPH and WTH), in gallons, 2009
## 889    Fuel Oil usage for other purposes (all end-uses except SPH and WTH), in thousand BTU, 2009
## 893 Cost of Fuel Oil for other purposes (all end-uses except SPH and WTH), in whole dollars, 2009
## 899                                       Kerosene usage for space heating, in thousand BTU, 2009
## 901    Kerosene usage for other purposes (all end-uses except SPH and WTH), in thousand BTU, 2009
## 903                                    Cost of Kerosene for space heating, in whole dollars, 2009
## 905 Cost of Kerosene for other purposes (all end-uses except SPH and WTH), in whole dollars, 2009
##     Variable.Order.in.File Variable.Type Length
## 725                    725       Numeric      8
## 885                    885       Numeric      8
## 889                    889       Numeric      8
## 893                    893       Numeric      8
## 899                    899       Numeric      8
## 901                    901       Numeric      8
## 903                    903       Numeric      8
## 905                    905       Numeric      8

I’ve never worked with this data before, but ‘Cords of wood used in 2009 (if more than 5)’ sounds like it should be restricted to be a non-negative integer. I suspect that most of the data are ok and that there are just a few problematic values, but I would look into it further before doing a serious analysis.

Assuming that we determined that the data were ok, let’s use the layout file to tell readr how to read the data properly. See vignette("column-types") for more information about column types (enter vignette("column-types") into an R session after loading the readr package).

colTypes <- as.character(layout$Variable.Type)
colTypes <- gsub("Numeric", "d", colTypes) # d for double
colTypes <- gsub("Character", "c", colTypes) # c for character
colTypes <- paste(colTypes, collapse = "") # concatenate into 1 string

data <- read_csv(file.path(dataPath, "recs2009.csv"),
  col_types = colTypes,
  progress = FALSE)

This is one way to solve the problem, though now every column is a double, so the data take up more space. This isn’t a big dataset, so it’s not a problem, but in other cases you might want to specify all the numeric columns as integers, except for the eight problematic columns.

Computing workshop homepage

Assessing parsing errors from `readr`

Feb 12, 2016