readr::problems() returns tidy data!

A handy little trick I picked up today when using readr.

Some background: I needed a mapping between ZIP Code Tabulation Areas (ZCTAs) and counties (to link to some urban/rural data). The Census Bureau provides a CSV-style table that includes information about each ZCTA (e.g., size, population, area by land/water type) and the FIPS codes for the state and county.

However, when I load that data using the readr package:

library(tidyverse)
zcta_to_county_mapping <- read_csv("http://www2.census.gov/geo/docs/maps-data/data/rel/zcta_county_rel_10.txt") %>%
  select(ZCTA5, STATE, COUNTY) %>%
  mutate(STATE = as.numeric(STATE),
         COUNTY = as.numeric(COUNTY))
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   ZCTA5 = col_character(),
##   COUNTY = col_character(),
##   COAREA = col_double(),
##   COAREALAND = col_double(),
##   ZPOPPCT = col_double(),
##   ZHUPCT = col_double(),
##   ZAREAPCT = col_double(),
##   ZAREALANDPCT = col_double(),
##   COPOPPCT = col_double(),
##   COHUPCT = col_double(),
##   COAREAPCT = col_double(),
##   COAREALANDPCT = col_double()
## )
## See spec(...) for full column specifications.
## Warning: 1592 parsing failures.
##  row        col   expected     actual
## 1303 ZAREA      an integer 3298386447
## 1303 ZAREALAND  an integer 3032137295
## 1304 AREAPT     an integer 2429735568
## 1304 AREALANDPT an integer 2262437812
## 1304 ZAREA      an integer 3298386447
## .... .......... .......... ..........
## See problems(...) for more details.

It produces a warning. Looking at the few rows it returned, it seems likely that the errors are coming from overflow - read_csv() guessed that the variable was of type integer (4 bytes, max value of \(2^{31} - 1\) or 2,147,483,647), but some of these values are larger than that. I looked up a few of them and saw that they were all occurring in large, unpopulated areas. One of them (ZIP code 04462) is described by UnitedStatesZipCodes.org as covering “an extremely large land area compared to other ZIP codes in the United States.”
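To make the overflow concrete, here is a small base R check using one of the values from the warning above. It only illustrates R's 32-bit integer limit; it is not a reproduction of what readr does internally, and the variable names are my own.

```r
# R integers are 32-bit, so the largest representable value is 2^31 - 1.
int_max <- .Machine$integer.max

# One of the values reported in the parsing failures above.
zarea <- "3298386447"

# Converting to integer yields NA (with a warning); a double holds the value exactly.
as_int <- suppressWarnings(as.integer(zarea))
as_dbl <- as.numeric(zarea)

print(int_max)           # 2147483647
print(is.na(as_int))     # TRUE
print(as_dbl > int_max)  # TRUE
```

This is why the affected columns would parse cleanly as doubles but not as integers.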

So that seems like the source of the issue - but there were 1,592 failures! I want to make sure those failures never affect the variables that I’m interested in. I noticed the warning message says to use problems() to see more details. I did as I was told, expecting something about as useful as the results of warnings(), but was pleased to get back a tbl_df!
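Because the result is a data frame, the usual dplyr verbs apply to it. As a minimal sketch, here is how the failures could be tallied by column. The toy CSV and its column names (id, area, pop) are made up for illustration, and the col handling assumes that some readr versions report a column position instead of a column name.

```r
library(readr)
library(dplyr)

# Hypothetical toy data: one value overflows an integer, one is not a number.
toy_csv <- "id,area,pop\na,100,5\nb,3298386447,7\nc,200,x\n"

# Force integer parsing so the bad values are recorded as parsing failures.
toy <- suppressWarnings(read_csv(
  I(toy_csv),
  col_types = cols(
    id = col_character(),
    area = col_integer(),
    pop = col_integer()
  )
))

probs <- problems(toy)

# Assumption: `col` may be a column name or a column position depending on
# the readr version; convert positions to names so the summary reads the same.
col_name <- if (is.numeric(probs$col)) names(toy)[probs$col] else probs$col

failures_by_col <- probs %>%
  mutate(col = col_name) %>%
  count(col, expected, sort = TRUE)

print(failures_by_col)
```

On the real file, the same count() would show which columns account for the 1,592 failures.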

Checking to make sure the errors didn’t affect my variables of interest (ZCTA5, STATE and COUNTY) was as easy as

problems(zcta_to_county_mapping) %>%
  filter(col %in% c("ZCTA5", "STATE", "COUNTY"))
## # A tibble: 0 × 4
## # ... with 4 variables: row <int>, col <int>, expected <chr>, actual <chr>
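One caveat I have not verified against this exact setup: problems() reads information attached to the object that read_csv() returns, and that information may not survive select() and mutate() in every dplyr version. Newer readr releases also appear to report col as a column position, not a name, in which case filtering on names would always come back empty. Treating both as assumptions, a more defensive sketch checks the raw result before any other verbs and handles either form of col. The toy CSV below is a hypothetical stand-in for the Census file.

```r
library(readr)
library(dplyr)

# Hypothetical stand-in for the Census file, with one overflowing ZAREA value.
toy_csv <- "ZCTA5,STATE,COUNTY,ZAREA\n00601,72,001,100\n04462,23,019,3298386447\n"

# Step 1: read only, and keep the raw result around.
raw <- suppressWarnings(read_csv(
  I(toy_csv),
  col_types = cols(
    ZCTA5 = col_character(),
    STATE = col_integer(),
    COUNTY = col_character(),
    ZAREA = col_integer()
  )
))

# Step 2: inspect failures on the raw object, before any dplyr verbs.
probs <- problems(raw)

# Assumption: `col` may be a name or a position depending on the readr version.
probs_col <- if (is.numeric(probs$col)) names(raw)[probs$col] else probs$col
affected <- probs[probs_col %in% c("ZCTA5", "STATE", "COUNTY"), ]

# Step 3: only now select and convert the columns of interest.
mapping <- raw %>%
  select(ZCTA5, STATE, COUNTY) %>%
  mutate(STATE = as.numeric(STATE),
         COUNTY = as.numeric(COUNTY))

print(nrow(affected))
```

If affected has zero rows while probs does not, the failures are confined to columns I am not using.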

I love when tools make life easier! Even the error handling returns tidy data!

Tags: r statistics quick
