The modal user-facing workflow we are aiming for:
d_addr |> # user's data, contains addresses
gc_address(address, zip=zip_code, state="PA") |>
gc_prep_street_db() |>
gc_code_pt() # or gc_code_block()
Internal (see sysdata.R)
- states: FIPS lookup (full name, postal code, and AP abbreviation)
- counties: FIPS lookup
- street_dirs: standardized street direction lookup
- street_types: standardized street type lookup
Larger internal (see city_zip_county.R)
- city_zip_county.rds: ZIP/county/city combinations table built from HUD crosswalks, ZIP database, and USGS GNIS
- city_regex.rds: trie-ified regex matching all city names in the city_zip_county table
Regular expressions
We programmatically build regexes for address parsing.
These are pre-built in build_regex.R, with some higher-level regexes constructed on the fly.
The larger regexes are trie-based because they match many alternatives.
See also city_regex.rds above.
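As a rough illustration of what "trie-based" means here, the sketch below turns a vector of names into a single regex that shares common prefixes instead of listing every name as a separate alternative. It is not the code in build_regex.R; the function names `escape_chr` and `trie_regex` and the sample city names are assumptions made for this example.

```r
# Escape regex metacharacters in single characters (hypothetical helper).
escape_chr <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?-])", "\\\\\\1", x, perl = TRUE)
}

# Build a prefix-sharing (trie-shaped) regex from a character vector.
# Words are grouped by first character, and each group is handled recursively,
# which is equivalent to walking a trie from the root.
trie_regex <- function(words) {
  words <- unique(words)
  has_end <- any(words == "")   # some word ends exactly at this node
  words <- words[words != ""]
  if (length(words) == 0) return("")
  heads <- substr(words, 1, 1)
  tails <- substring(words, 2)
  branches <- vapply(split(tails, heads), trie_regex, character(1))
  alts <- paste0(escape_chr(names(branches)), branches)
  out <- if (length(alts) == 1 && !has_end) {
    alts
  } else {
    paste0("(?:", paste(alts, collapse = "|"), ")")
  }
  # If a shorter word ends here, everything after this point is optional.
  if (has_end) paste0(out, "?") else out
}

# Illustrative input only; the real city list comes from city_zip_county.
cities <- c("SALEM", "SALINA", "SALT LAKE CITY", "SAN JOSE", "YORK", "YORKTOWN")
city_rx <- trie_regex(cities)

# Anchor at the end of the string, as the city step matches at the end
# of the generic address column.
grepl(paste0("\\b", city_rx, "$"), "1 MAIN ST YORKTOWN", perl = TRUE)
```

Compared with a flat `A|B|C` alternation, the shared prefixes mean the regex engine backtracks far less when the list of alternatives is long.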
User-facing (see nc_addr.R)
- nc_addr: Random 1,000 addresses from Dare County, NC voter file
See address.R; test-address.R
Stage input: data frame with column(s) containing unparsed addresses
Stage output: tibble containing addresses standardized and parsed into columns
- Check for ZIP codes, first from the zip argument and then by regex on the generic address column
- Check for state names, first from the state argument and then by regex on the generic address column
- Check for city names, first from the city argument and then from a dictionary-based regex at the end of the generic address column
- Parse street name, type, prefixes and suffixes, and house and unit numbers from the remainder of the address column
- Standardize county name and code, if provided in the county argument
For all steps, if we parse a component from the generic address column, we remove that component before the next step. Thus "1 OXFORD ST CAMBRIDGE MA 02138" becomes, in order:
- "1 OXFORD ST CAMBRIDGE MA 02138"
- "1 OXFORD ST CAMBRIDGE MA"
- "1 OXFORD ST CAMBRIDGE"
- "1 OXFORD ST"
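The strip-as-you-go behavior in the example above can be sketched in base R as follows. This is only an illustration, not the code in address.R: the helper `extract_tail` is hypothetical, and the state and city patterns are tiny stand-ins for the real lookups and the dictionary-based city regex.

```r
# Hypothetical helper: if `pattern` matches at the end of `x`, return the
# matched component and the remaining string with that component removed.
extract_tail <- function(x, pattern) {
  rx <- paste0("^(.*?)[ ,]*\\b(", pattern, ")$")
  hit <- grepl(rx, x, perl = TRUE)
  list(
    value = ifelse(hit, sub(rx, "\\2", x, perl = TRUE), NA_character_),
    rest  = ifelse(hit, sub(rx, "\\1", x, perl = TRUE), x)
  )
}

addr <- "1 OXFORD ST CAMBRIDGE MA 02138"

# Each step works on what the previous step left behind.
zip   <- extract_tail(addr, "\\d{5}(?:-\\d{4})?")
state <- extract_tail(zip$rest, "MA|PA|NC")            # stand-in state list
city  <- extract_tail(state$rest, "CAMBRIDGE|BOSTON")  # stand-in city list

c(zip = zip$value, state = state$value, city = city$value, street = city$rest)
```

Anchoring every pattern at the end of the string is what makes the order matter: the state can only be found once the ZIP code has been removed, and the city only once the state has been removed.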
See prep.R; test-prep.R
Stage input: tibble containing addresses standardized and parsed into columns
Stage output: same as input (invisibly). As a side effect, downloads and organizes Census street and address data
- Build a list of county FIPS codes we need data for. This is a concatenation of:
  - Counties in address tibble
  - Counties crosswalked from ZIP codes
  - TODO: if no counties or ZIPs, get all counties in state. Consider doing this anyway as xwalk may not be complete
- Download data for each county:
  - Download Census EDGES, FACES, ADDRFEAT, and FEATNAMES files
  - Subset to appropriate columns and parse appropriate types
  - For ADDRFEAT, dedupe TLIDs/rows which have different address ranges (we don't use specific address ranges)
  - Save as parquet (or rds for the one geo column in EDGES)
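The county-list and dedupe steps above can be sketched in base R without any downloading. Everything here is an assumption for illustration: the data frames, their column names, and the placeholder ZIP and FIPS values are made up, and the actual logic in prep.R may differ.

```r
# Made-up stand-in for the parsed address tibble.
addr <- data.frame(
  zip         = c("00001", "00002", NA),
  county_fips = c(NA, "99001", "99003"),
  stringsAsFactors = FALSE
)

# Made-up stand-in for a ZIP-to-county crosswalk (a ZIP may span counties).
zip_xwalk <- data.frame(
  zip         = c("00001", "00002", "00002"),
  county_fips = c("99001", "99001", "99002"),
  stringsAsFactors = FALSE
)

# Concatenate counties given directly with counties crosswalked from ZIPs,
# then drop missing values and duplicates.
from_addr <- addr$county_fips
from_zip  <- zip_xwalk$county_fips[zip_xwalk$zip %in% addr$zip]
all_fips  <- c(from_addr, from_zip)
counties  <- sort(unique(all_fips[!is.na(all_fips)]))

# Made-up ADDRFEAT-like rows: TLID 1 appears twice with different ranges.
addrfeat <- data.frame(
  TLID     = c(1, 1, 2),
  FULLNAME = c("Oxford St", "Oxford St", "Main St"),
  LFROMHN  = c("1", "101", "1"),
  stringsAsFactors = FALSE
)

# Dropping the range column makes the two TLID 1 rows identical,
# so unique() collapses them into one.
addrfeat_dedup <- unique(addrfeat[, c("TLID", "FULLNAME")])
```

With these inputs, `counties` holds three codes and `addrfeat_dedup` has one row per TLID.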
See geocode.R