118th Congress District Shapefiles

I’m a big fan of minimizing external dependencies and one of the moments forming my opinion on this was dealing with mapping user location to Congressional district for 5 Calls. This is a key part of 5 Calls: you enter an address or zip code and we return a set of representatives for various levels of government, including their phone numbers and various metadata that’s useful to you.

The very first version of 5 Calls used the Google Civic API to fetch this data. It included a geocoder, so we could pass addresses, zip codes, etc. and get back a description of the federal representatives for that point. This worked OK and there was a generous free tier, but it was still an external API call adding to request latency, and the service was less than responsive to changes in representative information, especially one-off changes that happen outside of election cycles.

Eventually we moved to a different service that was a hobby project for another civic tech-minded programmer, but it ended up being overly complex and, being a hobby project, was even less up-to-date with the latest changes in Congressional representation. It did use Elasticsearch though, which had decent support for geospatial queries, so I spun out using Elasticsearch by itself for a while, adding some tools to spin up a dataset of district information from geojson files.

Elasticsearch was fast enough, but still an external service (not to mention an expensive one) that we needed to make an API call to before returning representative information. One day whilst fighting an upgrade to a new version and the AWS console all in one battle, I wondered how many polygons I could just fit in RAM and query using basic point-in-polygon algorithms. From my experimentation it turns out I could store all of the congressional districts easily in RAM (simplified, but acceptably so) and query them in much less time than an external API call took.

This simplified approach has been working great for the last few years: download district data from the unitedstates/districts repo on startup, then when a request comes in geocode an address or zip code and figure out which polygon it’s in. As is typical in programming, I thought my options were systems that optimized for searching thousands or tens of thousands of polygons when in reality I only needed to pick from ~450.
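To make the "basic point-in-polygon" idea concrete, here is a minimal sketch of the even-odd ray casting test plus a linear scan over districts. It is not the 5 Calls code: the `Point` and `District` types, the function names, and the two square "districts" are all made up for illustration, and it handles a single outer ring only (no holes, no multipolygons).

```go
package main

import "fmt"

// Point is a longitude/latitude pair in degrees.
type Point struct{ Lon, Lat float64 }

// District is a hypothetical in-memory record: an ID plus one outer ring.
// Real districts can be multipolygons with holes; this sketch ignores that.
type District struct {
	ID   string
	Ring []Point
}

// contains reports whether p is inside ring using the even-odd ray casting rule:
// count how many edges a ray going east from p crosses.
func contains(ring []Point, p Point) bool {
	inside := false
	for i, j := 0, len(ring)-1; i < len(ring); j, i = i, i+1 {
		a, b := ring[i], ring[j]
		// Does the edge a-b straddle the horizontal line through p?
		if (a.Lat > p.Lat) != (b.Lat > p.Lat) {
			// Longitude where the edge crosses that line.
			x := a.Lon + (p.Lat-a.Lat)*(b.Lon-a.Lon)/(b.Lat-a.Lat)
			if p.Lon < x {
				inside = !inside
			}
		}
	}
	return inside
}

// findDistrict scans every district linearly; with ~450 polygons that is cheap.
func findDistrict(districts []District, p Point) (string, bool) {
	for _, d := range districts {
		if contains(d.Ring, p) {
			return d.ID, true
		}
	}
	return "", false
}

func main() {
	// Two made-up square "districts" for illustration only.
	districts := []District{
		{ID: "XX-1", Ring: []Point{{0, 0}, {1, 0}, {1, 1}, {0, 1}}},
		{ID: "XX-2", Ring: []Point{{1, 0}, {2, 0}, {2, 1}, {1, 1}}},
	}
	id, ok := findDistrict(districts, Point{1.5, 0.5})
	fmt.Println(id, ok)
}
```

A linear scan is the whole point here: with a few hundred polygons there is no need for a spatial index.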

We’ve had a handful of states redistricting over the last few years which I had to handle individually, but the real test was the start of the 118th Congress when new districts from the 2020 census came into effect. Most states had their district boundaries modified in some way as the population distribution moved around in the state, and if a state changed population enough to either gain or lose a House seat the district boundaries could be significantly different as they needed to either make room for a new district or absorb the former population from a lost one.

I spent a couple of weeks digging up what tools to use and how to validate the new districts in a way that would let me manage all 50 states without doing too much manual work. Here’s my process:

1. Acquire Shapefiles

All states will produce a district shapefile (a format managed by Esri, one of the major GIS companies) and sometimes geojson or KML files. Shapefiles were the common denominator, so I only used those regardless of what else a state offered for download. Generally the congressional district shapefile is available on a legislature or state court website first, then eventually the census website. This part takes some googling.

2. Split And Convert

We (rather, the folks who run the unitedstates/districts repo) want each district in its own geojson file… alongside a KML file that has exactly the same info, but we’re interested in the geojson format for our own usage. Seeking to convert from a shapefile to geojson file leads to a number of tools and paths but a simple yet robust option seemed to be the mapshaper tool.

Combining a number of our tasks into one command, we can split a big shapefile into individual district geojson files, simplify the paths, and slim the overall file size by reducing the precision:

mapshaper -i GA.zip -split -simplify 15% -o ga/ format=geojson precision=0.000001

Our input, GA.zip here, is four shapefile components (dbf, prj, shp, and shx files) all zipped up into one archive. mapshaper is really powerful! I was surprised I could do so much with just one command, and there are lots of options for processing shape formats in various ways that I didn’t end up using.

Simplification reduces the number of points in the shape to a percentage of the original points, with some heuristics to maintain shape detail when possible. I tried to simplify into a similar filesize as before, i.e. if all of Alabama’s geojsons were ~500kb previously, I tried to hit that number again with the assumption that anyone currently reading the files into memory would be able to do the same with these updated shapes. Some of the sources are quite large and leaving them unsimplified would surely break some implementations that depend on this data.

I could probably use a more rigorous approach as to how complex the shapes should be for the purpose but in the absence of that, this seemed like the best way to aim for a particular size.

Reducing the precision to 6 decimal places means that we can only tell distances down to a tenth of a meter but that seems like a fair tradeoff for our usecase as well.
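The "tenth of a meter" figure can be checked with a bit of arithmetic. The sketch below assumes a spherical Earth with an equatorial circumference of roughly 40,075 km, which is an approximation, and the `resolution` helper is a name invented for this example. One degree is then about 111 km, so a step of 0.000001 degrees is about 0.11 m north-south, and somewhat less east-west as latitude increases.

```go
package main

import (
	"fmt"
	"math"
)

// metersPerDegree is the length of one degree along the equator, derived from
// an equatorial circumference of roughly 40,075 km (spherical approximation).
const metersPerDegree = 40075017.0 / 360.0

// resolution returns the approximate north-south and east-west size, in meters,
// of one coordinate step (in degrees) at the given latitude.
func resolution(step, latDeg float64) (ns, ew float64) {
	ns = step * metersPerDegree
	// Lines of longitude converge toward the poles, shrinking east-west distance.
	ew = ns * math.Cos(latDeg*math.Pi/180)
	return ns, ew
}

func main() {
	// The latitudes are arbitrary sample values.
	for _, lat := range []float64{0, 33, 45} {
		ns, ew := resolution(0.000001, lat)
		fmt.Printf("lat %2.0f: %.3f m north-south, %.3f m east-west\n", lat, ns, ew)
	}
}
```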

Sometimes this complains (warns, but doesn’t fail) on there not being a projection available. If you miss it during this pass, you’ll definitely notice the very large floats as points in your geojsons later. The solution is to force the wgs84 projection with -proj wgs84 as part of the mapshaper command.
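If you would rather catch a missing projection automatically than spot the large floats by eye, a range check is a cheap heuristic. This is a sketch, not part of the pipeline described here: `looksLikeLonLat` is a hypothetical helper and the sample coordinates are made up. It only tells you that values fall outside valid longitude/latitude ranges, which is what unprojected state-plane style coordinates typically do.

```go
package main

import "fmt"

// looksLikeLonLat reports whether every [x, y] pair falls inside the valid
// longitude (-180..180) and latitude (-90..90) ranges. Coordinates left in a
// projected system are usually far outside these ranges and fail the check.
func looksLikeLonLat(coords [][2]float64) bool {
	for _, c := range coords {
		if c[0] < -180 || c[0] > 180 || c[1] < -90 || c[1] > 90 {
			return false
		}
	}
	return true
}

func main() {
	// Made-up sample points: one plausible lon/lat ring, one projected-looking ring.
	wgs84 := [][2]float64{{-84.39, 33.75}, {-84.40, 33.76}}
	projected := [][2]float64{{2227000.5, 1368000.25}, {2227100.5, 1368100.25}}

	fmt.Println(looksLikeLonLat(wgs84))
	fmt.Println(looksLikeLonLat(projected))
}
```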

3. Validate

Now the nitty-gritty. How had each of these states formatted their files? Did they include relevant metadata for their districts? We needed to be sure that we had the right districts mapped to the right representatives across ~450 files without doing everything by hand - as well as creating folders and files in the correct place for the repo that we are contributing to [1].

There’s no great way around this: I had to parse the json, being flexible for various ways states had described their districts, and then reformat them correctly before writing out the files. Go is not a great choice for this given its somewhat strict JSON parsing behavior but I can always churn out some Go without much thought to syntax or special libraries so I picked that.

This mostly went without drama. I did assume originally that the shapefiles listed each district sequentially and numbered them as such before realizing that is absolutely not a good assumption and going back to parse whatever district number was in each shape metadata. The only hangup here was Michigan which for some reason misnumbered its districts in the original shapefile.
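To illustrate what "being flexible" about district metadata can look like, here is a minimal sketch that reads the district number out of a GeoJSON feature's properties rather than trusting file order. It is not the code from the fork: the property names in `candidateKeys` are guesses for illustration (each state's file has to be inspected for its real field names), and `districtNumber` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// feature holds only the part of a GeoJSON Feature this sketch needs.
type feature struct {
	Properties map[string]any `json:"properties"`
}

// candidateKeys are hypothetical property names, tried in order.
var candidateKeys = []string{"DISTRICT", "District", "DISTRICTNO", "CD118FP"}

// districtNumber returns the district number found under the first matching
// key, accepting either a JSON number or a numeric string.
func districtNumber(raw []byte) (int, error) {
	var f feature
	if err := json.Unmarshal(raw, &f); err != nil {
		return 0, err
	}
	for _, k := range candidateKeys {
		switch v := f.Properties[k].(type) {
		case float64: // JSON numbers decode to float64
			return int(v), nil
		case string: // e.g. "07" or " 7"
			if n, err := strconv.Atoi(strings.TrimSpace(v)); err == nil {
				return n, nil
			}
		}
	}
	return 0, fmt.Errorf("no district number in properties %v", f.Properties)
}

func main() {
	// Made-up features showing the kinds of variation to expect.
	samples := []string{
		`{"type":"Feature","properties":{"DISTRICT":"07"}}`,
		`{"type":"Feature","properties":{"District":3}}`,
		`{"type":"Feature","properties":{"NAME":"unknown"}}`,
	}
	for _, s := range samples {
		n, err := districtNumber([]byte(s))
		fmt.Println(n, err)
	}
}
```

Returning an error when no key matches, instead of falling back to the feature's position in the file, is what guards against the sequential-numbering assumption.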

The code in question is in my fork of the district repo (it probably will not be merged to the original one) and can be run with go run . -state GA.

4. Add KML

The repo wanted KML files sitting alongside the geojson files so I had to figure out how to generate KML files from geojson files. Unfortunately KML is not supported by mapshaper so I had to look elsewhere. One of the other options that I had originally considered for converting shapefiles was ogr2ogr from the GDAL library. It didn’t have the processing options I was looking for but it could easily turn a geojson file into a KML file, so a little bash was able to convert all the district files for each state:

# Convert each of Wisconsin's 8 district geojson files to KML
for i in {1..8}; do
    ogr2ogr WI/WI-$i/shape.kml WI/WI-$i/shape.geojson
done
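The same conversion can be driven from Go without hard-coding the district count. This is a sketch under assumptions rather than what I ran: it assumes the `STATE/STATE-N/shape.geojson` layout from the loop above, assumes ogr2ogr is on the PATH, and passes `-f KML` to name the output format explicitly. `kmlArgs` is a hypothetical helper.

```go
package main

import (
	"log"
	"os/exec"
	"path/filepath"
	"strings"
)

// kmlArgs builds the ogr2ogr arguments that convert one district's GeoJSON
// file to a KML file sitting next to it (destination first, then source).
func kmlArgs(geojsonPath string) []string {
	kml := strings.TrimSuffix(geojsonPath, ".geojson") + ".kml"
	return []string{"-f", "KML", kml, geojsonPath}
}

func main() {
	state := "WI" // example state; directory layout assumed as STATE/STATE-N/

	// Discover district files instead of assuming how many districts exist.
	matches, err := filepath.Glob(filepath.Join(state, state+"-*", "shape.geojson"))
	if err != nil {
		log.Fatal(err)
	}
	for _, m := range matches {
		cmd := exec.Command("ogr2ogr", kmlArgs(m)...)
		if out, err := cmd.CombinedOutput(); err != nil {
			log.Fatalf("converting %s: %v\n%s", m, err, out)
		}
	}
}
```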

Other than a couple minor fixes for importing the files into the 5 Calls API during startup, that was the whole processing pipeline for all 50 states’ worth of district files. Most states went smoothly through all the steps without any manual intervention but naturally the states with weirdness took a bit of time to work out the special cases.

I’m pretty happy now with both the way we consume the files and how they’re processed. I could easily redo this in ten years (!!!) and I imagine I’d only have to make minor changes.

[1] unitedstates/districts is supposed to be CC0-licensed data, i.e. reformatting of free data published by the government itself. I didn’t get all my data from sources that would be OK with republishing so I’ll wait until the census publishes the shapefiles before I submit something that can be PR-able to the original repo.