Portfolio

An (almost) comprehensive list of things I've built

Natural Capital Exchange (NCX)

Automated Stand Delineation

Part of our process in making recommendations to landowners involved delineating stand boundaries for their forested land. Typically, this is a fully manual process done by a forester. I wrote an image processing pipeline that delineated stands using NAIP imagery, canopy height derived from lidar, and various raster sources approximating forest type/species and tree age. This initially rolled out as a semi-automated tool for our foresters, but I eventually built a fully automated pipeline that could delineate stands in under 10 seconds.
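
For a sense of the kind of segmentation involved, here is a minimal sketch that runs graph-based segmentation over stacked NAIP bands and canopy height; the algorithm choice, file paths and parameters are illustrative placeholders, not the production pipeline.

```python
import numpy as np
import rasterio
from skimage.segmentation import felzenszwalb

# Assumed inputs: 4-band NAIP and a canopy height model (CHM) on the same grid.
with rasterio.open("naip.tif") as src:        # placeholder path
    naip = src.read().astype("float32")        # (bands, rows, cols)
    profile = src.profile
with rasterio.open("chm.tif") as src:          # placeholder path
    chm = src.read(1).astype("float32")

# Stack spectral bands and canopy height into one (rows, cols, channels) image,
# normalizing each channel so no single layer dominates the segmentation.
features = np.dstack([*naip, chm])
features = (features - features.min(axis=(0, 1))) / (
    np.ptp(features, axis=(0, 1)) + 1e-6
)

# Graph-based segmentation produces candidate stands as labeled regions;
# scale/min_size control how coarse the stands are.
labels = felzenszwalb(features, scale=400, sigma=1.0, min_size=2000)

# Write the label raster; polygonizing (e.g., rasterio.features.shapes) and
# boundary smoothing would follow in a real pipeline.
profile.update(count=1, dtype="int32")
with rasterio.open("stands.tif", "w", **profile) as dst:
    dst.write(labels.astype("int32"), 1)
```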

Automated Stand Delineation Example

Natural Capital Scores

Part of our product strategy at NCX was giving landowners personalized estimates of various natural capital assets on their land, including things like timber value, carbon storage potential and forest health. We explored several approaches to creating this data, from an FIA plot state-transition model to eventually using the US Forest Service's own Forest Vegetation Simulator (FVS). I created a distributed pipeline using Dask on Coiled to run FVS across tens of thousands of FIA plots in under an hour. The end product is an interactive widget that allows landowners to experiment with different management strategies and see how they affect their natural capital scores.
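
The fan-out looked roughly like the sketch below; the run_fvs wrapper, cluster size and plot IDs are placeholders, not the production code.

```python
import coiled
from dask.distributed import Client, as_completed

def run_fvs(plot_id: str) -> dict:
    """Run one FVS simulation for a single FIA plot and return summarized outputs."""
    # In practice this writes an FVS keyword file, shells out to the FVS binary,
    # and parses the projection tables it produces; stubbed here.
    return {"plot_id": plot_id, "carbon_tons": None}

cluster = coiled.Cluster(n_workers=100)   # scale horizontally across plots
client = Client(cluster)

plot_ids = ["0001", "0002"]               # in production: tens of thousands of FIA plots
futures = client.map(run_fvs, plot_ids, retries=2)

results = [future.result() for future in as_completed(futures)]
```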

Natural Capital Scores Example

Land Explorer App

NCX offers a personalized land report called "Natural Capital Advisor," or NCA. The goal of NCA is to provide personalized insights that align a landowner's goals with a feasible management plan and to offer a variety of potential programs to help fund the plan's implementation. This is a largely manual process, wherein a trained expert must analyze the property in order to make a recommendation. An earlier product we tested at NCX was a land explorer app, known internally as the "Atlas" App. This application aggregates a variety of natural resource and earth observation data sources and makes them available via an interactive front end. It did not resonate with users in initial testing, but I eventually repurposed the Atlas App to help us internally assess a landowner's property for NCA reports. The Atlas App includes everything from timber volume and price estimates to in-depth soil analysis and localized flora and fauna prevalence.

Atlas App Example

Lidar Point Cloud Maps

One experiment I ran at NCX was generating compelling 3D visuals using openly available lidar data from USGS. I built a simple pipeline using PDAL to process the lidar, including colorization with NAIP or Hexagon aerial imagery. I also experimented with visualizing the point clouds in Deck.gl. Pictured here is a collage of some of the maps I generated, including Tower Grove Park in St. Louis.
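
A minimal sketch of such a PDAL pipeline is below; the EPT endpoint, bounds and filter options are placeholders rather than the exact pipeline I ran.

```python
import json
import pdal

pipeline_spec = [
    {
        "type": "readers.ept",
        "filename": "https://.../ept.json",   # placeholder USGS 3DEP entwine endpoint
        "bounds": "([-10050000, -10040000], [4670000, 4680000])",  # placeholder bounds
    },
    {
        # Drop low-noise points (class 7) before visualization.
        "type": "filters.range",
        "limits": "Classification![7:7]",
    },
    {
        # Paint RGB onto each point from the aerial imagery.
        "type": "filters.colorization",
        "raster": "naip.tif",
        "dimensions": "Red:1:256.0, Green:2:256.0, Blue:3:256.0",
    },
    {
        "type": "writers.las",
        "filename": "colorized.laz",
        "compression": "laszip",
    },
]

pipeline = pdal.Pipeline(json.dumps(pipeline_spec))
count = pipeline.execute()
print(f"processed {count} points")
```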

Collage of Example Lidar Maps

Loss Detection

A critical component of land analysis is loss detection. At NCX, we tried many approaches to loss detection, from our own in-house methods to using the OPERA dataset from NASA.

Example Loss Detection Map

Timber Value Estimates

In valuing natural capital, timber serves as an important baseline for comparison. I built a timber pricing model, scraping numerous data sources for timber prices and combining them with volume estimates to generate accurate estimates of timber value. We eventually added a number of proprietary data sources to track timely changes in timber value as the market moved.

Example Timber Value Estimates

Land Assessment Platform (NCAPI)

With the launch of NCX 2.0, focused on matching landowners with available opportunities for natural capital developments, we needed to build an entirely new pipeline to stream assessment results to the end-user platform. We built a FastAPI application with a Celery backend, fully integrated with the microservices of the eoAPI stack developed by DevSeed. This brought a wealth of visualization features to the platform, as well as intelligent cataloging and management of raster data. I fully migrated and deprecated our Databricks implementation, saving us thousands of dollars per month in compute costs. Developed with Henry Rodman.
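
A hedged sketch of the request/worker split is below; the route names, broker URLs and task body are illustrative, not NCAPI's actual API.

```python
# tasks.py -- Celery worker side
from celery import Celery

celery_app = Celery("ncapi", broker="redis://localhost:6379/0",
                    backend="redis://localhost:6379/1")

@celery_app.task
def run_assessment(parcel_id: str) -> dict:
    # Long-running geospatial work happens here (raster stats, STAC lookups, etc.)
    return {"parcel_id": parcel_id, "status": "complete"}

# main.py -- FastAPI request side
from fastapi import FastAPI
from celery.result import AsyncResult

app = FastAPI()

@app.post("/assessments/{parcel_id}")
def submit(parcel_id: str):
    # Enqueue the heavy work and return immediately with a task handle.
    task = run_assessment.delay(parcel_id)
    return {"task_id": task.id}

@app.get("/assessments/{task_id}")
def status(task_id: str):
    # The platform polls this endpoint and streams results as they complete.
    result = AsyncResult(task_id, app=celery_app)
    return {"state": result.state,
            "result": result.result if result.ready() else None}
```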

NCAPI Architecture Diagram

Spark Raster Pipeline (CONUSDB)

As we migrated an increasing amount of our data to Spark on Databricks, we needed a way to efficiently process and analyze raster data at scale. I built a large pipeline in Spark that allowed us to ingest, process and analyze raster data across the continental United States. We explored a number of raster interpolation and storage methods, ultimately implementing our own solutions before raster support matured in Apache Sedona or Databricks Mosaic. An interesting memo I came across while developing this: "A Pixel is Not a Square"
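
The ingestion step was conceptually similar to the sketch below, which tiles a catalog of COGs into tabular records; the window size, schema and paths are illustrative rather than the CONUSDB design.

```python
import rasterio
from rasterio.windows import Window
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("raster-ingest").getOrCreate()

def tile_records(path, tile_size=512):
    """Yield one summary row per tile_size x tile_size window of a raster."""
    with rasterio.open(path) as src:
        for row_off in range(0, src.height, tile_size):
            for col_off in range(0, src.width, tile_size):
                window = Window(col_off, row_off,
                                min(tile_size, src.width - col_off),
                                min(tile_size, src.height - row_off))
                data = src.read(1, window=window, masked=True)
                left, bottom, right, top = src.window_bounds(window)
                mean_val = float(data.mean()) if data.count() > 0 else float("nan")
                yield Row(path=path, col_off=col_off, row_off=row_off,
                          xmin=left, ymin=bottom, xmax=right, ymax=top,
                          mean=mean_val)

cog_paths = ["s3://bucket/raster_a.tif", "s3://bucket/raster_b.tif"]  # placeholders
records = spark.sparkContext.parallelize(cog_paths).flatMap(tile_records)
df = spark.createDataFrame(records)
df.write.format("delta").mode("overwrite").saveAsTable("conus_raster_tiles")
```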

Sketch of Raster Interpolation Methods

Parcel Ownership Reconciliation

Working with continental-scale parcel data was a necessity for some of the analyses we wanted to perform. Crucially, we needed to know the full land holdings of an individual entity across the United States. We had around 150 million individual property records, and we wanted to be able to query every property under a given owner's control. To achieve this, I used a variety of natural language processing techniques and a network simplification method to build a CONUS-wide dataset of all properties owned by a single entity.
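
A toy version of the graph idea is sketched below: normalize owner names, link records that share a normalized name or mailing address, and treat connected components as entities. The production pipeline used considerably more NLP; the records and regex here are illustrative only.

```python
import re
import networkx as nx

SUFFIXES = re.compile(r"\b(llc|inc|trust|trustee|co|corp|ltd)\b\.?")

def normalize(name: str) -> str:
    """Crude owner-name normalization: lowercase, strip punctuation and suffixes."""
    name = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    name = SUFFIXES.sub(" ", name)
    return " ".join(name.split())

parcels = [  # (parcel_id, owner_name, mailing_address) -- toy records
    ("001", "Smith Family Trust", "10 Main St"),
    ("002", "SMITH FAMILY TRUST LLC", "10 Main St"),
    ("003", "Acme Timber Co.", "PO Box 9"),
]

G = nx.Graph()
for pid, owner, addr in parcels:
    # Linking through shared attribute nodes avoids building an O(n^2) edge list.
    G.add_edge(("parcel", pid), ("name", normalize(owner)))
    G.add_edge(("parcel", pid), ("addr", addr.lower()))

# Each connected component approximates one owning entity's full holdings.
for i, component in enumerate(nx.connected_components(G)):
    holdings = [node_id for kind, node_id in component if kind == "parcel"]
    print(i, holdings)
```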

Automated Docker Lifecycle / Debugging

One of the pain points of working on a mid-sized engineering team was dependency management, rigorous testing and an automated deploy schedule. More than a dozen data scientists and engineers shared a single monolithic Docker image. Dependency requirements often conflicted or fell out of date, and upgrades sometimes caused silent failures. Several other engineers and I built a robust pipeline of GitHub Actions to improve dependency testing of new feature branches and handle routine upgrades automatically. Of particular impact were just 6 lines of code that allowed us to reproduce CI failures within GitHub Actions, letting us debug issues without needing to run the code locally.

Example Workflow for Interactive Debugging of CI

Crediting Methodology & Audit

Essential to the integrity of our carbon crediting program at NCX was the methodology we used to generate the actual carbon credits we offered for sale. I implemented much of the core crediting logic, including a full process to make intermediate data available for a third-party audit.

The crediting methodology as well as the creditr R package I implemented can be found here

Data Warehouse Migration

When I arrived at NCX, the standard procedure for recording data was writing files to cloud storage. This made it difficult to run large analytical workloads across datasets and caused further issues due to a lack of schema enforcement. One of my first large infrastructure projects at NCX was migrating us to a single source of data with fixed schemas. We implemented a short-term solution with BigQuery, but I later migrated the majority of our datasets into Spark on Databricks. My solution involved a thorough schema normalization process and the use of file streaming to automatically catalog newly generated data while still maintaining our legacy processes.

Benchmarking Platform

One of our concerns as we scaled our carbon crediting pipeline at NCX was the integrity of the underlying models we used to predict and generate carbon credits. Specifically, we used a series of models in combination to predict harvest risk, prioritize ground sampling, detect loss events and project carbon yields. Given the initiative to benchmark each of these models for continuous improvement across future cycles, I was tasked with creating an MLOps system that could work across models in different environments. For this, I used MLflow hosted on Databricks. I worked with the data scientist responsible for each model and created a set of metrics to be cataloged for each model training. I implemented the logic to generate and upload these metrics to our MLflow instance and created an ETL pipeline to combine them into a Metabase dashboard for the team to review and guide future data investments.
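
The per-model logging followed the general shape below; the experiment name, parameters and metric values are illustrative, not actual benchmark numbers.

```python
import mlflow

# Assumes a Databricks-hosted MLflow tracking server is configured for this profile.
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/harvest-risk-benchmarks")   # placeholder experiment path

with mlflow.start_run(run_name="harvest_risk_v4"):
    mlflow.log_params({"model": "gradient_boosting", "cycle": "2022Q3"})
    # Each model owner logged an agreed-upon metric set so models could be compared
    # across cycles in a single downstream Metabase dashboard.
    mlflow.log_metric("auc", 0.87)
    mlflow.log_metric("brier_score", 0.11)
```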

Diagram of NCX Model Pipeline

Mill Pressure Model

The feasibility of a timber harvest is largely constrained by the distance to and availability of wood processing facilities for the harvested timber. I constructed an isochrone model that considered the characteristics of log-hauling vehicles, road limitations and mill locations by wood product type and capacity. This was another key input to our harvest risk model.
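
Stripped to its core, the reachability calculation resembles the sketch below (Dijkstra with a travel-time cutoff from each mill); the real model adjusted speeds for truck characteristics and road class, and the edges here are placeholders.

```python
import networkx as nx

G = nx.Graph()
# Edges carry travel time in minutes (segment length / assumed truck speed).
G.add_edge("mill", "junction_a", travel_time=35)
G.add_edge("junction_a", "stand_1", travel_time=20)
G.add_edge("junction_a", "stand_2", travel_time=70)

MAX_HAUL_MINUTES = 90
reach = nx.single_source_dijkstra_path_length(
    G, "mill", cutoff=MAX_HAUL_MINUTES, weight="travel_time"
)

# Nodes within the cutoff form the mill's isochrone; overlaying isochrones by wood
# product type and mill capacity gives a stand-level "mill pressure" signal.
print(reach)   # {'mill': 0, 'junction_a': 35, 'stand_1': 55}
```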

Example Mill Isochrone Map

Timber Price Scrapers

Timber price estimates were an integral part of our harvest risk model for determining the economic pressure to harvest specific tree stands. I built a robust set of multi-modal scraping utilities to extract timber stumpage prices from a variety of sources, using XML parsing, headless browser automation and PDF/image OCR. I then created a standardized schema, complex unit transformations (there are as many as 87 standard measures of timber volume!) and a pipeline for updating these data quarterly.

St. Louis Regional Data Alliance

Regional Data Exchange

The core mission of the RDA was to put data in the hands of stakeholders, spanning from highly technical academics and agencies to less technical members of the public. In addition to the APIs that we released, we built an openly accessible portal that catalogs all known regional datasets.

Available at rdx.stldata.org

Homepage of the Regional Data Exchange

Open Data Commons

Data is powerful, but only in the hands of the right people. APIs are a powerful way for programmers to access data in an organized way, but for less technically proficient consumers, a REST API may be a barrier to data access. Recognizing this, we set out to develop the data commons. Just as Swagger and ReDoc have made the process of documenting REST APIs trivial, the Data Commons aims to deliver data in an easily consumable format (CSV) and also offer quick visualization of queried data.

Developed with Nico Stranquist, demo available here

Example Page of the Open Data Commons

St. Louis Vacancy Portal

The St. Louis Vacancy Portal was the culmination of years spent developing a classification method and data streams capable of producing routinely updated estimates of property vacancy, giving legal officials the resources necessary to pursue abatement.

I built and maintain a scraper to ingest streams of property data from the City of St. Louis that feed into a custom web application developed by Dave Menninger.

Actively Maintained: www.stlvacancytools.com

Vacancy Explorer App Screenshot

An Open Stack for Open Data

One challenge we encountered at the Regional Data Alliance was developing in the open with a low barrier to entry. We had a number of volunteers and external contributors who would often work for short or unpredictable periods, and because everything we developed was open, including infrastructure, we wanted a repeatable technology pattern that would work across projects and repositories. I presented our technical stack at a local technical meetup and used this as an opportunity to recruit new contributors.

See the presentation here: An Open Stack for Open Data

Standard Parcel Data (REDB)

One of the first projects developed through the Regional Data Alliance was the so-called "regional entity database," or REDB. The project was initially led by Johnathan Leek, and I played a significant role in launching the first openly available dataset of standardized parcel data in the region.

We used this to generate our first estimates of vacancy, but later replaced it with a solution built in collaboration with internal stakeholders at the City of St. Louis.

During the course of this project, I participated in an AI Accelerator program sponsored by DataKind and Microsoft.

Unmaintained: REDB Github

RDA Member Directory and Forum

At RDA, building community was a core tenet of our mission.

I built a member directory associated with a Discourse forum, where data alliance members could discuss their data needs, open roles, and events going on in the community. Not Active, Live Here

Homepage of RDA Member Directory and Forum

Regional Data APIs

One of the first things we built at RDA was a variety of APIs to access data like crime incidents, property vacancy and a variety of demographic and health metrics. This later fueled our development of the data exchange, dashboards and the open data commons.

Essential Worker Analysis

During the COVID-19 pandemic, there was great concern surrounding the vulnerability of essential workers, those with occupations that cannot be performed remotely or without exposure to other people. Recognizing this, the Regional Data Alliance developed a workstream to analyze and disseminate findings about this population in Missouri and Illinois. The final product was a dashboard shared with a number of local agencies, community organizations and other members of the RDA. Developed with Theodore Moreland. Demo Available Here

Banner from Essential Workers Dashboard

Washington University School of Medicine in St. Louis

Projecting COVID-19 hospitalizations

Perhaps the most important metrics during the COVID-19 pandemic were the frequency of hospitalization and total hospital utilization. Measuring confirmed cases through tests alone is extremely prone to under-ascertainment. Hospitalizations, however, are nearly unavoidable and offer direct insight into the true severity of COVID-19 transmission. Hospital utilization also signals mortality risk most directly, both in the progression of COVID-19 infections and in the capacity of healthcare systems to address other critical forms of medical treatment.

Developed with Joshua Schwab and numerous collaborators, I maintained a fork of the "LEMMA" model for the state of Missouri.

The model was updated weekly with inputs from all health systems across the state, aggregated by our partners at the Missouri Hospital Association, and used both to project future hospitalizations and simulate the effect of various public health interventions.

Model results were shared prominently with stakeholders across the state, including the Governor's office, the Missouri Hospital Association, the St. Louis Covid Task Force, local health agencies and the largest healthcare providers in the state, among others.

Demo Available

Example LEMMA Model Output

Statistical Measures of Equality

During COVID, there were many questions about the equitable distribution of vaccines and the inequitable distribution of cases and morbidity related to COVID infections. This motivated an inquiry into statistical measures of inequality and their interpretation for use in public health. I investigated a number of popular indices and even developed some novel methods of my own.
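
One of the standard indices explored was the Gini coefficient; the sketch below computes it over an illustrative vector of vaccination coverage rates (the novel methods are not reproduced here).

```python
import numpy as np

def gini(x: np.ndarray) -> float:
    """Gini coefficient via the rank-based formulation on sorted values."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    # G = (2 * sum_i i * x_(i)) / (n * sum x) - (n + 1) / n, with 1-based ranks i
    ranks = np.arange(1, n + 1)
    return 2 * np.sum(ranks * x) / (n * x.sum()) - (n + 1) / n

coverage = np.array([0.12, 0.35, 0.41, 0.58, 0.73])  # illustrative rates by region
print(round(gini(coverage), 3))
```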

Demo Here and R Package Index Here

Time Series of Global Vaccine Equity Distribution

Spatial Correlates of Violent Crime

At the intersection of my research interests in violent crime and vacancy, I conducted an analysis using Risk Terrain Modeling to observe the relationship between vacant property and violent crime (homicide and aggravated assault).

We found vacancy to be among the most predominant drivers of violent crime, and observed varying geographic effects of risk terrain modeling in constrained local models.

Demo Available and Publication Here.

Map of Risk Terrain Model of Violent Crime in St. Louis

Private Geospatial Infrastructure

Most of my research at the time used highly sensitive patient records. In addition to implementing best practices for data security, I built a variety of private geospatial infrastructure, including a custom-built geocoder and road network routing solution used to compute a variety of spatial metrics in research with collaborators.

I originally developed both my own road-interpolation geocoder and a weighted road network graph simulator, and later switched to more scalable open source solutions including Pelias and Graphhopper.

Route Map of Individuals Seeking Covid-19 Vaccinations

LIVE Dashboard

A large research team compiled hundreds of implementation methods papers addressing HIV in low- and middle-income countries. We wanted to disseminate this research in a highly interactive manner, and I was tasked with building an app to make this happen.

The result is a highly interactive Shiny dashboard that has been viewed by hundreds of researchers.

Actively Maintained: live.bransonf.com

Publication: JIAS

Homepage of LIVE Dashboard

Vaccine Distribution

The availability of effective COVID-19 vaccines in early 2021 presented a huge opportunity to begin ending the pandemic. It also presented a number of logistical challenges for a variety of agencies to handle.

As a consultant to the Missouri governor's office, my team and I provided a variety of resources and analytics in directing vaccine and testing distribution across the state.

We also worked with local health authorities to prioritize strategic vaccination sites.

Graphics from Covid Testing and Vaccine Distribution Analysis
Vaccine Site Prioritization Map

Scraping Hospital Price Data

A small, untapped research agenda of mine is understanding how the economics of hospital pricing influence the populations most served. The untested hypothesis is that higher prices or poor network benefits at more geographically proximal institutions put upward pressure on lower-priced medical facilities, which becomes a particular burden in seasonal periods of high utilization.

The difficulty in pursuing this research inquiry is the lack of transparency in hospital service price reporting, even despite government regulations requiring it.

I participated in a dolthub bounty and contributed a significant enough amount of data to put me on the leaderboard.

Bibliometric Network Analysis

A key component of dissemination and implementation (D&I) science is promoting scientific findings through collaborative research efforts. To better understand the impact of this cross-collaboration, I was tasked with producing an analysis of D&I faculty at WashU and their relative contributions to the collaborative D&I network.

To achieve this, I first consulted with our research librarian, and then scraped several publishing databases to produce a list of all works published by select faculty and their collaborators.

With these data in hand, I applied a number of network analyses to measure key metrics like network centrality and weighted distance rankings. This identified a number of potential mechanisms for distributing grant funding and highlighted the importance of both intra- and inter-institutional collaboration in facilitating large, effective research networks.
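
Condensed to a sketch, the approach builds a co-authorship graph from publication author lists and ranks authors by centrality; the records below are placeholders.

```python
from itertools import combinations
import networkx as nx

publications = [  # each record is the author list of one scraped publication
    ["Faculty A", "Faculty B", "Collaborator X"],
    ["Faculty A", "Collaborator Y"],
    ["Faculty B", "Collaborator X", "Collaborator Z"],
]

G = nx.Graph()
for authors in publications:
    for a, b in combinations(sorted(set(authors)), 2):
        # Edge weight counts the number of co-authored papers (tie strength).
        current = G.get_edge_data(a, b, default={"weight": 0})["weight"]
        G.add_edge(a, b, weight=current + 1)

degree = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)   # unweighted shortest-path centrality
for author in sorted(G, key=lambda a: -betweenness[a]):
    print(author, round(degree[author], 2), round(betweenness[author], 2))
```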

Network Analysis of D&I Faculty

STL Crime Dashboard

Crime has long been a public issue for the City of St. Louis and has fueled decades of divisive discourse. Understanding how we got here and what to do about it was the research inquiry that led to the start of my career in public health.

I joined the Gun Violence Prevention Center at Washington University's Institute for Public Health and applied my skillset across a variety of research into crime, econometrics and violence prevention.

Among the earliest products to come out of this work was a public dashboard that allowed anyone to observe any cross-section of crime they were interested in, including custom maps, time series trends and environmental covariates of crime.

This was a major undertaking, involving a process to scrape, transform and normalize a variety of data and a full-stack application to interact with these data.

Unfortunately, the project became unmaintained when SLMPD stopped publicly releasing data at the end of 2020. Fortunately, after some pressure from a local news outlet, the city has resumed publicly releasing data, and I may revisit launching the crime dashboard in the future.

Homepage of the STL Crime Dashboard

Center Initiative Tracker

The Center for Dissemination and Implementation at Washington University in St. Louis was a newly established center during my time at WUSM.

As the leading technical member of the center, I was tasked with creating a visually appealing dashboard that we could share among stakeholders. Importantly, any member of the team needed the ability to update the dashboard in near real time.

To work within this constraint and provide a familiar interface to all members of the team, I built the dashboard around a Google Sheet. This provided a single source of truth, updatable in real time, and made it quick to add or remove members.

Demo Available

Access to Care Analyses

One theme of my public health research was using geospatial data to inform novel analyses relating to access to care, or the ability for individuals seeking treatment to receive prompt and appropriate care.

With my collaborators, we produced an analysis observing thousands of individuals seeking testing for sexually transmitted infections via the emergency department.

In our published findings, we found that, remarkably, almost a third of all care-seekers missed an opportunity to use a sexual health clinic that was more geographically accessible and would have offered them free, walk-in services.

Access to Care Analysis Map

Contact Tracing App

Part of my responsibilities while at WUSM was working closely with our partners in the City's Department of Health. At the start of the pandemic, the DoH was struggling to process the rapid influx of data coming from hospitals, laboratories and other departments.

I built the initial prototype for transforming all of the data into a standard format before handing off my work to some volunteer consultants from a local firm.

HL7 Parser

As part of my efforts in developing the solution for contact tracing COVID-19 cases in St. Louis, Missouri, I built a custom HL7 parser in R to read the esoteric HL7 messaging format used by medical laboratories. Using this, we were able to identify thousands of positive test results days before they appeared in statewide reporting.
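
The hl7r package itself is written in R; the Python sketch below only illustrates the structure it parses (carriage-return-delimited segments, pipe-delimited fields, caret-delimited components) using a fabricated message.

```python
SAMPLE = (
    "MSH|^~\\&|LAB|ACME|DOH|STL|202012011200||ORU^R01|123|P|2.3\r"
    "PID|1||000123||DOE^JANE||19800101|F\r"
    "OBX|1|CWE|94500-6^SARS-CoV-2 RNA^LN||Detected|||A\r"
)

def parse_hl7(message: str) -> list[dict]:
    """Split an HL7 v2 message into segments of pipe-delimited, indexed fields."""
    segments = []
    for raw in filter(None, message.split("\r")):
        fields = raw.split("|")
        segments.append({"type": fields[0], "fields": fields})
    return segments

for seg in parse_hl7(SAMPLE):
    if seg["type"] == "PID":
        print("patient:", seg["fields"][5].split("^"))   # PID-5: patient name components
    elif seg["type"] == "OBX":
        print("result:", seg["fields"][5])               # OBX-5: observation value
```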

Unmaintained: hl7r Github

Saint Louis University

Testing sensitivity to MAUP with Polyominoes

In unfinished work, I began to develop a sensitivity analysis for the Modifiable Areal Unit Problem (MAUP) using polyominoes.

This unconventional approach to producing a randomized sampling kernel using polyominoes offered a robust stochastic method both for the measurement of MAUP sensitivity and for the re-allocation of aggregate spatial data for further analysis. Its main shortcoming, however, was the massive amount of compute required to scale the kernel size over larger or higher-resolution geographic data.

Made Possible by Matt Busche's Polycube program

You can also see some of my experiments with Polyomino-packing on this now-archived homepage of my website.

A Mosaic of the United States using Pentominoes

Patterns of Civic Engagement

Part of my research as an undergraduate involved exploring new sources of data for analyzing and understanding cities. This included 311 data, or requests for non-emergency city services. In the City of St. Louis, these are handled by the Citizens' Service Bureau, or CSB.

One inquiry that I had about these data was how representative they are of localized civic engagement. The hypothesis was that areas of low voter participation would also have a low utilization rate of the CSB.

The results were remarkably well aligned with my hypothesis, showing a high correlation of voter participation with CSB utilization, and the effect size was greater in local elections compared to the presidential election.

Building an Open Source Geocoder

Geocoding, the act of converting an address string into a numeric location, was a central part of my education in GIS and a recurrent necessity in my research. After dealing with a number of nuanced datasets that required specific preprocessing to achieve desirable geocodes, Chris Prener and I decided to build our own geocoder.

We built our own geocoder in R and compared it with a number of commercial geocoding options available at the time, including several free sources, and published our findings in Transactions in GIS.

Example Geocoder Output Comparison

areal R package

The Modifiable Areal Unit Problem (MAUP) was a key component of my studies and research in GIS.

With the areal R package, Chris Prener and I implemented a variety of methods for areal weighted interpolation using the simple features (sf) framework in place of the older sp framework for spatial data in R.

Dasymetric, Pycnophylactic, Regression and Hybrid methods were implemented for both spatially intensive and extensive data types.
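
areal is an R package; the geopandas sketch below illustrates only the simplest of these methods, extensive areal-weighted interpolation, where each source value is split among target zones in proportion to overlapping area. Layer and column names are placeholders.

```python
import geopandas as gpd

source = gpd.read_file("census_tracts.gpkg")   # carries an extensive "population" column
target = gpd.read_file("neighborhoods.gpkg")
source = source.to_crs(target.crs)

source["source_area"] = source.geometry.area
target["target_id"] = target.index

# Intersect the layers, then split each source value by its share of overlapping area.
pieces = gpd.overlay(source, target, how="intersection")
pieces["weight"] = pieces.geometry.area / pieces["source_area"]
pieces["population_est"] = pieces["population"] * pieces["weight"]

estimates = pieces.groupby("target_id")["population_est"].sum()
print(estimates)
```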

Currently Maintained by Chris Prener: areal Github

Example Dasymetric Interpolation Process

Creating a Definition of Vacancy

St. Louis, Missouri has experienced one of the most salient urban declines in the country, resulting in a rise in property vacancy. It was long suspected that this was pivotal in a variety of worsened outcomes, but a clear definition of vacancy and the corresponding number and location of vacant properties were unknown. I played a foundational role in building the definition, data infrastructure and dissemination of the number and locations of vacant properties in St. Louis.

I became heavily involved in the St. Louis Vacancy Collaborative including the creation and implementation of the Vacant Property Methodology.

Equity in Historic Mortgage Lending

In my undergraduate honors thesis, I explore the historic demographic shift of the City of St. Louis through a lens of mortgage lending incentives and practices. I analyze Home Mortgage Disclosure Act (HMDA) data over time, in addition to various demographic and econometric sources, and elucidate a clear effect of divestment on subsequent demographic and economic decline.

Photo of a Vacant Property in St. Louis (Branson Fox, 2017)

censusxy R package

While doing a variety of research with large datasets, such as county voter files or police dispatch records, geocoding became a complicated and expensive process. Often, the potential cost of geocoding these records made certain inquiries infeasible from the start. At some point, I discovered the US Census Bureau's public geocoding service. Soon after, I developed an R package to automatically geocode dataframes using this API.
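
censusxy is written in R; the sketch below calls the public endpoint it wraps (single-address lookup shown), with parameter values reflecting the API as I understand it and worth checking against the Census Bureau's documentation.

```python
import requests

def census_geocode(address: str) -> tuple[float, float] | None:
    """Return (longitude, latitude) for an address via the Census Bureau geocoder."""
    resp = requests.get(
        "https://geocoding.geo.census.gov/geocoder/locations/onelineaddress",
        params={"address": address, "benchmark": "Public_AR_Current", "format": "json"},
        timeout=30,
    )
    resp.raise_for_status()
    matches = resp.json()["result"]["addressMatches"]
    if not matches:
        return None
    coords = matches[0]["coordinates"]
    return coords["x"], coords["y"]

print(census_geocode("20 N Grand Blvd, St. Louis, MO"))
```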

Currently Maintained by Chris Prener: censusxy Github

Scooter Distribution

As part of the permitting process to allow dockless vehicle (bicycle & scooter) companies to operate in the City of St. Louis, companies are required to park 16% of their fleet in designated equity zones.

After observing a heavy concentration of vehicles in the more affluent central corridor of the City, I was determined to validate whether these equity targets were in fact being met.

After being denied access to the data by a number of rideshare companies, I reverse-engineered their APIs and logged vehicle locations for several weeks.

After collecting several weeks of data, I was confident that the operating companies were only loosely adhering to this stipulation in the permit. Neither local news outlets nor the City Traffic Division was interested in evaluating my data.
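
The provider endpoints were reverse-engineered and are not reproduced here; the logging loop looked roughly like the sketch below, which assumes a GBFS-style free_bike_status feed at a placeholder URL.

```python
import csv
import time
from datetime import datetime, timezone

import requests

FEED_URL = "https://example.com/gbfs/en/free_bike_status.json"   # placeholder

while True:
    observed_at = datetime.now(timezone.utc).isoformat()
    bikes = requests.get(FEED_URL, timeout=30).json()["data"]["bikes"]
    with open("scooter_locations.csv", "a", newline="") as f:
        writer = csv.writer(f)
        for bike in bikes:
            writer.writerow([observed_at, bike["bike_id"], bike["lat"], bike["lon"]])
    time.sleep(15 * 60)   # sample every 15 minutes over several weeks
```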

GIF of Scooter Distribution in Equity Zones Over Time

Barriers to Emergency Response

In unpublished work titled "Detours" I investigate the impact of street closures, meant to deter crime, on potential emergency response times.

I created two detailed road network models representing the City of St. Louis, with and without these barriers, including speed limits and turn penalties. I then ran simulated routes originating from various emergency service locations and reaching every building footprint in the City.
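
At its core, the comparison is a multi-source shortest-path problem over two versions of the road graph; the networkx sketch below uses placeholder nodes and travel times rather than the licensed network data.

```python
import networkx as nx

def response_times(G, stations, destinations, weight="travel_time"):
    """Minimum travel time from any emergency station to each destination node."""
    lengths = nx.multi_source_dijkstra_path_length(G, stations, weight=weight)
    return {d: lengths.get(d, float("inf")) for d in destinations}

open_network = nx.DiGraph()
open_network.add_weighted_edges_from(
    [("fire_1", "a", 2.0), ("a", "b", 1.5), ("b", "house_1", 1.0),
     ("a", "house_1", 4.0)],
    weight="travel_time",
)

# Removing the a->b segment simulates a street closure/barrier, forcing a detour.
closed_network = open_network.copy()
closed_network.remove_edge("a", "b")

stations, houses = {"fire_1"}, ["house_1"]
print(response_times(open_network, stations, houses))    # {'house_1': 4.5}
print(response_times(closed_network, stations, houses))  # {'house_1': 6.0}
```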

After completing my undergraduate studies, I lost access to the license necessary to use the road network data and stopped pursuing this line of research.

This built on Chris Prener's street closure research.