Benchmarking U.S. Water Utilities with EPA Data
This project benchmarks U.S. water utilities using data from the Environmental Protection Agency (EPA). The goal is to analyze performance metrics to identify best practices and areas for improvement in water service delivery. By consolidating this data, the project aims to improve operational efficiency, promote sustainability, and support compliance with regulatory standards.

Introduction
Access to clean and affordable drinking water is one of the most essential public services in the U.S. However, measuring utility performance, ensuring compliance, and assessing affordability have always been challenging. Public datasets such as the EPA’s Safe Drinking Water Act (SDWA) reporting are available, but they are fragmented across separate systems for violation records, geographic areas, and enforcement data.
At DataEngite, we built a U.S. Water Utility Benchmarking Engine that ingests EPA datasets, normalizes them, and produces a consolidated utility-level benchmark with metrics on compliance, population served, water source types, median household income, and affordability of bills.
The Data Sources We Integrated
Our workflow consolidates several public datasets:
- EPA SDWA Public Water Systems – Utility IDs, names, population served, and source type (surface water vs. groundwater).
- EPA SDWA Violations & Enforcement – Historical compliance violations, including health-based standards, monitoring violations, and enforcement actions.
- EPA SDWA Geographic Areas – Service areas by county, city, and ZIP code.
- State Average Water Bills – Monthly average household bills by state.
- U.S. Census County Data – Median household income, county FIPS crosswalks, and socioeconomic context.
What We Built
- Staging Pipelines: Using Python + Snowflake, we built chunk-optimized loaders for millions of rows across EPA files.
- Normalization Rules: Standardized states (USPS codes), counties (removing suffixes like County/Parish), and cities (handling St., Ft., Santa variations).
- Geographic Crosswalks: Mapped utilities to counties and ZIPs using EPA service area files, ANSI codes, and census FIPS.
- Affordability Metrics: Combined average monthly bills with census median income to estimate % of household income spent on water.
- Compliance Risk Score: Derived from violation counts, scaled by population served, to quantify relative compliance risk across utilities.
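To make the normalization rules more concrete, here is a minimal Python sketch of how state, county, and city values could be reduced to consistent matching keys. It is an illustration under stated assumptions, not the production pipeline: the function names, the tiny `STATE_TO_USPS` lookup, and the choice to upper-case the keys are all hypothetical. Only the St./Ft. expansions are shown, because the exact rule for "Santa" variations is not specified here.

```python
import re

# Hypothetical, deliberately tiny lookup; a real pipeline would cover
# every state and territory.
STATE_TO_USPS = {
    "texas": "TX",
    "louisiana": "LA",
    "california": "CA",
    "missouri": "MO",
}

# Trailing "County" or "Parish", as mentioned in the normalization rules.
COUNTY_SUFFIX = re.compile(r"\s+(county|parish)$", re.IGNORECASE)

# Abbreviation expansions for city names (assumed mapping).
CITY_ABBREVIATIONS = {"st": "Saint", "ft": "Fort"}


def normalize_state(value: str) -> str:
    """Return a two-letter USPS code when one can be determined."""
    cleaned = value.strip()
    if len(cleaned) == 2 and cleaned.isalpha():
        return cleaned.upper()
    # Unknown names fall through unchanged (upper-cased) for later review.
    return STATE_TO_USPS.get(cleaned.lower(), cleaned.upper())


def normalize_county(value: str) -> str:
    """Collapse whitespace, drop a County/Parish suffix, upper-case the key."""
    collapsed = " ".join(value.split())
    return COUNTY_SUFFIX.sub("", collapsed).upper()


def normalize_city(value: str) -> str:
    """Expand St./Ft. abbreviations and upper-case the matching key."""
    words = value.replace(".", " ").split()
    expanded = [CITY_ABBREVIATIONS.get(word.lower(), word) for word in words]
    return " ".join(expanded).upper()
```

With keys like these, "St. Louis" and "Saint Louis" collapse to the same value, so records from different source files can be joined on city without a fuzzy match.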
Why These Metrics Matter
- Regulators: Identify and prioritize systemic compliance risks in high-population utilities.
- Utilities: Benchmark their performance against peers and highlight areas for improvement.
- Communities: Understand how water costs compare to median income, exposing affordability pressures.
- Researchers & Policy Makers: Analyze geographic inequities in water quality and affordability at scale.
Sample Metrics
- Compliance Risk Score: A normalized score (0–100) blending violation frequency and population served.
- Affordability Ratio: Average monthly bill ÷ median household income, expressed as % of income.
- Population Served: A standard baseline to compare utilities of very different sizes.
- Source Type Analysis: Compares compliance risk between groundwater and surface water systems.
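The two headline metrics can be sketched in a few lines of Python. The exact formulas are not given above, so everything in this block is an assumption made for illustration:

- The affordability ratio annualizes the monthly bill (× 12) so that it is on the same basis as annual median household income.
- The compliance risk score is computed as violations per 1,000 people served, then min-max scaled to 0–100 across the utilities being compared. Weighting by population instead of dividing by it would be an equally plausible reading of "scaled by population served".

The `Utility` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Utility:
    # Hypothetical record layout, for illustration only.
    pwsid: str
    population_served: int
    violation_count: int
    avg_monthly_bill: float         # USD per month (state average)
    median_household_income: float  # USD per year (county median)


def affordability_pct(u: Utility) -> float:
    """Annualized water bill as a percentage of annual household income.

    Assumption: the monthly bill is multiplied by 12 to match annual income.
    """
    if u.median_household_income <= 0:
        raise ValueError("median household income must be positive")
    return 100.0 * (u.avg_monthly_bill * 12) / u.median_household_income


def violations_per_1000(u: Utility) -> float:
    """Violation count per 1,000 people served (assumed scaling)."""
    if u.population_served <= 0:
        raise ValueError("population served must be positive")
    return 1000.0 * u.violation_count / u.population_served


def compliance_risk_scores(utilities: list[Utility]) -> dict[str, float]:
    """Min-max scale per-capita violation rates to a 0-100 score."""
    rates = {u.pwsid: violations_per_1000(u) for u in utilities}
    lowest, highest = min(rates.values()), max(rates.values())
    if highest == lowest:
        # All utilities have the same rate; no relative risk to report.
        return {pwsid: 0.0 for pwsid in rates}
    return {
        pwsid: 100.0 * (rate - lowest) / (highest - lowest)
        for pwsid, rate in rates.items()
    }
```

Because the score is min-max scaled, it is relative to the peer group: the same utility can score differently when benchmarked against a single state than against the national set.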
The Road Ahead
We plan to extend the model with:
- Machine learning to forecast future compliance risks.
- Integration of climate and drought data to assess vulnerability.
- Dashboards for state regulators and utilities to drill down from national benchmarks to local service areas.
By making these benchmarks accessible, DataEngite is committed to enabling transparent, data-driven water management.
