Data Engineering Portfolio Project

Medicare Cost-vs-Quality Analysis

An end-to-end Databricks medallion pipeline analyzing whether Medicare actually pays more for better hospital care. The answer, after building 13 Delta tables on 10 million CMS records and running the full bronze-silver-gold flow: not really. Hospital-level spending and mortality correlate at r = 0.033 — essentially zero.

Databricks PySpark Delta Lake Unity Catalog Medallion Architecture CMS 2023 Data

The Question Behind the Pipeline

The Business Problem

Medicare is one of the largest healthcare payers in the United States, distributing roughly $94 billion in physician payments and trillions across hospitals each year. A common assumption — among policymakers, hospital systems, and the public — is that higher Medicare spending buys better care. Is that actually true?

This project answers that question with publicly available CMS data using a production-grade data engineering pipeline. The work shows whether physician spending intensity at the hospital level corresponds to better mortality and patient-experience outcomes — and surfaces the structural quirks in Medicare billing data that anyone analyzing it should know about.

Project Objective

Build a fully reproducible Databricks medallion pipeline that ingests five raw CMS datasets totaling ~10 million records, models them through bronze-silver-gold transformations, and produces analytical answers to the core cost-vs-quality question — with every decision documented and every claim defensible.

Five CMS Datasets, ~10 Million Rows

All data comes from public CMS (Centers for Medicare & Medicaid Services) releases for 2023 — the most recent vintage available. No private datasets, no proprietary access. Anyone can reproduce this pipeline against the same files.

Dataset Purpose Size
Medicare Physician & Other Practitioners by Provider and Service Provider-level Medicare payments across HCPCS codes and places of service 2.85 GB / 9.66M rows
Hospital General Information Hospital identifiers, addresses, ownership types, overall star ratings 1.4 MB / 5,432 rows
Complications & Deaths (Hospital) 30-day mortality measures across six clinical conditions 22 MB / 95,840 rows
Unplanned Hospital Visits Readmission rates and hospital-wide readmission measures 18 MB / 67,088 rows
HCAHPS (Hospital Patient Experience) Survey-based patient satisfaction and recommendation scores 100 MB / 325,856 rows

Bronze → Silver → Gold → Analysis

The medallion pattern is a layered approach to data engineering where each layer has a specific job. Raw data lands in bronze with minimal transformation. Silver applies modeling and business logic. Gold produces purpose-built analytical tables. Each layer is queryable independently, each is versioned via Delta Lake, and the full pipeline runs end-to-end in 5 minutes 26 seconds via Databricks Workflows.

Click any layer below to view the corresponding notebook with all code, transformations, and verification queries.

Built with the Modern Data Stack

Databricks Lakehouse · Free Edition
PySpark Distributed transformation
Delta Lake ACID storage layer
Unity Catalog Catalog · schemas · governance
Databricks Workflows Pipeline orchestration
Databricks SQL Warehouse · ad-hoc queries
Databricks AI/BI Native dashboard
Tableau Public Public-facing dashboard
Python Analysis · matplotlib
Git & GitHub Version control · documentation
PowerShell Local automation · CLI
Databricks CLI Volume uploads · workspace access

What the Data Actually Shows

r = 0.033
At the hospital level, spending and mortality are essentially independent. Across 1,949 acute care hospitals with full data coverage, the correlation between physician spending intensity per beneficiary and composite mortality rate is nearly zero. The "you get what you pay for" intuition does not hold.
r = −0.118
At the state level, the relationship is weakly negative. States that spend more per beneficiary have, if anything, slightly lower average hospital quality. Arizona spends $170/bene with middling quality; Virginia spends $116/bene with the highest average star rating in the analysis.
~50%
About half of all US hospitals are value-mismatched. The four cost-quality quadrants — high cost/high quality, high cost/low quality, low cost/high quality, low cost/low quality — each contain roughly 400 hospitals. Spending and outcomes don't track together.
30.2%
Nurse Practitioner billing is concentrated in a small organizational cluster. The top 1% of NPs account for 30% of all NP Medicare spending — double the rate of physician specialties. Most of the top-ranked NPIs are organizational entities in Arizona, not individual practitioners. Documented as a deliberate finding, not filtered out.

Three Ways In