Data Engineering Portfolio Project

Medicare Cost-vs-Quality Analysis

An end-to-end Databricks medallion pipeline analyzing whether Medicare actually pays more for better hospital care. The answer, after building 13 Delta tables on 10 million CMS records and running the full bronze-silver-gold flow: not really. Hospital-level spending and mortality correlate at r = 0.033 — essentially zero.

Databricks PySpark Delta Lake Unity Catalog Medallion Architecture CMS 2023 Data

Tableau Dashboard Source Code

Project Overview

The Question Behind the Pipeline

The Business Problem

Medicare is one of the largest healthcare payers in the United States, distributing roughly $94 billion in physician payments and trillions across hospitals each year. A common assumption — among policymakers, hospital systems, and the public — is that higher Medicare spending buys better care. Is that actually true?

This project answers that question with publicly available CMS data using a production-grade data engineering pipeline. The work shows whether physician spending intensity at the hospital level corresponds to better mortality and patient-experience outcomes — and surfaces the structural quirks in Medicare billing data that anyone analyzing it should know about.

Project Objective

Build a fully reproducible Databricks medallion pipeline that ingests five raw CMS datasets totaling ~10 million records, models them through bronze-silver-gold transformations, and produces analytical answers to the core cost-vs-quality question — with every decision documented and every claim defensible.

Data Sources

Five CMS Datasets, ~10 Million Rows

All data comes from public CMS (Centers for Medicare & Medicaid Services) releases for 2023 — the most recent vintage available. No private datasets, no proprietary access. Anyone can reproduce this pipeline against the same files.

Dataset	Purpose	Size
Medicare Physician & Other Practitioners by Provider and Service	Provider-level Medicare payments across HCPCS codes and places of service	2.85 GB / 9.66M rows
Hospital General Information	Hospital identifiers, addresses, ownership types, overall star ratings	1.4 MB / 5,432 rows
Complications & Deaths (Hospital)	30-day mortality measures across six clinical conditions	22 MB / 95,840 rows
Unplanned Hospital Visits	Readmission rates and hospital-wide readmission measures	18 MB / 67,088 rows
HCAHPS (Hospital Patient Experience)	Survey-based patient satisfaction and recommendation scores	100 MB / 325,856 rows

Medallion Architecture

Bronze → Silver → Gold → Analysis

The medallion pattern is a layered approach to data engineering where each layer has a specific job. Raw data lands in bronze with minimal transformation. Silver applies modeling and business logic. Gold produces purpose-built analytical tables. Each layer is queryable independently, each is versioned via Delta Lake, and the full pipeline runs end-to-end in 5 minutes 26 seconds via Databricks Workflows.

Click any layer below to view the corresponding notebook with all code, transformations, and verification queries.

Layer 01 — Bronze

Raw Ingestion

Five CSV files ingested into Delta tables with schema enforcement, type casting, and lineage tracking. ~10M rows landed with full reproducibility.

View Notebook

Layer 02 — Silver

Modeled & Conformed

Five silver tables including provider aggregation, hospital quality unification, and a tier-1 exact-match NPI ↔ CCN bridge connecting 62.6% of acute care hospitals.

View Notebook

Layer 03 — Gold

Analytical Aggregates

Four purpose-built gold tables with pre-computed ranks, percentiles, and value quadrants. Built on medians to defend against organizational-billing outliers.

View Notebook

Layer 04 — Analysis

Findings & Visualizations

Four analytical questions answered with code, charts, and honest interpretation. Surfaces the Arizona NP billing anomaly and the r = 0.033 headline correlation.

View Notebook

Technologies & Tools

Built with the Modern Data Stack

Databricks Lakehouse · Free Edition

PySpark Distributed transformation

Delta Lake ACID storage layer

Unity Catalog Catalog · schemas · governance

Databricks Workflows Pipeline orchestration

Databricks SQL Warehouse · ad-hoc queries

Databricks AI/BI Native dashboard

Tableau Public Public-facing dashboard

Python Analysis · matplotlib

Git & GitHub Version control · documentation

PowerShell Local automation · CLI

Databricks CLI Volume uploads · workspace access

Key Findings

What the Data Actually Shows

r = 0.033

At the hospital level, spending and mortality are essentially independent. Across 1,949 acute care hospitals with full data coverage, the correlation between physician spending intensity per beneficiary and composite mortality rate is nearly zero. The "you get what you pay for" intuition does not hold.

r = −0.118

At the state level, the relationship is weakly negative. States that spend more per beneficiary have, if anything, slightly lower average hospital quality. Arizona spends $170/bene with middling quality; Virginia spends $116/bene with the highest average star rating in the analysis.

~50%

About half of all US hospitals are value-mismatched. The four cost-quality quadrants — high cost/high quality, high cost/low quality, low cost/high quality, low cost/low quality — each contain roughly 400 hospitals. Spending and outcomes don't track together.

30.2%

Nurse Practitioner billing is concentrated in a small organizational cluster. The top 1% of NPs account for 30% of all NP Medicare spending — double the rate of physician specialties. Most of the top-ranked NPIs are organizational entities in Arizona, not individual practitioners. Documented as a deliberate finding, not filtered out.

Explore the Project

Three Ways In

Tableau Dashboard

Interactive cost-vs-quality visualizations. Filter by state, drill into hospitals, see the four-quadrant story for yourself.

Open Dashboard

GitHub Repository

Full source code — notebooks, SQL, architecture documentation, screenshots, and complete commit history across five engineering phases.

View Source

Pipeline Notebooks

The actual Databricks notebooks for each medallion layer. Code, output, transformations — exactly as they ran in the pipeline.

Start with Bronze