An end-to-end Databricks medallion pipeline analyzing whether Medicare actually pays more for better hospital care. The answer, after building 13 Delta tables on 10 million CMS records and running the full bronze-silver-gold flow: not really. Hospital-level spending and mortality correlate at r = 0.033 — essentially zero.
Medicare is one of the largest healthcare payers in the United States, distributing roughly $94 billion in physician payments and hundreds of billions more to hospitals each year. A common assumption among policymakers, hospital systems, and the public is that higher Medicare spending buys better care. Is that actually true?
This project answers that question with publicly available CMS data using a production-grade data engineering pipeline. The work shows whether physician spending intensity at the hospital level corresponds to better mortality and patient-experience outcomes — and surfaces the structural quirks in Medicare billing data that anyone analyzing it should know about.
Build a fully reproducible Databricks medallion pipeline that ingests five raw CMS datasets totaling ~10 million records, models them through bronze-silver-gold transformations, and produces analytical answers to the core cost-vs-quality question — with every decision documented and every claim defensible.
All data comes from public CMS (Centers for Medicare & Medicaid Services) releases for 2023 — the most recent vintage available. No private datasets, no proprietary access. Anyone can reproduce this pipeline against the same files.
| Dataset | Purpose | File size / rows |
|---|---|---|
| Medicare Physician & Other Practitioners by Provider and Service | Provider-level Medicare payments across HCPCS codes and places of service | 2.85 GB / 9.66M rows |
| Hospital General Information | Hospital identifiers, addresses, ownership types, overall star ratings | 1.4 MB / 5,432 rows |
| Complications & Deaths (Hospital) | 30-day mortality measures across six clinical conditions | 22 MB / 95,840 rows |
| Unplanned Hospital Visits | Readmission rates and hospital-wide readmission measures | 18 MB / 67,088 rows |
| HCAHPS (Hospital Patient Experience) | Survey-based patient satisfaction and recommendation scores | 100 MB / 325,856 rows |
The medallion pattern is a layered approach to data engineering where each layer has a specific job. Raw data lands in bronze with minimal transformation. Silver applies modeling and business logic. Gold produces purpose-built analytical tables. Each layer is queryable independently, each is versioned via Delta Lake, and the full pipeline runs end-to-end in 5 minutes 26 seconds via Databricks Workflows.
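To make the layering concrete, here is a minimal plain-Python sketch of the three hand-offs. It is not the pipeline's Spark code: the field names (`npi`, `payment`, `source_file`, `spend_rank`), the file name, and the sample values are all hypothetical and chosen only for illustration.

```python
from statistics import median

# Hypothetical raw rows as they might come out of a CSV: every value is a string.
raw_rows = [
    {"npi": "A", "payment": "120.50"},
    {"npi": "A", "payment": "80.00"},
    {"npi": "B", "payment": "300.00"},
]

def to_bronze(rows, source_file):
    """Bronze: cast types and record lineage, but apply no business logic."""
    return [
        {"npi": r["npi"], "payment": float(r["payment"]), "source_file": source_file}
        for r in rows
    ]

def to_silver(bronze_rows):
    """Silver: model the data as one row per provider."""
    grouped = {}
    for r in bronze_rows:
        grouped.setdefault(r["npi"], []).append(r["payment"])
    return [
        {"npi": npi, "median_payment": median(vals), "n_services": len(vals)}
        for npi, vals in grouped.items()
    ]

def to_gold(silver_rows):
    """Gold: add a pre-computed rank (1 = highest median payment)."""
    ordered = sorted(silver_rows, key=lambda r: r["median_payment"], reverse=True)
    return [dict(r, spend_rank=i) for i, r in enumerate(ordered, start=1)]

bronze = to_bronze(raw_rows, "physician_2023.csv")  # hypothetical file name
silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Each function takes only the previous layer's output, which is what lets every layer be inspected or rebuilt on its own.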
Five CSV files ingested into Delta tables with schema enforcement, type casting, and lineage tracking. ~10M rows landed with full reproducibility.
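The ingestion behavior described above (schema enforcement, type casting, lineage tracking) can be sketched with the standard library alone. This is an illustrative stand-in, not the notebook's implementation: the two-column schema, the lineage column names `_source_file` and `_ingested_at`, and the sample rows are assumptions.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical expected schema: column name -> Python type to cast to.
EXPECTED_SCHEMA = {"facility_id": str, "hospital_rating": int}

def ingest_csv(text, source_file):
    """Parse CSV text, reject unexpected headers, cast types, attach lineage."""
    reader = csv.DictReader(io.StringIO(text))
    # Schema enforcement: the header must match the expected columns exactly.
    if reader.fieldnames != list(EXPECTED_SCHEMA):
        raise ValueError(f"schema mismatch in {source_file}: {reader.fieldnames}")
    ingested_at = datetime.now(timezone.utc).isoformat()
    rows = []
    for record in reader:
        # Type casting: identifiers stay strings so leading zeros survive.
        row = {col: cast(record[col]) for col, cast in EXPECTED_SCHEMA.items()}
        # Lineage: where the row came from and when it was loaded.
        row["_source_file"] = source_file
        row["_ingested_at"] = ingested_at
        rows.append(row)
    return rows

sample = "facility_id,hospital_rating\n010001,4\n010005,3\n"
rows = ingest_csv(sample, "hospital_general_information.csv")
print(rows[0]["facility_id"], rows[0]["hospital_rating"])
```

Failing fast on a header mismatch means a changed upstream file layout stops the load instead of silently shifting columns.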
Five silver tables including provider aggregation, hospital quality unification, and a tier-1 exact-match NPI ↔ CCN bridge connecting 62.6% of acute care hospitals.
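The source does not say which fields the exact-match bridge joins on, so the following is only a sketch of the general idea under an assumed key: normalized organization name plus 5-digit ZIP. The helper names, sample records, and the rule of keeping only unambiguous matches are all hypothetical.

```python
def normalize(name, zip_code):
    """Assumed match key: uppercased, whitespace-collapsed name plus 5-digit ZIP."""
    return (" ".join(name.upper().split()), zip_code[:5])

def build_bridge(npi_orgs, hospitals):
    """Link each NPI to a CCN only when the key matches exactly one hospital."""
    by_key = {}
    for h in hospitals:
        by_key.setdefault(normalize(h["name"], h["zip"]), []).append(h["ccn"])
    bridge = []
    for org in npi_orgs:
        ccns = by_key.get(normalize(org["name"], org["zip"]), [])
        if len(ccns) == 1:  # skip misses and ambiguous keys
            bridge.append({"npi": org["npi"], "ccn": ccns[0]})
    return bridge

# Made-up sample records for illustration only.
hospitals = [
    {"ccn": "010001", "name": "Example General Hospital", "zip": "36301"},
    {"ccn": "010005", "name": "Sample Medical Center", "zip": "35957"},
    {"ccn": "010006", "name": "Demo Regional", "zip": "35630"},
]
npi_orgs = [
    {"npi": "1111111111", "name": "example  general hospital", "zip": "36301-1234"},
    {"npi": "2222222222", "name": "Sample Medical Center", "zip": "35957"},
    {"npi": "3333333333", "name": "Unrelated Clinic", "zip": "35630"},
]

bridge = build_bridge(npi_orgs, hospitals)
# Coverage: share of hospitals that received at least one NPI link.
coverage = len({b["ccn"] for b in bridge}) / len(hospitals)
print(bridge, f"{coverage:.1%}")
```

Coverage is computed over hospitals rather than over NPIs, which matches how the 62.6% figure is phrased (a share of acute care hospitals).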
Four purpose-built gold tables with pre-computed ranks, percentiles, and value quadrants. Built on medians to defend against organizational-billing outliers.
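A small example shows why medians are the safer basis when a few organizational billers dominate, and how a four-quadrant label can be derived from median cut points. The numbers, the quadrant wording, and the choice of the sample medians as cut points are assumptions for illustration, not the gold tables' actual definitions.

```python
from statistics import mean, median

# Four typical payments plus one organizational-billing outlier (made-up values).
payments = [100.0, 110.0, 90.0, 105.0, 50_000.0]
print(mean(payments))    # 10081.0, dragged far upward by the single outlier
print(median(payments))  # 105.0, still describes a typical payment

def value_quadrant(spend, mortality, spend_cut, mortality_cut):
    """Label a hospital by cost and outcome relative to the two cut points."""
    cost = "high cost" if spend > spend_cut else "low cost"
    outcome = "worse outcomes" if mortality > mortality_cut else "better outcomes"
    return f"{cost}, {outcome}"

# Hypothetical hospitals: (median spend, mortality rate in percent).
hospitals = {
    "H1": (90.0, 11.0),
    "H2": (120.0, 14.0),
    "H3": (150.0, 10.5),
    "H4": (80.0, 15.0),
    "H5": (100.0, 12.0),
}
spend_cut = median(s for s, _ in hospitals.values())
mortality_cut = median(m for _, m in hospitals.values())

quadrants = {
    name: value_quadrant(s, m, spend_cut, mortality_cut)
    for name, (s, m) in hospitals.items()
}
for name, label in quadrants.items():
    print(name, label)
```

Lower mortality is treated as the better outcome here, so the most attractive quadrant is "low cost, better outcomes".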
Four analytical questions answered with code, charts, and honest interpretation. Surfaces the Arizona NP billing anomaly and the r = 0.033 headline correlation.
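For readers who want to see what the headline statistic measures, here is a minimal Pearson correlation written from its definition. The function name and toy inputs are illustrative; the notebook may well use a built-in Spark or library function instead.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)

# Toy data: a perfectly linear pair and a pair with no linear relationship.
linear = pearson_r([1, 2, 3, 4], [10, 20, 30, 40])
unrelated = pearson_r([1, 2, 3, 4], [1, -1, -1, 1])
print(linear, unrelated)

# Squaring r gives the share of variance explained by a linear fit.
r_headline = 0.033
print(round(r_headline ** 2, 4))  # 0.0011, about 0.1% of the variance
```

Read this way, r = 0.033 means that a straight-line relationship with spending accounts for roughly a tenth of a percent of the variation in hospital mortality.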
Interactive cost-vs-quality visualizations. Filter by state, drill into hospitals, see the four-quadrant story for yourself.
Full source code: notebooks, SQL, architecture documentation, screenshots, and complete commit history across five engineering phases.
The actual Databricks notebooks for each medallion layer. Code, output, and transformations, exactly as they ran in the pipeline.