I designed a normalized PostgreSQL schema for a healthcare analytics workload, loaded 9.6 million CMS Medicare records using a staged bulk-load approach, and validated the data before moving forward. Every design decision is documented with a rationale.
I used the CMS Medicare Physician & Other Practitioners by Provider and Service dataset for 2023. Before writing a single line of SQL I inspected the raw CSV — 27 columns, 9.6 million rows, 3 GB on disk. Understanding the data structure first is what leads to good schema decisions.
Rather than loading the raw file directly into a production table, I used a staged loading strategy — land everything into a staging table first, then deduplicate and transform into the final normalized tables. This gives full control over data quality before anything touches the curated schema.
I created a dedicated cms schema rather than using the default public schema. In production environments it is standard practice to isolate project objects in their own schema for clarity, access control, and maintainability. It also makes role-based permissions cleaner — you can grant access to an entire schema rather than individual tables.
cms.providers (one row per NPI) and cms.provider_services (one row per provider/procedure). This eliminates redundancy and reflects how a production DBA would model this data.provider_services because the natural key (NPI + HCPCS code) is not enforced as unique in the source data — a provider can appear multiple times for the same procedure code.Loading raw CSV data directly into a typed production table risks rejecting rows due to type mismatches or unexpected values. By staging everything as VARCHAR first, I could inspect the data, handle edge cases like empty strings in numeric columns using NULLIF, and deduplicate providers before inserting into the final tables.
I used PostgreSQL's COPY command for the initial bulk load — significantly faster than row-by-row INSERT for large files. The 3 GB CSV file loaded into the staging table in 57 seconds.
COPY cms.staging_raw
FROM 'C:\temp\MUP_PHY_R25_P05_V20_D23_Prov_Svc.csv'
WITH (
FORMAT CSV,
HEADER true,
DELIMITER ',',
QUOTE '"',
ENCODING 'WIN1252'
);
| Step | Rows | Duration |
|---|---|---|
| Staging load (COPY) | 9,660,647 | 57 seconds |
| Providers insert (deduplicated) | 1,175,281 | 42 seconds |
| Provider services insert | 9,660,647 | 2 min 59 sec |
After loading I verified row counts across all three tables to confirm the staging table and services table match exactly, and the providers table correctly deduplicated to unique NPIs.
| Check | Result |
|---|---|
| Null NPIs in providers | 0 |
| Orphaned service rows | 0 |
| Null HCPCS codes | 0 |
| Null beneficiary counts | 0 |
| Min Medicare payment | $0.00 |
| Max Medicare payment | $42,059.52 |
| Avg Medicare payment | $82.99 |
| Total Medicare payments | $801,745,767.09 |