Phase 01 — Schema Design & Data Loading

Building the Foundation

I designed a normalized PostgreSQL schema for a healthcare analytics workload, loaded 9.6 million CMS Medicare records using a staged bulk-load approach, and validated the data before moving forward. Every design decision is documented with a rationale.

Schema Design COPY Bulk Load Data Normalization Data Validation 9.6M Rows

Overview

What I Did in Phase 1

The Dataset

I used the CMS Medicare Physician & Other Practitioners by Provider and Service dataset for 2023. Before writing a single line of SQL I inspected the raw CSV — 27 columns, 9.6 million rows, 3 GB on disk. Understanding the data structure first is what leads to good schema decisions.

The Approach

Rather than loading the raw file directly into a production table, I used a staged loading strategy — land everything into a staging table first, then deduplicate and transform into the final normalized tables. This gives full control over data quality before anything touches the curated schema.

9.66M

Raw Rows Loaded

1.17M

Unique Providers

57s

Staging Load Time

Data Integrity Errors

Schema Design

Design Decisions & Rationale

Why a Separate Schema?

I created a dedicated cms schema rather than using the default public schema. In production environments it is standard practice to isolate project objects in their own schema for clarity, access control, and maintainability. It also makes role-based permissions cleaner — you can grant access to an entire schema rather than individual tables.

Decision 01

Normalized Two-Table Design

The source data is flat — provider attributes repeat on every service row. I split this into cms.providers (one row per NPI) and cms.provider_services (one row per provider/procedure). This eliminates redundancy and reflects how a production DBA would model this data.

Decision 02

NPI as VARCHAR(10)

The NPI is stored as VARCHAR(10), not INTEGER or BIGINT. NPIs are identifiers — they are never used in calculations. Storing them as text avoids leading-zero issues and makes the intent clear to anyone reading the schema.

Decision 03

BIGSERIAL Surrogate Key

I added a BIGSERIAL surrogate key on provider_services because the natural key (NPI + HCPCS code) is not enforced as unique in the source data — a provider can appear multiple times for the same procedure code.

Decision 04

NUMERIC for Payment Columns

Payment and charge columns use NUMERIC(12,2) for exact decimal precision. FLOAT would introduce rounding errors for financial data. RUCA codes use NUMERIC(4,1) to accommodate decimal values in the source data.

Data Loading

Staged Bulk Load Strategy

Why Stage First?

Loading raw CSV data directly into a typed production table risks rejecting rows due to type mismatches or unexpected values. By staging everything as VARCHAR first, I could inspect the data, handle edge cases like empty strings in numeric columns using NULLIF, and deduplicate providers before inserting into the final tables.

The COPY Command

I used PostgreSQL's COPY command for the initial bulk load — significantly faster than row-by-row INSERT for large files. The 3 GB CSV file loaded into the staging table in 57 seconds.

                    COPY cms.staging_raw
FROM 'C:\temp\MUP_PHY_R25_P05_V20_D23_Prov_Svc.csv'
WITH (
    FORMAT CSV,
    HEADER true,
    DELIMITER ',',
    QUOTE '"',
    ENCODING 'WIN1252'
);
                

Load Results

Step	Rows	Duration
Staging load (COPY)	9,660,647	57 seconds
Providers insert (deduplicated)	1,175,281	42 seconds
Provider services insert	9,660,647	2 min 59 sec

Results

Data Loaded & Validated

Row Count Verification

After loading I verified row counts across all three tables to confirm the staging table and services table match exactly, and the providers table correctly deduplicated to unique NPIs.

Screenshot — Row counts across all tables

Row counts for staging_raw, providers, and provider_services tables

All three tables confirmed — 9,660,647 service rows, 1,175,281 unique providers

Screenshot — Table sizes after load

Table sizes showing staging_raw at 3.3GB, provider_services at 1.7GB, providers at 298MB

Total database size: 5,571 MB — staging_raw at 3.3 GB, provider_services at 1.7 GB, providers at 298 MB

Validation Results

Check	Result
Null NPIs in providers	0
Orphaned service rows	0
Null HCPCS codes	0
Null beneficiary counts	0
Min Medicare payment	$0.00
Max Medicare payment	$42,059.52
Avg Medicare payment	$82.99
Total Medicare payments	$801,745,767.09

Phase 2 — Diagnostics