I designed a star schema with 4 dimension tables and 10 fact tables, created the database with utf8mb4 charset and InnoDB storage engine, and loaded 22.7 million rows across all 14 tables using SQL LOAD DATA and Python scripts. Every schema decision is documented with a rationale.
I chose a star schema because the research question — correlating cancer rates with environmental and lifestyle factors — is fundamentally analytical. Dimension tables hold stable reference data while fact tables hold measurements and events. This separation keeps joins predictable and query patterns clean.
I created cancer_environment_db with utf8mb4 charset and utf8mb4_unicode_ci collation to support the full Unicode range including special characters in geographic and scientific names. InnoDB was the only storage engine considered — ACID compliance and foreign key support are non-negotiable for a relational database of this complexity.
The schema uses four dimension tables — dim_state, dim_county, dim_year, and dim_water_system — as the stable anchors, with ten fact tables radiating outward for each measurement domain.
I wrote a separate load script for each table — 16 scripts total across Phase 2. Dimension tables were loaded first to establish the reference data, then fact tables in dependency order. For large fact tables like fact_water_violations (15.3M rows) and fact_livestock_operations (3.2M rows), I used LOAD DATA LOCAL INFILE which is significantly faster than row-by-row INSERT.
-- 12_load_water_violations.sql
LOAD DATA LOCAL INFILE 'data/raw/water-quality/SDWA_VIOLATIONS_ENFORCEMENT.csv'
INTO TABLE fact_water_violations
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS
(pwsid, violation_id, violation_category_code,
is_health_based_ind, contaminant_code,
non_compl_per_begin_date, non_compl_per_end_date);
| Table | Rows Loaded | Method |
|---|---|---|
| fact_water_violations | 15,298,031 | LOAD DATA LOCAL INFILE |
| fact_livestock_operations | 3,339,228 | LOAD DATA LOCAL INFILE |
| dim_cafo_facility | 1,188,507 | LOAD DATA LOCAL INFILE |
| fact_food_environment | 930,317 | LOAD DATA LOCAL INFILE |
| fact_cafo_sic_codes | 784,937 | LOAD DATA LOCAL INFILE |
| fact_cafo_violations | 395,599 | LOAD DATA LOCAL INFILE |
| dim_water_system | 433,698 | Python / LOAD DATA |
| fact_chronic_disease_indicators | 375,987 | LOAD DATA LOCAL INFILE |
| fact_air_quality | 23,100 | Python (23 annual files) |
| dim_county | 3,079 | SQL INSERT |
| fact_cancer_incidence | 1,218 | SQL INSERT |
| fact_cancer_mortality | 306 | SQL INSERT |
| dim_state | 51 | SQL INSERT |
| dim_year | 47 | SQL INSERT |
After all 14 tables were loaded I verified row counts and database size to confirm the load completed correctly and the total size matched expectations for a 22.7 million row dataset.