Before touching a single index I measured everything — database size, table inventory, and execution times for six analytical queries. Q4 took 9 minutes and 42 seconds. That is the baseline. EXPLAIN plans identified the root cause for both problem queries before any changes were made.
Optimizing without measuring is guessing. Before making any changes to indexes or query structure I ran every analytical query against the unoptimized database and recorded exact execution times. This gives a honest baseline to measure improvements against — and it forces you to understand what the optimizer is actually doing before you tell it to do something different.
I designed six analytical queries covering the main research questions — cancer trends, air quality correlations, water violation counts, CAFO facility proximity, and smoking prevalence. Four queries performed acceptably. Two did not — Q4 at 9 minutes 42 seconds and Q5 at 4,658 ms. Both were diagnosed with EXPLAIN before any fixes were attempted.
All six queries were run in DataGrip against the unoptimized database with default MySQL 8.0 configuration and no custom indexes beyond the primary keys and foreign key indexes created during schema design.
| Query | Description | Execution Time | Status |
|---|---|---|---|
| Q1 | Cancer incidence trend by state | 12 ms | Acceptable |
| Q2 | States with highest average cancer rate | 16 ms | Acceptable |
| Q3 | Air quality vs cancer incidence by state | 293 ms | Acceptable |
| Q4 | Health-based water violations by state | 9 min 42 sec | Critical |
| Q5 | CAFO facilities near impaired waters | 4,658 ms | Problem |
| Q6 | Smoking prevalence vs cancer rate | 244 ms | Acceptable |
EXPLAIN revealed that Q4 was performing a full nested-loop scan through all 15.3 million rows in fact_water_violations. The is_health_based_ind = 'Y' filter was being applied only after the join to dim_water_system and dim_state — meaning MySQL was joining the entire 15.3M row table before filtering it down to the health-based subset. The fix required a composite index on (is_health_based_ind, pwsid) plus a CTE rewrite to pre-filter before joining.
EXPLAIN showed Q5 was scanning all 1.18 million rows in dim_cafo_facility despite an existing state_id index. The optimizer ignored it because impaired_waters had very low cardinality — only two distinct values. With no index covering impaired_waters, the optimizer chose a full scan over the existing index. The fix required a composite index on (impaired_waters, state_id) with the correct column order.
Phase 3 established the exact targets for Phase 4. Every optimization decision in Phase 4 traces back to what EXPLAIN revealed here. The two problem queries had clear, diagnosable root causes — and the diagnosis came before any fix was attempted.
| Query | Root Cause Identified | Fix Planned |
|---|---|---|
| Q4 | Full scan on 15.3M rows, filter applied after join | Composite index + CTE rewrite |
| Q5 | State_id index bypassed due to low cardinality on impaired_waters | Composite index with correct column order |