Overview Phase 1 — Data Sourcing Phase 2 — Schema & Load Phase 3 — Diagnostics Phase 4 — Optimization Phase 5 — Backup Phase 6 — Security
Phase 03 — Baseline Measurement & Diagnostics

Measure Before You Optimize

Before touching a single index I measured everything — database size, table inventory, and execution times for six analytical queries. Q4 took 9 minutes and 42 seconds. That is the baseline. EXPLAIN plans identified the root cause for both problem queries before any changes were made.

EXPLAIN Execution Plans Query Timing Index Analysis 6 Baseline Queries

What I Did in Phase 3

The Baseline Principle

Optimizing without measuring is guessing. Before making any changes to indexes or query structure I ran every analytical query against the unoptimized database and recorded exact execution times. This gives a honest baseline to measure improvements against — and it forces you to understand what the optimizer is actually doing before you tell it to do something different.

Six Queries, Two Problems

I designed six analytical queries covering the main research questions — cancer trends, air quality correlations, water violation counts, CAFO facility proximity, and smoking prevalence. Four queries performed acceptably. Two did not — Q4 at 9 minutes 42 seconds and Q5 at 4,658 ms. Both were diagnosed with EXPLAIN before any fixes were attempted.

6
Baseline Queries
9m 42s
Q4 Baseline
4,658ms
Q5 Baseline
15.3M
Rows Scanned (Q4)

Six Queries Before Optimization

Execution Times

All six queries were run in DataGrip against the unoptimized database with default MySQL 8.0 configuration and no custom indexes beyond the primary keys and foreign key indexes created during schema design.

QueryDescriptionExecution TimeStatus
Q1Cancer incidence trend by state12 msAcceptable
Q2States with highest average cancer rate16 msAcceptable
Q3Air quality vs cancer incidence by state293 msAcceptable
Q4Health-based water violations by state9 min 42 secCritical
Q5CAFO facilities near impaired waters4,658 msProblem
Q6Smoking prevalence vs cancer rate244 msAcceptable

EXPLAIN Analysis

Q4 — The 9-Minute Query

EXPLAIN revealed that Q4 was performing a full nested-loop scan through all 15.3 million rows in fact_water_violations. The is_health_based_ind = 'Y' filter was being applied only after the join to dim_water_system and dim_state — meaning MySQL was joining the entire 15.3M row table before filtering it down to the health-based subset. The fix required a composite index on (is_health_based_ind, pwsid) plus a CTE rewrite to pre-filter before joining.

Screenshot — Q4 baseline EXPLAIN output
EXPLAIN output showing full table scan through 15.3M rows in fact_water_violations
Q4 EXPLAIN — type: ALL on fact_water_violations, 15.3M rows, is_health_based_ind filter applied after the join

Q5 — The 4.6-Second CAFO Query

EXPLAIN showed Q5 was scanning all 1.18 million rows in dim_cafo_facility despite an existing state_id index. The optimizer ignored it because impaired_waters had very low cardinality — only two distinct values. With no index covering impaired_waters, the optimizer chose a full scan over the existing index. The fix required a composite index on (impaired_waters, state_id) with the correct column order.

ALL
Q4 EXPLAIN type was ALL — a full table scan. MySQL was reading every one of the 15.3 million violation rows in fact_water_violations before applying any filter. No index existed on is_health_based_ind or on the combination of is_health_based_ind and pwsid that the query needed.
ref
Q5 EXPLAIN showed the optimizer bypassing the state_id index entirely. Low cardinality on impaired_waters (only 2 distinct values across 1.18M rows) made the optimizer prefer a full scan over the existing single-column index. Column order in the new composite index would determine whether it was used at all.

What the Baseline Told Me

Before vs After — The Targets

Phase 3 established the exact targets for Phase 4. Every optimization decision in Phase 4 traces back to what EXPLAIN revealed here. The two problem queries had clear, diagnosable root causes — and the diagnosis came before any fix was attempted.

QueryRoot Cause IdentifiedFix Planned
Q4Full scan on 15.3M rows, filter applied after joinComposite index + CTE rewrite
Q5State_id index bypassed due to low cardinality on impaired_watersComposite index with correct column order