Phase 3 — Baseline Measurement & Diagnostics

Overview

What I Did in Phase 3

The Baseline Principle

Optimizing without measuring is guessing. Before making any changes to indexes or query structure I ran every analytical query against the unoptimized database and recorded exact execution times. This gives a honest baseline to measure improvements against — and it forces you to understand what the optimizer is actually doing before you tell it to do something different.

Six Queries, Two Problems

I designed six analytical queries covering the main research questions — cancer trends, air quality correlations, water violation counts, CAFO facility proximity, and smoking prevalence. Four queries performed acceptably. Two did not — Q4 at 9 minutes 42 seconds and Q5 at 4,658 ms. Both were diagnosed with EXPLAIN before any fixes were attempted.

Baseline Queries

9m 42s

Q4 Baseline

4,658ms

Q5 Baseline

15.3M

Rows Scanned (Q4)

Baseline Results

Six Queries Before Optimization

Execution Times

All six queries were run in DataGrip against the unoptimized database with default MySQL 8.0 configuration and no custom indexes beyond the primary keys and foreign key indexes created during schema design.

Query	Description	Execution Time	Status
Q1	Cancer incidence trend by state	12 ms	Acceptable
Q2	States with highest average cancer rate	16 ms	Acceptable
Q3	Air quality vs cancer incidence by state	293 ms	Acceptable
Q4	Health-based water violations by state	9 min 42 sec	Critical
Q5	CAFO facilities near impaired waters	4,658 ms	Problem
Q6	Smoking prevalence vs cancer rate	244 ms	Acceptable

Diagnostics

EXPLAIN Analysis

Q4 — The 9-Minute Query

EXPLAIN revealed that Q4 was performing a full nested-loop scan through all 15.3 million rows in fact_water_violations. The is_health_based_ind = 'Y' filter was being applied only after the join to dim_water_system and dim_state — meaning MySQL was joining the entire 15.3M row table before filtering it down to the health-based subset. The fix required a composite index on (is_health_based_ind, pwsid) plus a CTE rewrite to pre-filter before joining.

Screenshot — Q4 baseline EXPLAIN output

EXPLAIN output showing full table scan through 15.3M rows in fact_water_violations

Q4 EXPLAIN — type: ALL on fact_water_violations, 15.3M rows, is_health_based_ind filter applied after the join

Q5 — The 4.6-Second CAFO Query

EXPLAIN showed Q5 was scanning all 1.18 million rows in dim_cafo_facility despite an existing state_id index. The optimizer ignored it because impaired_waters had very low cardinality — only two distinct values. With no index covering impaired_waters, the optimizer chose a full scan over the existing index. The fix required a composite index on (impaired_waters, state_id) with the correct column order.

ALL

Q4 EXPLAIN type was ALL — a full table scan. MySQL was reading every one of the 15.3 million violation rows in fact_water_violations before applying any filter. No index existed on is_health_based_ind or on the combination of is_health_based_ind and pwsid that the query needed.

ref

Q5 EXPLAIN showed the optimizer bypassing the state_id index entirely. Low cardinality on impaired_waters (only 2 distinct values across 1.18M rows) made the optimizer prefer a full scan over the existing single-column index. Column order in the new composite index would determine whether it was used at all.

Phase Summary

What the Baseline Told Me

Before vs After — The Targets

Phase 3 established the exact targets for Phase 4. Every optimization decision in Phase 4 traces back to what EXPLAIN revealed here. The two problem queries had clear, diagnosable root causes — and the diagnosis came before any fix was attempted.

Query	Root Cause Identified	Fix Planned
Q4	Full scan on 15.3M rows, filter applied after join	Composite index + CTE rewrite
Q5	State_id index bypassed due to low cardinality on impaired_waters	Composite index with correct column order

Phase 2 — Schema & Load Phase 4 — Optimization

Measure Before You Optimize