Overview Phase 1 — Data Sourcing Phase 2 — Schema & Load Phase 3 — Diagnostics Phase 4 — Optimization Phase 5 — Backup Phase 6 — Security
Phase 04 — Performance Optimization

From 9 Minutes to 2.6 Seconds

I added composite indexes, rewrote queries using CTEs and subquery pre-aggregation, audited and dropped 11 unused indexes, reviewed MySQL configuration settings, and resolved two regressions introduced by the index audit. Q4 dropped from 9 minutes 42 seconds to 2,619 ms. Q5 dropped from 4,658 ms to 165 ms.

Composite Indexes CTE Rewrites Index Audit 95.5% Faster (Q4) 96.5% Faster (Q5) Configuration Tuning

Composite Indexes — Column Order Matters

Index 1 — fact_water_violations

I added a composite index on (is_health_based_ind, pwsid) to let the optimizer pre-filter health-based violations before joining to dim_water_system. The column order follows the ESR rule — Equality, Sort, Range — placing the equality filter column first. This index built in 1 minute 7 seconds over 15.3 million rows.

ALTER TABLE fact_water_violations ADD INDEX idx_health_pwsid (is_health_based_ind, pwsid);

Index 2 — dim_cafo_facility (Corrected)

My initial composite index on (state_id, impaired_waters) had the wrong column order — the optimizer ignored it entirely and chose a single-column index instead, leaving Q5 at 3.6 seconds. The fix was a second index with the columns flipped to (impaired_waters, state_id). Putting the equality filter column first allows the optimizer to satisfy both the WHERE clause and the GROUP BY from the index alone. The original index could not be dropped due to a foreign key constraint on state_id.

-- Corrected column order — impaired_waters leads as equality filter ALTER TABLE dim_cafo_facility ADD INDEX idx_impaired_state (impaired_waters, state_id);

CTE & Subquery Pre-Aggregation

Q4 — CTE Pre-Filter

The original Q4 joined all 15.3 million violation rows to dim_water_system and dim_state before filtering. The rewritten query uses a CTE to pre-aggregate health-based violations by pwsid first — the optimizer hits the new index on is_health_based_ind, returns a small result set, then joins that small set to the dimension tables.

WITH health_violations AS ( SELECT pwsid, COUNT(*) AS violation_count FROM fact_water_violations WHERE is_health_based_ind = 'Y' GROUP BY pwsid ) SELECT s.state_name, COUNT(DISTINCT hv.pwsid) AS affected_systems, SUM(hv.violation_count) AS total_violations FROM health_violations hv JOIN dim_water_system ws ON hv.pwsid = ws.pwsid JOIN dim_state s ON ws.state_id = s.state_id GROUP BY s.state_name ORDER BY total_violations DESC;
Screenshot — Q4 optimized result
Q4 optimized query showing 2,619ms execution time
Q4 after optimization — 9 min 42 sec down to 2,619 ms. 95.5% improvement.

Q5 — Subquery Pre-Aggregation

Q5 was rewritten to pre-aggregate dim_cafo_facility by state before joining to dim_state. I also discovered during testing that the impaired_waters column stores '303(D) Listed' rather than 'Y' — confirmed with a DISTINCT value check after Q5 returned 0 rows on the first run.

SELECT s.state_name, cf.facility_count FROM ( SELECT state_id, COUNT(*) AS facility_count FROM dim_cafo_facility WHERE impaired_waters = '303(D) Listed' GROUP BY state_id ) cf JOIN dim_state s ON cf.state_id = s.state_id ORDER BY cf.facility_count DESC;
Screenshot — Q5 optimized result
Q5 optimized query showing 165ms execution time
Q5 after optimization — 4,658 ms down to 165 ms. 96.5% improvement.

57 Indexes Down to 46

Unused Index Identification

I queried sys.schema_unused_indexes and sys.schema_redundant_indexes to identify indexes consuming space without benefit. The audit found 13 unused indexes and 2 redundant indexes. I dropped 11 of the 13 unused indexes — 2 could not be dropped due to foreign key constraints — and both redundant indexes.

Screenshot — Unused indexes from sys schema
sys.schema_unused_indexes showing 13 unused indexes in cancer_environment_db
13 unused indexes identified — 11 dropped, 2 retained due to foreign key constraints

Regression Analysis

Dropping the 11 indexes caused two regressions — Q3 regressed from 293 ms to 903 ms and Q6 regressed from 244 ms to 18,114 ms. Investigation with EXPLAIN showed the root cause was query design, not just missing indexes. Both queries were joining two large fact tables before aggregating, creating a massive cross join. I rewrote both queries using subquery pre-aggregation and added a replacement composite index idx_state_question (state_id, question) on fact_chronic_disease_indicators.

Screenshot — Phase 4 scripts in VS Code
VS Code showing all 6 Phase 4 optimization scripts
6 optimization scripts — indexes, queries, audit, configuration, comparison, and regression fix

Before & After All Six Queries

QueryDescriptionBeforeAfterResult
Q1Cancer incidence trend by state12 ms23 msSmall table variance
Q2States with highest avg cancer rate16 ms8 ms50% faster
Q3Air quality vs cancer incidence293 ms58 ms80% faster
Q4Health-based water violations9 min 42 sec2,619 ms95.5% faster
Q5CAFO facilities near impaired waters4,658 ms165 ms96.5% faster
Q6Smoking prevalence vs cancer rate244 ms1,640 msRewritten — regression resolved
Screenshot — Before / After comparison
Before and after performance comparison table for all six queries
Full before/after comparison — all six queries measured in DataGrip after Phase 4 optimization
95.5%
Q4 went from 9 minutes 42 seconds to 2,619 ms. One composite index combined with a CTE rewrite eliminated the full 15.3M row scan that was driving the original execution time.
Column order
The wrong column order in a composite index produced zero improvement. The first Q5 index — (state_id, impaired_waters) — was ignored by the optimizer. Flipping it to (impaired_waters, state_id) dropped Q5 from 3.6 seconds to 165 ms. Column order is not a detail.