Phase 04 — Performance Optimization

From 9 Minutes to 2.6 Seconds

I added composite indexes, rewrote queries using CTEs and subquery pre-aggregation, audited and dropped 11 unused indexes, reviewed MySQL configuration settings, and resolved two regressions introduced by the index audit. Q4 dropped from 9 minutes 42 seconds to 2,619 ms. Q5 dropped from 4,658 ms to 165 ms.

Composite Indexes CTE Rewrites Index Audit 95.5% Faster (Q4) 96.5% Faster (Q5) Configuration Tuning

Index Optimization

Composite Indexes — Column Order Matters

Index 1 — fact_water_violations

I added a composite index on (is_health_based_ind, pwsid) to let the optimizer pre-filter health-based violations before joining to dim_water_system. The column order follows the ESR rule — Equality, Sort, Range — placing the equality filter column first. This index built in 1 minute 7 seconds over 15.3 million rows.

                    ALTER TABLE fact_water_violations
    ADD INDEX idx_health_pwsid (is_health_based_ind, pwsid);
                

Index 2 — dim_cafo_facility (Corrected)

My initial composite index on (state_id, impaired_waters) had the wrong column order — the optimizer ignored it entirely and chose a single-column index instead, leaving Q5 at 3.6 seconds. The fix was a second index with the columns flipped to (impaired_waters, state_id). Putting the equality filter column first allows the optimizer to satisfy both the WHERE clause and the GROUP BY from the index alone. The original index could not be dropped due to a foreign key constraint on state_id.

                    -- Corrected column order — impaired_waters leads as equality filter
ALTER TABLE dim_cafo_facility
    ADD INDEX idx_impaired_state (impaired_waters, state_id);
                

Query Rewrites

CTE & Subquery Pre-Aggregation

Q4 — CTE Pre-Filter

The original Q4 joined all 15.3 million violation rows to dim_water_system and dim_state before filtering. The rewritten query uses a CTE to pre-aggregate health-based violations by pwsid first — the optimizer hits the new index on is_health_based_ind, returns a small result set, then joins that small set to the dimension tables.

                    WITH health_violations AS (
    SELECT pwsid, COUNT(*) AS violation_count
    FROM fact_water_violations
    WHERE is_health_based_ind = 'Y'
    GROUP BY pwsid
)
SELECT
    s.state_name,
    COUNT(DISTINCT hv.pwsid) AS affected_systems,
    SUM(hv.violation_count)  AS total_violations
FROM health_violations hv
JOIN dim_water_system ws ON hv.pwsid    = ws.pwsid
JOIN dim_state s         ON ws.state_id = s.state_id
GROUP BY s.state_name
ORDER BY total_violations DESC;
                

Screenshot — Q4 optimized result

Q4 optimized query showing 2,619ms execution time

Q4 after optimization — 9 min 42 sec down to 2,619 ms. 95.5% improvement.

Q5 — Subquery Pre-Aggregation

Q5 was rewritten to pre-aggregate dim_cafo_facility by state before joining to dim_state. I also discovered during testing that the impaired_waters column stores '303(D) Listed' rather than 'Y' — confirmed with a DISTINCT value check after Q5 returned 0 rows on the first run.

                    SELECT s.state_name, cf.facility_count
FROM (
    SELECT state_id, COUNT(*) AS facility_count
    FROM dim_cafo_facility
    WHERE impaired_waters = '303(D) Listed'
    GROUP BY state_id
) cf
JOIN dim_state s ON cf.state_id = s.state_id
ORDER BY cf.facility_count DESC;
                

Screenshot — Q5 optimized result

Q5 optimized query showing 165ms execution time

Q5 after optimization — 4,658 ms down to 165 ms. 96.5% improvement.

Index Audit

57 Indexes Down to 46

Unused Index Identification

I queried sys.schema_unused_indexes and sys.schema_redundant_indexes to identify indexes consuming space without benefit. The audit found 13 unused indexes and 2 redundant indexes. I dropped 11 of the 13 unused indexes — 2 could not be dropped due to foreign key constraints — and both redundant indexes.

Screenshot — Unused indexes from sys schema

sys.schema_unused_indexes showing 13 unused indexes in cancer_environment_db

13 unused indexes identified — 11 dropped, 2 retained due to foreign key constraints

Regression Analysis

Dropping the 11 indexes caused two regressions — Q3 regressed from 293 ms to 903 ms and Q6 regressed from 244 ms to 18,114 ms. Investigation with EXPLAIN showed the root cause was query design, not just missing indexes. Both queries were joining two large fact tables before aggregating, creating a massive cross join. I rewrote both queries using subquery pre-aggregation and added a replacement composite index idx_state_question (state_id, question) on fact_chronic_disease_indicators.

Screenshot — Phase 4 scripts in VS Code

VS Code showing all 6 Phase 4 optimization scripts

6 optimization scripts — indexes, queries, audit, configuration, comparison, and regression fix

Final Results

Before & After All Six Queries

Query	Description	Before	After	Result
Q1	Cancer incidence trend by state	12 ms	23 ms	Small table variance
Q2	States with highest avg cancer rate	16 ms	8 ms	50% faster
Q3	Air quality vs cancer incidence	293 ms	58 ms	80% faster
Q4	Health-based water violations	9 min 42 sec	2,619 ms	95.5% faster
Q5	CAFO facilities near impaired waters	4,658 ms	165 ms	96.5% faster
Q6	Smoking prevalence vs cancer rate	244 ms	1,640 ms	Rewritten — regression resolved

Screenshot — Before / After comparison

Before and after performance comparison table for all six queries

Full before/after comparison — all six queries measured in DataGrip after Phase 4 optimization

95.5%

Q4 went from 9 minutes 42 seconds to 2,619 ms. One composite index combined with a CTE rewrite eliminated the full 15.3M row scan that was driving the original execution time.

Column order

The wrong column order in a composite index produced zero improvement. The first Q5 index — (state_id, impaired_waters) — was ignored by the optimizer. Flipping it to (impaired_waters, state_id) dropped Q5 from 3.6 seconds to 165 ms. Column order is not a detail.

Phase 3 — Diagnostics Phase 5 — Backup