I added composite indexes, rewrote queries using CTEs and subquery pre-aggregation, audited and dropped 11 unused indexes, reviewed MySQL configuration settings, and resolved two regressions introduced by the index audit. Q4 dropped from 9 minutes 42 seconds to 2,619 ms. Q5 dropped from 4,658 ms to 165 ms.
I added a composite index on (is_health_based_ind, pwsid) to let the optimizer pre-filter health-based violations before joining to dim_water_system. The column order follows the ESR rule — Equality, Sort, Range — placing the equality filter column first. This index built in 1 minute 7 seconds over 15.3 million rows.
ALTER TABLE fact_water_violations
ADD INDEX idx_health_pwsid (is_health_based_ind, pwsid);
My initial composite index on (state_id, impaired_waters) had the wrong column order — the optimizer ignored it entirely and chose a single-column index instead, leaving Q5 at 3.6 seconds. The fix was a second index with the columns flipped to (impaired_waters, state_id). Putting the equality filter column first allows the optimizer to satisfy both the WHERE clause and the GROUP BY from the index alone. The original index could not be dropped due to a foreign key constraint on state_id.
-- Corrected column order — impaired_waters leads as equality filter
ALTER TABLE dim_cafo_facility
ADD INDEX idx_impaired_state (impaired_waters, state_id);
The original Q4 joined all 15.3 million violation rows to dim_water_system and dim_state before filtering. The rewritten query uses a CTE to pre-aggregate health-based violations by pwsid first — the optimizer hits the new index on is_health_based_ind, returns a small result set, then joins that small set to the dimension tables.
WITH health_violations AS (
SELECT pwsid, COUNT(*) AS violation_count
FROM fact_water_violations
WHERE is_health_based_ind = 'Y'
GROUP BY pwsid
)
SELECT
s.state_name,
COUNT(DISTINCT hv.pwsid) AS affected_systems,
SUM(hv.violation_count) AS total_violations
FROM health_violations hv
JOIN dim_water_system ws ON hv.pwsid = ws.pwsid
JOIN dim_state s ON ws.state_id = s.state_id
GROUP BY s.state_name
ORDER BY total_violations DESC;
Q5 was rewritten to pre-aggregate dim_cafo_facility by state before joining to dim_state. I also discovered during testing that the impaired_waters column stores '303(D) Listed' rather than 'Y' — confirmed with a DISTINCT value check after Q5 returned 0 rows on the first run.
SELECT s.state_name, cf.facility_count
FROM (
SELECT state_id, COUNT(*) AS facility_count
FROM dim_cafo_facility
WHERE impaired_waters = '303(D) Listed'
GROUP BY state_id
) cf
JOIN dim_state s ON cf.state_id = s.state_id
ORDER BY cf.facility_count DESC;
I queried sys.schema_unused_indexes and sys.schema_redundant_indexes to identify indexes consuming space without benefit. The audit found 13 unused indexes and 2 redundant indexes. I dropped 11 of the 13 unused indexes — 2 could not be dropped due to foreign key constraints — and both redundant indexes.
Dropping the 11 indexes caused two regressions — Q3 regressed from 293 ms to 903 ms and Q6 regressed from 244 ms to 18,114 ms. Investigation with EXPLAIN showed the root cause was query design, not just missing indexes. Both queries were joining two large fact tables before aggregating, creating a massive cross join. I rewrote both queries using subquery pre-aggregation and added a replacement composite index idx_state_question (state_id, question) on fact_chronic_disease_indicators.
| Query | Description | Before | After | Result |
|---|---|---|---|---|
| Q1 | Cancer incidence trend by state | 12 ms | 23 ms | Small table variance |
| Q2 | States with highest avg cancer rate | 16 ms | 8 ms | 50% faster |
| Q3 | Air quality vs cancer incidence | 293 ms | 58 ms | 80% faster |
| Q4 | Health-based water violations | 9 min 42 sec | 2,619 ms | 95.5% faster |
| Q5 | CAFO facilities near impaired waters | 4,658 ms | 165 ms | 96.5% faster |
| Q6 | Smoking prevalence vs cancer rate | 244 ms | 1,640 ms | Rewritten — regression resolved |