An end-to-end MySQL database administration project I built on publicly available environmental and public health datasets. I designed a star schema across 14 tables, loaded 22.7 million rows of cancer, air quality, water quality, food environment, and agricultural data, diagnosed and resolved critical performance problems, implemented a full backup and recovery strategy, and secured the database with role-based access control.
I wanted to build a MySQL project that covers the full scope of database administration — not just queries. Schema design, bulk data loading, performance baselining, index optimization, backup and recovery, and security aren't isolated skills. They form a complete discipline, and this project treats them that way.
I used eight publicly available datasets from the EPA, CDC, and USDA to explore the relationship between cancer rates across the United States and contributing environmental, lifestyle, food, and water quality factors. The database covers cancer incidence and mortality, air quality, drinking water violations, food environment, chronic disease indicators, agricultural operations, and CAFO facility data — 22.7 million rows across 14 tables in a star schema design.
cancer_environment_db — 14 tables, 22.7M rows, 4.57 GB total
I identified, verified, and downloaded eight datasets from EPA, CDC, and USDA. I wrote Python download scripts for EPA AQI data across 23 years and SDWIS water quality data, configured .gitignore to exclude raw data files, and documented every source with scientific references.
Phase CompleteI designed a star schema with 4 dimension tables and 10 fact tables, created the database with utf8mb4 charset, and loaded 22.7 million rows across all 14 tables using a combination of SQL LOAD DATA and Python scripts. Every design decision is documented with rationale.
Phase CompleteBefore optimizing anything I measured everything — database size, table inventory, and execution times for six representative queries. Q4 took 9 minutes 42 seconds. That is the baseline. EXPLAIN plans identified the root cause for both problem queries before I touched a single index.
Phase CompleteI added composite indexes, rewrote queries using CTEs and subquery pre-aggregation, audited and dropped 11 unused indexes, and reviewed MySQL configuration settings. Q4 dropped from 9 minutes 42 seconds to 2,619 ms. Q5 dropped from 4,658 ms to 165 ms.
Phase CompleteI implemented a mysqldump backup strategy using --single-transaction and --source-data=2 for consistent InnoDB snapshots with binary log position recording. I produced a 2.18 GB full backup, dropped the entire database to simulate data loss, and restored all 22.7 million rows from the backup file.
Phase CompleteI designed and implemented role-based access control with three least-privilege roles — db_readonly, db_analyst, and db_etl — and three corresponding users. Every user is scoped to localhost only, and no application user holds global privileges or schema modification rights.
Phase CompleteI chose a star schema because the research question — correlating cancer rates with environmental and lifestyle factors — is fundamentally analytical. Dimension tables hold stable reference data (states, counties, years, water systems) while fact tables hold measurements and events. This separation keeps joins clean and query patterns predictable.
The most significant design decision was building dim_county from the USDA Census of Agriculture rather than a generic county list. This means only counties with agricultural activity are included — approximately 1,388 EPA AQI counties and 27,400 food environment records were excluded as a documented known limitation. The tradeoff was worth it: it keeps the county dimension grounded in the agricultural data rather than forcing artificial joins.
14 tables — 4 dimension, 10 fact — across the full dataset
cancer_environment_db in DataGrip — full table inventory visible in the database explorer
I ran six analytical queries before and after adding indexes and rewriting query logic. The results below are measured execution times from DataGrip — real numbers from real runs, not estimates.
| Query | Description | Before | After | Result |
|---|---|---|---|---|
| Q1 | Cancer incidence trend by state | 12 ms | 23 ms | Small table variance |
| Q2 | States with highest avg cancer rate | 16 ms | 8 ms | 50% faster |
| Q3 | Air quality vs cancer incidence | 293 ms | 58 ms | 80% faster |
| Q4 | Health-based water violations by state | 9 min 42 sec | 2,619 ms | 95.5% faster |
| Q5 | CAFO facilities near impaired waters | 4,658 ms | 165 ms | 96.5% faster |
| Q6 | Smoking prevalence vs cancer rate | 244 ms | 1,640 ms | Rewritten — regression resolved |
Q4 baseline EXPLAIN — full nested-loop scan through 15.3M violation rows, is_health_based_ind filter applied after the join
Q4 after optimization — 9 min 42 sec down to 2,619 ms using CTE pre-filter and composite index
Q5 after optimization — 4,658 ms down to 165 ms using subquery pre-aggregation and corrected composite index
Full before/after comparison across all six queries after Phase 4 optimization
(is_health_based_ind, pwsid). One ALTER TABLE ADD INDEX combined with a CTE rewrite fixed it — the optimizer could now pre-filter health-based violations before joining to any other table.(state_id, impaired_waters) and the optimizer ignored it entirely, choosing a single-column index instead. Flipping the order to (impaired_waters, state_id) — putting the equality filter column first — dropped Q5 from 3.6 seconds to 165 ms. The wrong column order produced no improvement at all.I implemented a logical backup strategy using mysqldump with --single-transaction for consistent InnoDB snapshots without table locks, and --source-data=2 to record the binary log position in the dump file for point-in-time recovery capability. Binary logging was confirmed active on the instance before backup execution.
Post-recovery verification — all 22,774,105 rows confirmed across all 14 tables after full restore from backup
I designed a three-tier access model using MySQL 8.0 roles. db_readonly grants SELECT only — for Tableau dashboards and reporting. db_analyst adds CREATE TEMPORARY TABLES for intermediate query results. db_etl grants SELECT, INSERT, UPDATE, and DELETE for data loading pipelines. No role holds CREATE, DROP, ALTER, or any global privilege.
Role assignments confirmed via mysql.role_edges — one role per user, least-privilege throughout
Three application users — all active, localhost only, no global privileges
Full source code — all SQL scripts and Python files organized by phase, complete documentation for every phase, and commit history showing every step of the project from day one.
View SourceSee the full portfolio — including the PostgreSQL DBA project, the Medicare data engineering pipeline, and other data and database projects built on real public datasets.
View All Projects