Database Administration Portfolio Project

MySQL DBA
End-to-End Project

An end-to-end MySQL database administration project I built on publicly available environmental and public health datasets. I designed a star schema across 14 tables, loaded 22.7 million rows of cancer, air quality, water quality, food environment, and agricultural data, diagnosed and resolved critical performance problems, implemented a full backup and recovery strategy, and secured the database with role-based access control.

MySQL 8.0.45 Schema Design Performance Tuning Backup & Recovery Security & RBAC EPA / CDC / USDA Data

View on GitHub Explore the Project

Project Overview

What This Project Demonstrates

The Goal

I wanted to build a MySQL project that covers the full scope of database administration — not just queries. Schema design, bulk data loading, performance baselining, index optimization, backup and recovery, and security aren't isolated skills. They form a complete discipline, and this project treats them that way.

The Dataset

I used eight publicly available datasets from the EPA, CDC, and USDA to explore the relationship between cancer rates across the United States and contributing environmental, lifestyle, food, and water quality factors. The database covers cancer incidence and mortality, air quality, drinking water violations, food environment, chronic disease indicators, agricultural operations, and CAFO facility data — 22.7 million rows across 14 tables in a star schema design.

22.7M

Rows Loaded

Tables

4.57 GB

Database Size

95.5%

Query Improvement

Project Phases

Screenshot — Database Size Summary

Database size summary showing 22.7M rows across 14 tables totaling 4.57 GB

cancer_environment_db — 14 tables, 22.7M rows, 4.57 GB total

Project Phases

Six Phases, One Complete Project

Phase 01 — Data Sourcing

Data Sourcing & Documentation

I identified, verified, and downloaded eight datasets from EPA, CDC, and USDA. I wrote Python download scripts for EPA AQI data across 23 years and SDWIS water quality data, configured .gitignore to exclude raw data files, and documented every source with scientific references.

Phase Complete

Phase 02 — Schema & Loading

Schema Design & Data Loading

I designed a star schema with 4 dimension tables and 10 fact tables, created the database with utf8mb4 charset, and loaded 22.7 million rows across all 14 tables using a combination of SQL LOAD DATA and Python scripts. Every design decision is documented with rationale.

Phase Complete

Phase 03 — Diagnostics

Baseline Measurement & Diagnostics

Before optimizing anything I measured everything — database size, table inventory, and execution times for six representative queries. Q4 took 9 minutes 42 seconds. That is the baseline. EXPLAIN plans identified the root cause for both problem queries before I touched a single index.

Phase Complete

Phase 04 — Optimization

Performance Optimization

I added composite indexes, rewrote queries using CTEs and subquery pre-aggregation, audited and dropped 11 unused indexes, and reviewed MySQL configuration settings. Q4 dropped from 9 minutes 42 seconds to 2,619 ms. Q5 dropped from 4,658 ms to 165 ms.

Phase Complete

Phase 05 — Backup & Recovery

Backup & Recovery

I implemented a mysqldump backup strategy using --single-transaction and --source-data=2 for consistent InnoDB snapshots with binary log position recording. I produced a 2.18 GB full backup, dropped the entire database to simulate data loss, and restored all 22.7 million rows from the backup file.

Phase Complete

Phase 06 — Security

Security & User Management

I designed and implemented role-based access control with three least-privilege roles — db_readonly, db_analyst, and db_etl — and three corresponding users. Every user is scoped to localhost only, and no application user holds global privileges or schema modification rights.

Phase Complete

Schema Design

Star Schema Across 14 Tables

Design Approach

I chose a star schema because the research question — correlating cancer rates with environmental and lifestyle factors — is fundamentally analytical. Dimension tables hold stable reference data (states, counties, years, water systems) while fact tables hold measurements and events. This separation keeps joins clean and query patterns predictable.

The most significant design decision was building dim_county from the USDA Census of Agriculture rather than a generic county list. This means only counties with agricultural activity are included — approximately 1,388 EPA AQI counties and 27,400 food environment records were excluded as a documented known limitation. The tradeoff was worth it: it keeps the county dimension grounded in the agricultural data rather than forcing artificial joins.

Screenshot — Table Inventory

Table inventory showing all 14 tables with row counts and sizes

14 tables — 4 dimension, 10 fact — across the full dataset

Screenshot — Schema Overview in DataGrip

DataGrip database explorer showing all 14 tables in cancer_environment_db

cancer_environment_db in DataGrip — full table inventory visible in the database explorer

Performance Results

Before & After Optimization

I ran six analytical queries before and after adding indexes and rewriting query logic. The results below are measured execution times from DataGrip — real numbers from real runs, not estimates.

Query	Description	Before	After	Result
Q1	Cancer incidence trend by state	12 ms	23 ms	Small table variance
Q2	States with highest avg cancer rate	16 ms	8 ms	50% faster
Q3	Air quality vs cancer incidence	293 ms	58 ms	80% faster
Q4	Health-based water violations by state	9 min 42 sec	2,619 ms	95.5% faster
Q5	CAFO facilities near impaired waters	4,658 ms	165 ms	96.5% faster
Q6	Smoking prevalence vs cancer rate	244 ms	1,640 ms	Rewritten — regression resolved

Screenshot — Q4 Baseline EXPLAIN

EXPLAIN output for Q4 showing full nested loop scan through 15.3M rows

Q4 baseline EXPLAIN — full nested-loop scan through 15.3M violation rows, is_health_based_ind filter applied after the join

Screenshot — Q4 Optimized Result

Q4 optimized query result showing 2,619ms execution time

Q4 after optimization — 9 min 42 sec down to 2,619 ms using CTE pre-filter and composite index

Screenshot — Q5 Optimized Result

Q5 optimized query result showing 165ms execution time

Q5 after optimization — 4,658 ms down to 165 ms using subquery pre-aggregation and corrected composite index

Screenshot — Before / After Comparison

Before and after performance comparison table for all six queries

Full before/after comparison across all six queries after Phase 4 optimization

Key Findings

What the Project Actually Shows

95.5%

A single composite index turned a 9-minute query into a 2.6-second query. Q4 was performing a full nested-loop scan through 15.3 million water violation rows because no index existed on (is_health_based_ind, pwsid). One ALTER TABLE ADD INDEX combined with a CTE rewrite fixed it — the optimizer could now pre-filter health-based violations before joining to any other table.

96.5%

Column order in a composite index is not a detail — it is the whole point. The first Q5 index was built as (state_id, impaired_waters) and the optimizer ignored it entirely, choosing a single-column index instead. Flipping the order to (impaired_waters, state_id) — putting the equality filter column first — dropped Q5 from 3.6 seconds to 165 ms. The wrong column order produced no improvement at all.

Dropping 11 unused indexes introduced two regressions — and diagnosing them honestly was part of the work. The index audit correctly identified 11 indexes as unused, but removing them caused Q3 and Q6 to regress significantly. Investigation revealed the root cause was query design, not just missing indexes. Both queries were rewritten using subquery pre-aggregation to eliminate cross joins between fact tables before aggregating.

2.18 GB

A backup that has never been tested is not a backup. I produced a 2.18 GB mysqldump of the 4.57 GB database, dropped the entire database to simulate catastrophic data loss, and restored all 22.7 million rows from the backup file. Every table was verified with exact COUNT(*) queries after restore. The restore procedure is fully documented in the repository.

Operations & Security

Backup, Recovery & Access Control

Backup Strategy

I implemented a logical backup strategy using mysqldump with --single-transaction for consistent InnoDB snapshots without table locks, and --source-data=2 to record the binary log position in the dump file for point-in-time recovery capability. Binary logging was confirmed active on the instance before backup execution.

Screenshot — Recovery Verification

Post-recovery row count verification showing all 22.7 million rows restored across 14 tables

Post-recovery verification — all 22,774,105 rows confirmed across all 14 tables after full restore from backup

Role-Based Access Control

I designed a three-tier access model using MySQL 8.0 roles. db_readonly grants SELECT only — for Tableau dashboards and reporting. db_analyst adds CREATE TEMPORARY TABLES for intermediate query results. db_etl grants SELECT, INSERT, UPDATE, and DELETE for data loading pipelines. No role holds CREATE, DROP, ALTER, or any global privilege.

Screenshot — Role Assignments

MySQL role assignments showing db_analyst to analyst_user, db_etl to etl_user, db_readonly to readonly_user

Role assignments confirmed via mysql.role_edges — one role per user, least-privilege throughout

Screenshot — User Inventory

User inventory showing analyst_user, etl_user, and readonly_user with no locked accounts

Three application users — all active, localhost only, no global privileges

Technologies & Tools

Built with MySQL 8.0

MySQL 8.0.45 Core database engine

DataGrip Primary query tool

mysqldump Logical backup

Python 3.13 Data download & loading

EXPLAIN Execution plan analysis

sys schema Index usage audit

MySQL Roles Access control

Git & GitHub Version control

PowerShell CLI & automation

VS Code Script authoring

EPA / CDC / USDA Public data sources

Windows 11 Operating environment

Explore the Project

Two Ways In

GitHub Repository

Full source code — all SQL scripts and Python files organized by phase, complete documentation for every phase, and commit history showing every step of the project from day one.

View Source

More Projects

See the full portfolio — including the PostgreSQL DBA project, the Medicare data engineering pipeline, and other data and database projects built on real public datasets.

View All Projects

MySQL DBAEnd-to-End Project

What This Project Demonstrates

The Goal

The Dataset

Six Phases, One Complete Project

Star Schema Across 14 Tables

Design Approach

Before & After Optimization

What the Project Actually Shows

Backup, Recovery & Access Control

Backup Strategy

Role-Based Access Control

Built with MySQL 8.0

Two Ways In

MySQL DBA
End-to-End Project