Overview Phase 1 — Data Sourcing Phase 2 — Schema & Load Phase 3 — Diagnostics Phase 4 — Optimization Phase 5 — Backup Phase 6 — Security
Phase 01 — Data Sourcing & Documentation

Finding the Right Data

Before writing a single line of SQL I identified, verified, and downloaded eight publicly available datasets from the EPA, CDC, and USDA. I wrote Python download scripts for multi-year EPA AQI data and SDWIS water quality records, configured version control to exclude raw data files, and documented every source with its scientific reference and known limitations.

EPA AQI CDC Cancer Data USDA Food Environment SDWIS Water Quality Python Download Scripts 8 Datasets

What I Did in Phase 1

The Research Question

This project explores the relationship between cancer rates across the United States and contributing environmental, lifestyle, food, and water quality factors. Before I could build a database I needed to find data that could actually answer that question — data that was publicly available, reliable, and compatible enough to join across sources.

The Approach

I identified eight datasets spanning cancer incidence, cancer mortality, EPA air quality index scores, drinking water violations, food environment atlas data, chronic disease indicators, agricultural census data, and CAFO facility records. Each dataset required a different acquisition strategy — some were direct CSV downloads, others required Python scripts to pull multi-year data from APIs or paginated endpoints.

8
Datasets Sourced
3
Federal Agencies
23
Years of AQI Data
22.7M
Rows Loaded

Eight Datasets, Three Agencies

Source Inventory

Every dataset was verified for update frequency, data dictionary availability, and join compatibility before being included. I documented the exact URL, file format, date accessed, and known limitations for each source in the Phase 1 documentation.

Dataset Source Format Target Table
Cancer Incidence by State CDC / NCI SEER CSV fact_cancer_incidence
Cancer Mortality by State CDC / NCI SEER CSV fact_cancer_mortality
Air Quality Index (2000–2023) EPA AQS CSV (23 files) fact_air_quality
Drinking Water Violations EPA SDWIS CSV fact_water_violations
Water System Inventory EPA SDWIS CSV dim_water_system
Food Environment Atlas USDA ERS Excel / CSV fact_food_environment
Chronic Disease Indicators CDC BRFSS CSV fact_chronic_disease_indicators
Census of Agriculture / CAFO USDA NASS / EPA ECHO CSV fact_livestock_operations, dim_cafo_facility

Download Scripts & Version Control

Python Download Scripts

Two datasets required automated download scripts rather than one-time manual downloads. The EPA AQI dataset spans 23 annual files from 2000 through 2023 — downloading each manually would be error-prone. I wrote download_epa_aqi.py to fetch all 23 files programmatically. Similarly, I wrote download_water_quality.py with retry logic for the SDWIS endpoint which occasionally returns partial responses.

## download_epa_aqi.py — fetch 23 annual AQI files from EPA AQS import requests, os BASE_URL = "https://aqs.epa.gov/aqsweb/airdata/annual_aqi_by_county_{year}.zip" OUTPUT_DIR = "data/raw/air-quality" for year in range(2000, 2024): url = BASE_URL.format(year=year) response = requests.get(url, timeout=30) filename = os.path.join(OUTPUT_DIR, f"aqi_{year}.zip") with open(filename, "wb") as f: f.write(response.content) print(f"Downloaded: aqi_{year}.zip")

Version Control Strategy

I configured .gitignore to exclude all raw data files from the repository. The data directory structure is tracked but the files themselves are not — raw CSV and ZIP files can reach several GB each and have no place in a Git repository. The download scripts serve as the reproducible record of how the data was acquired.

Decision 01
Scripts Over Manual Downloads
Writing download scripts rather than downloading files manually makes the data acquisition step reproducible. Anyone can clone the repo and re-acquire all source data by running the scripts.
Decision 02
Exclude Raw Data from Git
Raw data files are excluded via .gitignore. This keeps the repository focused on code and documentation — the artifacts that matter for a portfolio project — rather than multi-GB data files that provide no additional signal.
Decision 03
Document Known Limitations
Every dataset has limitations — coverage gaps, state-level vs county-level granularity, CAFO underreporting. I documented these before loading so they inform schema and query decisions rather than being discovered after the fact.
Decision 04
County Dimension from USDA
I built dim_county from the USDA Census of Agriculture rather than a generic county list. This means only counties with agricultural activity are included — a documented tradeoff that keeps the dimension grounded in the data rather than forcing artificial joins.

Repository Layout

Organized by Phase

The repository is structured so every SQL script, Python file, and documentation page maps directly to a project phase. Anyone reading the commit history can follow the project from data sourcing through schema design, optimization, backup, and security in chronological order.

Screenshot — Project folder structure in VS Code
VS Code explorer showing mysql-dba-project folder structure with all phases
Full project structure — sql/, docs/, data/, screenshots/, and backups/ organized by phase
Screenshot — Python download scripts
VS Code showing Python download script for EPA AQI data
download_epa_aqi.py — automated acquisition of 23 annual EPA AQI files