Phase 01 — Data Sourcing & Documentation

Finding the Right Data

Before writing a single line of SQL I identified, verified, and downloaded eight publicly available datasets from the EPA, CDC, and USDA. I wrote Python download scripts for multi-year EPA AQI data and SDWIS water quality records, configured version control to exclude raw data files, and documented every source with its scientific reference and known limitations.

EPA AQI CDC Cancer Data USDA Food Environment SDWIS Water Quality Python Download Scripts 8 Datasets

Overview

What I Did in Phase 1

The Research Question

This project explores the relationship between cancer rates across the United States and contributing environmental, lifestyle, food, and water quality factors. Before I could build a database I needed to find data that could actually answer that question — data that was publicly available, reliable, and compatible enough to join across sources.

The Approach

I identified eight datasets spanning cancer incidence, cancer mortality, EPA air quality index scores, drinking water violations, food environment atlas data, chronic disease indicators, agricultural census data, and CAFO facility records. Each dataset required a different acquisition strategy — some were direct CSV downloads, others required Python scripts to pull multi-year data from APIs or paginated endpoints.

Datasets Sourced

Federal Agencies

Years of AQI Data

22.7M

Rows Loaded

Data Sources

Eight Datasets, Three Agencies

Source Inventory

Every dataset was verified for update frequency, data dictionary availability, and join compatibility before being included. I documented the exact URL, file format, date accessed, and known limitations for each source in the Phase 1 documentation.

Dataset	Source	Format	Target Table
Cancer Incidence by State	CDC / NCI SEER	CSV	fact_cancer_incidence
Cancer Mortality by State	CDC / NCI SEER	CSV	fact_cancer_mortality
Air Quality Index (2000–2023)	EPA AQS	CSV (23 files)	fact_air_quality
Drinking Water Violations	EPA SDWIS	CSV	fact_water_violations
Water System Inventory	EPA SDWIS	CSV	dim_water_system
Food Environment Atlas	USDA ERS	Excel / CSV	fact_food_environment
Chronic Disease Indicators	CDC BRFSS	CSV	fact_chronic_disease_indicators
Census of Agriculture / CAFO	USDA NASS / EPA ECHO	CSV	fact_livestock_operations, dim_cafo_facility

Data Acquisition

Download Scripts & Version Control

Python Download Scripts

Two datasets required automated download scripts rather than one-time manual downloads. The EPA AQI dataset spans 23 annual files from 2000 through 2023 — downloading each manually would be error-prone. I wrote download_epa_aqi.py to fetch all 23 files programmatically. Similarly, I wrote download_water_quality.py with retry logic for the SDWIS endpoint which occasionally returns partial responses.

                    ## download_epa_aqi.py — fetch 23 annual AQI files from EPA AQS
import requests, os

BASE_URL = "https://aqs.epa.gov/aqsweb/airdata/annual_aqi_by_county_{year}.zip"
OUTPUT_DIR = "data/raw/air-quality"

for year in range(2000, 2024):
    url = BASE_URL.format(year=year)
    response = requests.get(url, timeout=30)
    filename = os.path.join(OUTPUT_DIR, f"aqi_{year}.zip")
    with open(filename, "wb") as f:
        f.write(response.content)
    print(f"Downloaded: aqi_{year}.zip")
                

Version Control Strategy

I configured .gitignore to exclude all raw data files from the repository. The data directory structure is tracked but the files themselves are not — raw CSV and ZIP files can reach several GB each and have no place in a Git repository. The download scripts serve as the reproducible record of how the data was acquired.

Decision 01

Scripts Over Manual Downloads

Writing download scripts rather than downloading files manually makes the data acquisition step reproducible. Anyone can clone the repo and re-acquire all source data by running the scripts.

Decision 02

Exclude Raw Data from Git

Raw data files are excluded via .gitignore. This keeps the repository focused on code and documentation — the artifacts that matter for a portfolio project — rather than multi-GB data files that provide no additional signal.

Decision 03

Document Known Limitations

Every dataset has limitations — coverage gaps, state-level vs county-level granularity, CAFO underreporting. I documented these before loading so they inform schema and query decisions rather than being discovered after the fact.

Decision 04

County Dimension from USDA

I built dim_county from the USDA Census of Agriculture rather than a generic county list. This means only counties with agricultural activity are included — a documented tradeoff that keeps the dimension grounded in the data rather than forcing artificial joins.

Project Structure

Repository Layout

Organized by Phase

The repository is structured so every SQL script, Python file, and documentation page maps directly to a project phase. Anyone reading the commit history can follow the project from data sourcing through schema design, optimization, backup, and security in chronological order.

Screenshot — Project folder structure in VS Code

VS Code explorer showing mysql-dba-project folder structure with all phases

Full project structure — sql/, docs/, data/, screenshots/, and backups/ organized by phase

Screenshot — Python download scripts

VS Code showing Python download script for EPA AQI data

download_epa_aqi.py — automated acquisition of 23 annual EPA AQI files

Phase 2 — Schema & Data Loading