Before writing a single line of SQL I identified, verified, and downloaded eight publicly available datasets from the EPA, CDC, and USDA. I wrote Python download scripts for multi-year EPA AQI data and SDWIS water quality records, configured version control to exclude raw data files, and documented every source with its scientific reference and known limitations.
This project explores the relationship between cancer rates across the United States and contributing environmental, lifestyle, food, and water quality factors. Before I could build a database I needed to find data that could actually answer that question — data that was publicly available, reliable, and compatible enough to join across sources.
I identified eight datasets spanning cancer incidence, cancer mortality, EPA air quality index scores, drinking water violations, food environment atlas data, chronic disease indicators, agricultural census data, and CAFO facility records. Each dataset required a different acquisition strategy — some were direct CSV downloads, others required Python scripts to pull multi-year data from APIs or paginated endpoints.
Every dataset was verified for update frequency, data dictionary availability, and join compatibility before being included. I documented the exact URL, file format, date accessed, and known limitations for each source in the Phase 1 documentation.
| Dataset | Source | Format | Target Table |
|---|---|---|---|
| Cancer Incidence by State | CDC / NCI SEER | CSV | fact_cancer_incidence |
| Cancer Mortality by State | CDC / NCI SEER | CSV | fact_cancer_mortality |
| Air Quality Index (2000–2023) | EPA AQS | CSV (23 files) | fact_air_quality |
| Drinking Water Violations | EPA SDWIS | CSV | fact_water_violations |
| Water System Inventory | EPA SDWIS | CSV | dim_water_system |
| Food Environment Atlas | USDA ERS | Excel / CSV | fact_food_environment |
| Chronic Disease Indicators | CDC BRFSS | CSV | fact_chronic_disease_indicators |
| Census of Agriculture / CAFO | USDA NASS / EPA ECHO | CSV | fact_livestock_operations, dim_cafo_facility |
Two datasets required automated download scripts rather than one-time manual downloads. The EPA AQI dataset spans 23 annual files from 2000 through 2023 — downloading each manually would be error-prone. I wrote download_epa_aqi.py to fetch all 23 files programmatically. Similarly, I wrote download_water_quality.py with retry logic for the SDWIS endpoint which occasionally returns partial responses.
## download_epa_aqi.py — fetch 23 annual AQI files from EPA AQS
import requests, os
BASE_URL = "https://aqs.epa.gov/aqsweb/airdata/annual_aqi_by_county_{year}.zip"
OUTPUT_DIR = "data/raw/air-quality"
for year in range(2000, 2024):
url = BASE_URL.format(year=year)
response = requests.get(url, timeout=30)
filename = os.path.join(OUTPUT_DIR, f"aqi_{year}.zip")
with open(filename, "wb") as f:
f.write(response.content)
print(f"Downloaded: aqi_{year}.zip")
I configured .gitignore to exclude all raw data files from the repository. The data directory structure is tracked but the files themselves are not — raw CSV and ZIP files can reach several GB each and have no place in a Git repository. The download scripts serve as the reproducible record of how the data was acquired.
The repository is structured so every SQL script, Python file, and documentation page maps directly to a project phase. Anyone reading the commit history can follow the project from data sourcing through schema design, optimization, backup, and security in chronological order.