Education Technology Impact Analysis

Author

Naseer Aryobee

Project Overview

This project analyzes how education technology adoption may relate to public school performance across the United States.

The goal of this project is to build a portfolio-grade data engineering and analytics workflow that extracts, cleans, transforms, loads, and analyzes education-related datasets using Python, PostgreSQL, SQL, and Quarto.

I use this workflow to develop a reproducible analytics pipeline that can support long-term research into whether access to technology in schools is associated with measurable differences in educational outcomes.

Project Objectives

The main objectives of this project are to:

Extract raw public education, student outcome, and technology participation datasets from trusted national sources
Clean and standardize inconsistent public-sector data files
Build a PostgreSQL analytical warehouse using fact and dimension tables
Connect district, state, year, educational outcome, and technology adoption data through a dimensional model
Perform SQL-based analysis directly against PostgreSQL
Create visualizations and reporting outputs inside a reproducible Quarto HTML report
Evaluate how technology-enabled learning participation may relate to measurable student outcome trends

Data Sources

This project uses publicly available education and technology-related datasets, including:

Dataset	Source Purpose
CCD Common Core of Data	District and school operational information
NAEP National Assessment of Educational Progress	State-level academic performance outcomes
CRDC Civil Rights Data Collection — LEA Characteristics	District-level characteristics and enrollment information
CRDC Distance Education	Technology-enabled learning participation and distance education enrollment indicators

The completed workflow integrates CCD district data, NAEP Grade 8 mathematics performance data, CRDC district characteristics, and CRDC distance education participation metrics. Together, these datasets support analysis of how technology adoption and district-level conditions may relate to student outcome patterns.

Tools and Technologies

This project uses:

Tool	Purpose
Python	Data extraction, cleaning, transformation, and validation
pandas	Data manipulation and profiling
PostgreSQL	Analytical warehouse storage
SQL	Warehouse querying and analysis
Quarto	Reproducible HTML reporting
matplotlib	Data visualization
GitHub	Version control and portfolio documentation

Research Direction

The central research question behind this project is:

Does increased access to technology-enabled learning in public schools appear to be associated with measurable differences in academic performance across states, districts, and reporting years?

This project builds an end-to-end data engineering and analytics workflow to explore that question using integrated public education datasets, PostgreSQL warehouse modeling, SQL analysis, and reproducible reporting.

Project Setup

Before any data work, I load the libraries used throughout the report and connect to the local PostgreSQL warehouse. Credentials are read from a .env file (see .env.example for the variables you need to define locally). The .env file itself is gitignored — it never leaves the machine.

import os
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from dotenv import load_dotenv, find_dotenv
from sqlalchemy import create_engine
from sqlalchemy.engine import URL

# Load credentials from the local .env file.
load_dotenv(find_dotenv(usecwd=True))

DB_USER     = os.getenv("DB_USER")
DB_PASSWORD = os.getenv("DB_PASSWORD")
DB_HOST     = os.getenv("DB_HOST")
DB_PORT     = os.getenv("DB_PORT")
DB_NAME     = os.getenv("DB_NAME")

# Fail fast if anything is missing — easier to debug than a cryptic error later.
required = {
    "DB_USER": DB_USER, "DB_PASSWORD": DB_PASSWORD,
    "DB_HOST": DB_HOST, "DB_PORT": DB_PORT, "DB_NAME": DB_NAME
}
missing = [k for k, v in required.items() if not v]
if missing:
    raise RuntimeError(f"Missing environment variables: {missing}")

# URL.create handles special characters in the password safely.
database_url = URL.create(
    drivername="postgresql+psycopg2",
    username=DB_USER,
    password=DB_PASSWORD,
    host=DB_HOST,
    port=int(DB_PORT),
    database=DB_NAME,
)
engine = create_engine(database_url)

# Analytical SQL queries live in sql/queries/ rather than being embedded
# inline as Python strings. This helper reads one of those files.
# The notebooks/ folder is one level under the repo root, so queries are
# at ../sql/queries/<filename>.sql relative to this .qmd.
SQL_QUERIES_DIR = Path("../sql/queries").resolve()

def read_sql_file(filename: str) -> str:
    """Read a SQL file from sql/queries/ and return its contents."""
    path = SQL_QUERIES_DIR / filename
    if not path.exists():
        raise FileNotFoundError(f"SQL file not found: {path}")
    return path.read_text(encoding="utf-8")

# Quick connection sanity check.
with engine.connect() as conn:
    print(f"Connected to PostgreSQL: {DB_NAME} @ {DB_HOST}:{DB_PORT}")

Connected to PostgreSQL: education_tech_impact @ localhost:5432

Data Extraction — CCD Dataset

In this step, I load the Common Core of Data (CCD) dataset into Python for initial inspection and preprocessing.

This dataset contains district-level information including enrollment, number of schools, and geographic details.

I use pandas to read the CSV file and examine its structure before loading it into PostgreSQL.

# Define the relative path to the CCD dataset.
# The raw data is stored locally and excluded from GitHub using .gitignore.
file_path = "../data/raw/ccd/ccd_lea_2024.csv"

# Load the CCD CSV file into a pandas DataFrame.
# This allows me to inspect the dataset before database loading.
df_ccd = pd.read_csv(file_path)

# Display the first five records to confirm the file loaded correctly.
df_ccd.head()

C:\Users\User\AppData\Local\Temp\ipykernel_21316\1148928295.py:7: DtypeWarning: Columns (0: MSTREET3) have mixed types. Specify dtype option on import or set low_memory=False.
  df_ccd = pd.read_csv(file_path)

	SCHOOL_YEAR	FIPST	STATENAME	ST	LEA_NAME	STATE_AGENCY_NO	UNION	ST_LEAID	LEAID	MSTREET1	...	G_11_OFFERED	G_12_OFFERED	G_13_OFFERED	G_UG_OFFERED	G_AE_OFFERED	GSLO	GSHI	LEVEL	IGOFFERED	OPERATIONAL_SCHOOLS
0	2024-2025	1	ALABAMA	AL	Alabama Youth Services	1	NaN	AL-210	100002	1000 Industrial School Road	...	Yes	Yes	No	No	No	KG	12	Other	As reported	0
1	2024-2025	1	ALABAMA	AL	Albertville City	1	NaN	AL-101	100005	8379 US Highway 431	...	Yes	Yes	No	No	No	PK	12	Other	As reported	7
2	2024-2025	1	ALABAMA	AL	Marshall County	1	NaN	AL-048	100006	12380 US Highway 431 S	...	Yes	Yes	No	No	No	PK	12	Other	As reported	17
3	2024-2025	1	ALABAMA	AL	Hoover City	1	NaN	AL-158	100007	2810 Metropolitan Way	...	Yes	Yes	No	No	No	PK	12	Other	As reported	18
4	2024-2025	1	ALABAMA	AL	Madison City	1	NaN	AL-169	100008	211 Celtic Dr	...	Yes	Yes	No	No	No	PK	12	Other	As reported	13

5 rows × 58 columns

Initial Data Inspection — CCD Dataset

In this step, I inspect the CCD dataset to understand its size, structure, and key columns before performing any cleaning or transformation.

This inspection helps confirm that the data loaded correctly and provides an initial understanding of the fields available for analysis.

# Display the number of rows and columns in the CCD dataset.
# This gives me a quick overview of the dataset size before cleaning.
row_count, column_count = df_ccd.shape

print(f"Number of rows: {row_count}")
print(f"Number of columns: {column_count}")

# Display all column names so I can identify fields needed for the district dimension table.
df_ccd.columns.tolist()

Number of rows: 19630
Number of columns: 58

['SCHOOL_YEAR',
 'FIPST',
 'STATENAME',
 'ST',
 'LEA_NAME',
 'STATE_AGENCY_NO',
 'UNION',
 'ST_LEAID',
 'LEAID',
 'MSTREET1',
 'MSTREET2',
 'MSTREET3',
 'MCITY',
 'MSTATE',
 'MZIP',
 'MZIP4',
 'LSTREET1',
 'LSTREET2',
 'LSTREET3',
 'LCITY',
 'LSTATE',
 'LZIP',
 'LZIP4',
 'PHONE',
 'WEBSITE',
 'SY_STATUS',
 'SY_STATUS_TEXT',
 'UPDATED_STATUS',
 'UPDATED_STATUS_TEXT',
 'EFFECTIVE_DATE',
 'LEA_TYPE',
 'LEA_TYPE_TEXT',
 'OUT_OF_STATE_FLAG',
 'CHARTER_LEA',
 'CHARTER_LEA_TEXT',
 'NOGRADES',
 'G_PK_OFFERED',
 'G_KG_OFFERED',
 'G_1_OFFERED',
 'G_2_OFFERED',
 'G_3_OFFERED',
 'G_4_OFFERED',
 'G_5_OFFERED',
 'G_6_OFFERED',
 'G_7_OFFERED',
 'G_8_OFFERED',
 'G_9_OFFERED',
 'G_10_OFFERED',
 'G_11_OFFERED',
 'G_12_OFFERED',
 'G_13_OFFERED',
 'G_UG_OFFERED',
 'G_AE_OFFERED',
 'GSLO',
 'GSHI',
 'LEVEL',
 'IGOFFERED',
 'OPERATIONAL_SCHOOLS']

Selecting District-Level Columns

In this step, I identify the key district-level fields that will be used to build the district dimension table in PostgreSQL.

The CCD dataset contains many administrative and metadata fields. For the initial data model, I focus on the core district identification and geographic columns needed for analysis.

# Select district-related columns needed for the dimension table.
district_columns = [
    "LEAID",
    "LEA_NAME",
    "ST",
    "STATENAME",
    "MCITY",
    "WEBSITE",
    "OPERATIONAL_SCHOOLS"
]

# Create a smaller district-focused DataFrame.
df_districts = df_ccd[district_columns]

# Display the first five district records.
df_districts.head()

	LEAID	LEA_NAME	ST	STATENAME	MCITY	WEBSITE	OPERATIONAL_SCHOOLS
0	100002	Alabama Youth Services	AL	ALABAMA	Mt Meigs	http://www.dys.alabama.gov/school-district.html	0
1	100005	Albertville City	AL	ALABAMA	Albertville	http://www.albertk12.org	7
2	100006	Marshall County	AL	ALABAMA	Guntersville	http://www.marshallk12.org	17
3	100007	Hoover City	AL	ALABAMA	Hoover	http://www.hoovercityschools.net	18
4	100008	Madison City	AL	ALABAMA	Madison	http://www.madisoncity.k12.al.us	13

Cleaning the District Dimension Dataset

Before loading the district data into PostgreSQL, I clean the dataset to remove duplicate districts and standardize the column structure.

This step helps ensure the dimension table contains one unique record per school district.

# Create a copy of the district DataFrame to preserve the original dataset.
df_district_dim = df_districts.copy()

# Remove duplicate district records using the LEAID identifier.
df_district_dim = df_district_dim.drop_duplicates(subset=["LEAID"])

# Standardize column names for PostgreSQL compatibility.
df_district_dim.columns = [
    "district_id",
    "district_name",
    "state_code",
    "state_name",
    "city",
    "website",
    "operational_schools"
]

# Reset the index after cleaning so the DataFrame has a clean sequence.
df_district_dim = df_district_dim.reset_index(drop=True)

# Display the cleaned district dimension dataset.
df_district_dim.head()

	district_id	district_name	state_code	state_name	city	website	operational_schools
0	100002	Alabama Youth Services	AL	ALABAMA	Mt Meigs	http://www.dys.alabama.gov/school-district.html	0
1	100005	Albertville City	AL	ALABAMA	Albertville	http://www.albertk12.org	7
2	100006	Marshall County	AL	ALABAMA	Guntersville	http://www.marshallk12.org	17
3	100007	Hoover City	AL	ALABAMA	Hoover	http://www.hoovercityschools.net	18
4	100008	Madison City	AL	ALABAMA	Madison	http://www.madisoncity.k12.al.us	13

Loading the District Dimension Table into PostgreSQL

After cleaning the district dataset, I load the data into PostgreSQL as a dimension table.

This step creates the foundation for future analytics and fact tables that will be connected through district identifiers.

# Load the cleaned district dimension dataset into PostgreSQL.
# if_exists="replace" allows this workflow to be rerun during development
# without creating duplicate tables or duplicate records.
df_district_dim.to_sql(
    name="dim_districts",
    con=engine,
    schema="public",
    if_exists="replace",
    index=False
)

# Confirm successful PostgreSQL table creation.
print("District dimension table loaded successfully into PostgreSQL.")

District dimension table loaded successfully into PostgreSQL.

Verifying the PostgreSQL Target Database

Before continuing, I verify which PostgreSQL database the workflow is connected to and confirm where the table was created.

# Query PostgreSQL to confirm the active database, schema, and user.
connection_check_query = """
SELECT
    current_database() AS active_database,
    current_schema() AS active_schema,
    current_user AS active_user;
"""

# Display the PostgreSQL connection details.
pd.read_sql(connection_check_query, engine)

	active_database	active_schema	active_user
0	education_tech_impact	public	postgres

Verifying the District Dimension Table in PostgreSQL

After loading the district dimension table, I verify that the table exists inside PostgreSQL.

# Query PostgreSQL to confirm the dimension table exists.
table_check_query = """
SELECT
    table_schema,
    table_name
FROM information_schema.tables
WHERE table_schema = 'public'
  AND table_name = 'dim_districts';
"""

# Display matching PostgreSQL tables.
pd.read_sql(table_check_query, engine)

	table_schema	table_name
0	public	dim_districts

Counting Records in the District Dimension Table

To confirm the table was loaded correctly, I query PostgreSQL and count the total number of district records stored in the dimension table.

# Count records inside the PostgreSQL dimension table.
district_count_query = """
SELECT COUNT(*) AS total_district_records
FROM public.dim_districts;
"""

# Display the total number of district records.
pd.read_sql(district_count_query, engine)

	total_district_records
0	19630

Previewing Records from the District Dimension Table

As a final validation step, I preview the first few records directly from the PostgreSQL table.

# Preview records directly from the PostgreSQL dimension table.
district_preview_query = """
SELECT
    district_id,
    district_name,
    state_code,
    state_name,
    city,
    operational_schools
FROM public.dim_districts
LIMIT 5;
"""

# Display the first five records from PostgreSQL.
pd.read_sql(district_preview_query, engine)

	district_id	district_name	state_code	state_name	city	operational_schools
0	100002	Alabama Youth Services	AL	ALABAMA	Mt Meigs	0
1	100005	Albertville City	AL	ALABAMA	Albertville	7
2	100006	Marshall County	AL	ALABAMA	Guntersville	17
3	100007	Hoover City	AL	ALABAMA	Hoover	18
4	100008	Madison City	AL	ALABAMA	Madison	13

Creating the State Dimension Table

To support state-level analysis, I create a separate state dimension table from the cleaned district dataset.

This table stores one unique record per state and allows future fact tables to connect district-level data to state-level summaries.

# Create the state dimension table from the cleaned district dataset.
df_state_dim = (
    df_district_dim[["state_code", "state_name"]]
    .drop_duplicates()
    .sort_values(by="state_code")
    .reset_index(drop=True)
)

# Add a numeric state key for PostgreSQL modeling.
df_state_dim.insert(0, "state_key", range(1, len(df_state_dim) + 1))

# Display the state dimension table.
df_state_dim.head()

	state_key	state_code	state_name
0	1	AK	ALASKA
1	2	AL	ALABAMA
2	3	AR	ARKANSAS
3	4	AS	AMERICAN SAMOA
4	5	AZ	ARIZONA

Loading the State Dimension Table into PostgreSQL

After creating the state dimension dataset, I load the table into PostgreSQL for future state-level analytics and reporting.

# Load the state dimension table into PostgreSQL.
df_state_dim.to_sql(
    name="dim_states",
    con=engine,
    schema="public",
    if_exists="replace",
    index=False
)

# Confirm successful PostgreSQL table creation.
print("State dimension table loaded successfully into PostgreSQL.")

State dimension table loaded successfully into PostgreSQL.

Verifying the State Dimension Table in PostgreSQL

To confirm the state dimension table loaded successfully, I query PostgreSQL and count the total number of state records.

state_count_query = """
SELECT COUNT(*) AS total_state_records
FROM public.dim_states;
"""

pd.read_sql(state_count_query, engine)

	total_state_records
0	57

Creating the District School Counts Fact Table

To support analytical reporting, I create the first fact table using district-level school count data.

This table connects each district to its state and stores the number of operational schools reported in the CCD dataset.

# Create a fact table for district-level school counts.
fact_district_school_counts = df_district_dim[
    [
        "district_id",
        "state_code",
        "operational_schools"
    ]
].copy()

# Rename the school count column to make the metric purpose clear.
fact_district_school_counts = fact_district_school_counts.rename(
    columns={"operational_schools": "operational_school_count"}
)

# Display the first five records from the fact table.
fact_district_school_counts.head()

	district_id	state_code	operational_school_count
0	100002	AL	0
1	100005	AL	7
2	100006	AL	17
3	100007	AL	18
4	100008	AL	13

Loading the District School Counts Fact Table into PostgreSQL

After creating the district school counts fact table, I load it into PostgreSQL so it can be queried with SQL and used for future dashboard analysis.

# Load the district school counts fact table into PostgreSQL.
fact_district_school_counts.to_sql(
    name="fact_district_school_counts",
    con=engine,
    schema="public",
    if_exists="replace",
    index=False
)

# Confirm successful PostgreSQL table creation.
print("District school counts fact table loaded successfully into PostgreSQL.")

District school counts fact table loaded successfully into PostgreSQL.

Analyzing States with the Highest Number of Operational Schools

To begin the analytical portion of the project, I query PostgreSQL to identify which states have the highest total number of operational schools.

This query joins the state dimension table with the district school counts fact table and aggregates the results at the state level.

# SQL query to identify states with the highest operational school counts.
top_states_query = read_sql_file("top_states_by_operational_schools.sql")
# Execute the SQL query and display the results.
pd.read_sql(top_states_query, engine)

	state_name	total_operational_schools
0	CALIFORNIA	10303.0
1	TEXAS	9192.0
2	NEW YORK	4810.0
3	ILLINOIS	4398.0
4	FLORIDA	4186.0
5	OHIO	3543.0
6	MICHIGAN	3468.0
7	PENNSYLVANIA	2932.0
8	MINNESOTA	2741.0
9	NORTH CAROLINA	2705.0

Visualizing the Top 10 States by Operational Schools

To make the SQL analysis easier to interpret, I create a bar chart showing the top 10 states by total operational school count.

This visualization helps translate the PostgreSQL query results into a clear portfolio-ready analytical output.

# Store the SQL query result in a DataFrame for visualization.
top_states_df = pd.read_sql(top_states_query, engine)

# Create a horizontal bar chart for readability.
plt.figure(figsize=(10, 6))
plt.barh(
    top_states_df["state_name"],
    top_states_df["total_operational_schools"]
)

# Invert the y-axis so the state with the highest count appears at the top.
plt.gca().invert_yaxis()

# Add chart labels and title.
plt.title("Top 10 States by Operational School Count")
plt.xlabel("Total Operational Schools")
plt.ylabel("State")

# Improve spacing so labels do not overlap.
plt.subplots_adjust(left=0.28)

# Display the chart in the Quarto HTML report.
plt.show()

Analyzing Average Operational Schools per District by State

To compare state-level district structures, I calculate the average number of operational schools per district for each state.

This helps show which states tend to have larger or smaller district-level school systems.

# SQL query to calculate the average number of operational schools per district by state.
avg_schools_per_district_query = read_sql_file("avg_schools_per_district_by_state.sql")
# Execute the SQL query and display the results.
pd.read_sql(avg_schools_per_district_query, engine)

	state_name	avg_operational_schools_per_district
0	PUERTO RICO	862.00
1	HAWAII	296.00
2	MARYLAND	56.32
3	FLORIDA	49.83
4	NEVADA	36.75
5	NORTHERN MARIANAS	35.00
6	AMERICAN SAMOA	28.00
7	SOUTH CAROLINA	13.30
8	TENNESSEE	12.84
9	VIRGINIA	10.17

Visualizing Average Operational Schools per District by State

To better interpret the state-level district averages, I create a second visualization showing the average number of operational schools per district.

# Store the SQL query result in a DataFrame for visualization.
avg_schools_df = pd.read_sql(avg_schools_per_district_query, engine)

# Create a horizontal bar chart.
plt.figure(figsize=(10, 6))
plt.barh(
    avg_schools_df["state_name"],
    avg_schools_df["avg_operational_schools_per_district"]
)

# Display the highest averages at the top.
plt.gca().invert_yaxis()

# Add chart labels and title.
plt.title("Average Operational Schools per District by State")
plt.xlabel("Average Operational Schools")
plt.ylabel("State")

# Improve chart spacing.
plt.subplots_adjust(left=0.28)

# Display the visualization.
plt.show()

Tableau Dashboard Development Plan

The core data engineering and analytical workflow is now complete. The remaining presentation layer for this project is Tableau dashboard development.

The Tableau dashboard will use the cleaned warehouse outputs and focus on communicating the project’s main findings clearly for recruiters, hiring managers, and interview discussions.

NAEP Grade 8 mathematics performance by state
Post-pandemic score changes between 2019 and 2024
Distance education adoption and participation by state
Enrollment size compared with educational performance
Technology-enabled learning participation compared with student outcome measures

This final dashboard layer will summarize the technical workflow in an executive-style visual format while the Quarto report remains the reproducible data engineering documentation.

Current PostgreSQL Warehouse Schema

At this stage of the project, the PostgreSQL warehouse contains dimension and fact tables supporting operational education data, student outcome analysis, district characteristics, and distance education technology adoption metrics.

Table Name	Description
dim_districts	Stores district-level identification and geographic information
dim_states	Stores state-level identifiers and surrogate state keys
dim_years	Stores reporting years for longitudinal analysis
dim_crdc_district_characteristics	Stores CRDC district characteristics, enrollment, and district-level context

Table Name	Description
fact_district_school_counts	Stores operational school counts by district and state
fact_district_school_counts_enhanced	Stores school counts using surrogate state keys for enhanced warehouse modeling
fact_naep_math_performance	Stores NAEP Grade 8 mathematics performance outcomes by state and reporting year
fact_distance_education_adoption	Stores CRDC distance education adoption and participation metrics by district

This warehouse structure supports multi-source educational analytics, longitudinal student outcome analysis, distance education adoption reporting, and Tableau dashboard development.

Data Model Relationships

The warehouse model connects operational, outcome, district characteristic, and technology adoption data through shared state, district, and year identifiers.

Fact / Analytical Table	Related Dimension / Dataset	Join Key
fact_district_school_counts	dim_districts	district_id
fact_district_school_counts_enhanced	dim_states	state_key
fact_naep_math_performance	dim_states	state_key
fact_naep_math_performance	dim_years	year_key
dim_crdc_district_characteristics	fact_distance_education_adoption	district_id
fact_distance_education_adoption	NAEP state-level analytical outputs	state_name

This design allows technology adoption metrics, district characteristics, enrollment scale, and student outcome measures to be analyzed together in a consistent warehouse-driven workflow.

Data Quality Validation Checks

Before using the warehouse tables for analysis, I perform several basic data quality checks to validate the integrity of the transformed data.

# Count missing district identifiers.
missing_district_ids = df_district_dim["district_id"].isnull().sum()

# Count missing state codes.
missing_state_codes = df_district_dim["state_code"].isnull().sum()

# Count duplicate district identifiers.
duplicate_district_ids = df_district_dim["district_id"].duplicated().sum()

# Display validation results.
validation_results = pd.DataFrame({
    "validation_check": [
        "Missing district IDs",
        "Missing state codes",
        "Duplicate district IDs"
    ],
    "result": [
        missing_district_ids,
        missing_state_codes,
        duplicate_district_ids
    ]
})

validation_results

	validation_check	result
0	Missing district IDs	0
1	Missing state codes	0
2	Duplicate district IDs	0

Creating the Year Dimension Table

To support longitudinal and multi-year analysis, I create a year dimension table.

This table connects NAEP reporting years and future multi-year education datasets through a consistent warehouse reporting year structure.

# Define the reporting years used in the project.
analysis_years = [
    2009,
    2011,
    2013,
    2015,
    2017,
    2019,
    2022,
    2024
]

# Create the year dimension table.
df_year_dim = pd.DataFrame({
    "year_key": range(1, len(analysis_years) + 1),
    "reporting_year": analysis_years
})

# Display the year dimension table.
df_year_dim

	year_key	reporting_year
0	1	2009
1	2	2011
2	3	2013
3	4	2015
4	5	2017
5	6	2019
6	7	2022
7	8	2024

Loading the Year Dimension Table into PostgreSQL

After creating the year dimension table, I load it into PostgreSQL so future CCD, NAEP, and CRDC datasets can connect to a consistent reporting year structure.

# Load the year dimension table into PostgreSQL.
df_year_dim.to_sql(
    name="dim_years",
    con=engine,
    schema="public",
    if_exists="replace",
    index=False
)

# Confirm successful PostgreSQL table creation.
print("Year dimension table loaded successfully into PostgreSQL.")

Year dimension table loaded successfully into PostgreSQL.

Verifying the Year Dimension Table in PostgreSQL

To confirm the year dimension table loaded correctly, I query PostgreSQL and display the stored reporting years.

# Query PostgreSQL to display the year dimension table.
year_dimension_query = """
SELECT *
FROM public.dim_years
ORDER BY reporting_year;
"""

# Display the year dimension records.
pd.read_sql(year_dimension_query, engine)

	year_key	reporting_year
0	1	2009
1	2	2011
2	3	2013
3	4	2015
4	5	2017
5	6	2019
6	7	2022
7	8	2024

Adding a Surrogate Key to the State Dimension

To improve warehouse design and prepare for future multi-dataset integration, I create a surrogate key for the state dimension table.

Creating the State Dimension Table with Surrogate Keys

To support future warehouse joins and multi-dataset integration, I create a dedicated state dimension table with surrogate integer keys.

# Create a unique state dimension table from the district dimension dataset.
df_states_dim = (
    df_district_dim[
        ["state_code", "state_name"]
    ]
    .drop_duplicates()
    .sort_values("state_name")
    .reset_index(drop=True)
)

# Add a surrogate integer key for each state.
df_states_dim.insert(
    0,
    "state_key",
    range(1, len(df_states_dim) + 1)
)

# Display the state dimension table.
df_states_dim.head()

	state_key	state_code	state_name
0	1	AL	ALABAMA
1	2	AK	ALASKA
2	3	AS	AMERICAN SAMOA
3	4	AZ	ARIZONA
4	5	AR	ARKANSAS

Loading the Enhanced State Dimension into PostgreSQL

After adding surrogate keys to the state dimension table, I load the enhanced warehouse dimension into PostgreSQL.

# Load the enhanced state dimension table into PostgreSQL.
df_states_dim.to_sql(
    name="dim_states",
    con=engine,
    schema="public",
    if_exists="replace",
    index=False
)

# Confirm successful PostgreSQL table creation.
print("Enhanced state dimension table loaded successfully into PostgreSQL.")

Enhanced state dimension table loaded successfully into PostgreSQL.

Verifying the Enhanced State Dimension in PostgreSQL

To confirm the enhanced state dimension table loaded correctly, I query PostgreSQL and display the stored surrogate keys and state records.

# Query PostgreSQL to display the enhanced state dimension table.
state_dimension_query = """
SELECT *
FROM public.dim_states
ORDER BY state_key;
"""

# Display the PostgreSQL state dimension table.
pd.read_sql(state_dimension_query, engine)

	state_key	state_code	state_name
0	1	AL	ALABAMA
1	2	AK	ALASKA
2	3	AS	AMERICAN SAMOA
3	4	AZ	ARIZONA
4	5	AR	ARKANSAS
5	6	BI	BUREAU OF INDIAN EDUCATION
6	7	CA	CALIFORNIA
7	8	CO	COLORADO
8	9	CT	CONNECTICUT
9	10	DE	DELAWARE
10	11	DC	DISTRICT OF COLUMBIA
11	12	FL	FLORIDA
12	13	GA	GEORGIA
13	14	GU	GUAM
14	15	HI	HAWAII
15	16	ID	IDAHO
16	17	IL	ILLINOIS
17	18	IN	INDIANA
18	19	IA	IOWA
19	20	KS	KANSAS
20	21	KY	KENTUCKY
21	22	LA	LOUISIANA
22	23	ME	MAINE
23	24	MD	MARYLAND
24	25	MA	MASSACHUSETTS
25	26	MI	MICHIGAN
26	27	MN	MINNESOTA
27	28	MS	MISSISSIPPI
28	29	MO	MISSOURI
29	30	MT	MONTANA
30	31	NE	NEBRASKA
31	32	NV	NEVADA
32	33	NH	NEW HAMPSHIRE
33	34	NJ	NEW JERSEY
34	35	NM	NEW MEXICO
35	36	NY	NEW YORK
36	37	NC	NORTH CAROLINA
37	38	ND	NORTH DAKOTA
38	39	MP	NORTHERN MARIANAS
39	40	OH	OHIO
40	41	OK	OKLAHOMA
41	42	OR	OREGON
42	43	PA	PENNSYLVANIA
43	44	PR	PUERTO RICO
44	45	RI	RHODE ISLAND
45	46	SC	SOUTH CAROLINA
46	47	SD	SOUTH DAKOTA
47	48	TN	TENNESSEE
48	49	TX	TEXAS
49	50	VI	U.S. VIRGIN ISLANDS
50	51	UT	UTAH
51	52	VT	VERMONT
52	53	VA	VIRGINIA
53	54	WA	WASHINGTON
54	55	WV	WEST VIRGINIA
55	56	WI	WISCONSIN
56	57	WY	WYOMING

Creating an Enhanced Fact Table with Surrogate Keys

To improve warehouse normalization and prepare for future multi-dataset integration, I create an enhanced fact table that connects to the state dimension using surrogate keys.

# Merge the existing district school counts fact table with the enhanced state dimension.
df_fact_enhanced = fact_district_school_counts.merge(
    df_states_dim[
        ["state_key", "state_code"]
    ],
    on="state_code",
    how="left"
)

# Reorder columns for warehouse modeling.
df_fact_enhanced = df_fact_enhanced[
    [
        "district_id",
        "state_key",
        "state_code",
        "operational_school_count"
    ]
]

# Display the enhanced fact table.
df_fact_enhanced.head()

	district_id	state_key	state_code	operational_school_count
0	100002	1	AL	0
1	100005	1	AL	7
2	100006	1	AL	17
3	100007	1	AL	18
4	100008	1	AL	13

Loading the Enhanced District School Counts Fact Table into PostgreSQL

After adding the state surrogate key to the fact table, I load the enhanced fact table into PostgreSQL for improved warehouse-style analytics.

# Load the enhanced district school counts fact table into PostgreSQL.
df_fact_enhanced.to_sql(
    name="fact_district_school_counts_enhanced",
    con=engine,
    schema="public",
    if_exists="replace",
    index=False
)

# Confirm successful PostgreSQL table creation.
print("Enhanced district school counts fact table loaded successfully into PostgreSQL.")

Enhanced district school counts fact table loaded successfully into PostgreSQL.

Verifying the Enhanced Fact Table in PostgreSQL

To confirm the enhanced fact table loaded correctly, I query PostgreSQL and display the warehouse-ready records with surrogate state keys.

# Query PostgreSQL to display the enhanced fact table.
enhanced_fact_query = """
SELECT *
FROM public.fact_district_school_counts_enhanced
LIMIT 10;
"""

# Display the enhanced PostgreSQL fact table.
pd.read_sql(enhanced_fact_query, engine)

	district_id	state_key	state_code	operational_school_count
0	100002	1	AL	0
1	100005	1	AL	7
2	100006	1	AL	17
3	100007	1	AL	18
4	100008	1	AL	13
5	100009	1	AL	5
6	100010	1	AL	0
7	100011	1	AL	4
8	100012	1	AL	5
9	100013	1	AL	5

Analyzing Operational Schools Using the Enhanced Warehouse Model

To validate the enhanced warehouse model, I run an analytical SQL query using the enhanced fact table and the state dimension table.

This confirms that the surrogate-key relationship supports state-level aggregation correctly.

# SQL query using the enhanced fact table and state surrogate key relationship.
enhanced_state_analysis_query = read_sql_file("state_school_counts_with_surrogate_keys.sql")
# Display the top states using the enhanced warehouse model.
pd.read_sql(enhanced_state_analysis_query, engine)

	state_name	total_operational_schools
0	CALIFORNIA	10303.0
1	TEXAS	9192.0
2	NEW YORK	4810.0
3	ILLINOIS	4398.0
4	FLORIDA	4186.0
5	OHIO	3543.0
6	MICHIGAN	3468.0
7	PENNSYLVANIA	2932.0
8	MINNESOTA	2741.0
9	NORTH CAROLINA	2705.0

Visualizing Operational Schools Using the Enhanced Warehouse Model

To make the enhanced warehouse analysis easier to interpret, I create a visualization using the surrogate-key fact table and state dimension relationship.

# Store the enhanced SQL analysis result in a DataFrame.
enhanced_state_analysis_df = pd.read_sql(enhanced_state_analysis_query, engine)

# Create a horizontal bar chart.
plt.figure(figsize=(10, 6))
plt.barh(
    enhanced_state_analysis_df["state_name"],
    enhanced_state_analysis_df["total_operational_schools"]
)

# Display the highest values at the top.
plt.gca().invert_yaxis()

# Add chart labels and title.
plt.title("Top 10 States by Operational Schools - Enhanced Warehouse Model")
plt.xlabel("Total Operational Schools")
plt.ylabel("State")

# Improve chart spacing.
plt.subplots_adjust(left=0.28)

# Display the visualization.
plt.show()

Exploring the NAEP Dataset Structure

Before integrating NAEP data into the warehouse, I first inspect the structure of the raw NAEP files.

This helps identify available columns, reporting formats, and educational metrics across years.

# Load the NAEP dataset using the correct header row.
naep_clean_df = pd.read_excel(
    "../data/raw/naep/naep_math_grade8_state_2009_2024.Xls",
    header=8
)

# Display the cleaned NAEP dataset.
naep_clean_df.head()

	Year	Jurisdiction	All students	Average scale score
0	2024	National	All students	273.833456
1	2024	Alabama	All students	261.769896
2	2024	Alaska	All students	263.962653
3	2024	Arizona	All students	269.656164
4	2024	Arkansas	All students	266.194834

Cleaning and Standardizing the NAEP Dataset

To prepare the NAEP dataset for warehouse integration, I standardize column names and select the core analytical fields.

# Standardize NAEP column names.
naep_clean_df.columns = [
    "reporting_year",
    "jurisdiction",
    "student_group",
    "average_scale_score"
]

# Keep only the required analytical columns.
naep_clean_df = naep_clean_df[
    [
        "reporting_year",
        "jurisdiction",
        "student_group",
        "average_scale_score"
    ]
]

# Display the cleaned NAEP dataset.
naep_clean_df.head()

	reporting_year	jurisdiction	student_group	average_scale_score
0	2024	National	All students	273.833456
1	2024	Alabama	All students	261.769896
2	2024	Alaska	All students	263.962653
3	2024	Arizona	All students	269.656164
4	2024	Arkansas	All students	266.194834

Validating the NAEP Dataset

Before integrating NAEP data into the warehouse, I perform several validation checks to confirm data quality and completeness.

# Count missing values in the NAEP dataset.
naep_missing_values = naep_clean_df.isnull().sum()

# Count total NAEP records.
naep_total_records = len(naep_clean_df)

# Display validation results.
naep_validation_results = pd.DataFrame({
    "validation_check": [
        "Total NAEP records",
        "Missing reporting years",
        "Missing jurisdictions",
        "Missing student groups",
        "Missing average scale scores"
    ],
    "result": [
        naep_total_records,
        naep_missing_values["reporting_year"],
        naep_missing_values["jurisdiction"],
        naep_missing_values["student_group"],
        naep_missing_values["average_scale_score"]
    ]
})

# Display NAEP validation results.
naep_validation_results

	validation_check	result
0	Total NAEP records	490
1	Missing reporting years	1
2	Missing jurisdictions	4
3	Missing student groups	4
4	Missing average scale scores	4

Removing Invalid NAEP Records

To improve warehouse data quality, I remove incomplete NAEP records containing missing analytical values.

# Remove rows with missing critical analytical values.
naep_clean_df = naep_clean_df.dropna(
    subset=[
        "reporting_year",
        "jurisdiction",
        "student_group",
        "average_scale_score"
    ]
)

# Reset the index after cleaning.
naep_clean_df = naep_clean_df.reset_index(drop=True)

# Display the cleaned NAEP dataset.
naep_clean_df.head()

	reporting_year	jurisdiction	student_group	average_scale_score
0	2024	National	All students	273.833456
1	2024	Alabama	All students	261.769896
2	2024	Alaska	All students	263.962653
3	2024	Arizona	All students	269.656164
4	2024	Arkansas	All students	266.194834

Connecting NAEP Records to the Year Dimension

To prepare NAEP data for warehouse integration, I connect each NAEP record to the year dimension using the reporting year.

# Merge NAEP records with the year dimension to add year_key.
naep_with_year_key = naep_clean_df.merge(
    df_year_dim,
    on="reporting_year",
    how="left"
)

# Display NAEP records with year dimension keys.
naep_with_year_key.head()

	reporting_year	jurisdiction	student_group	average_scale_score	year_key
0	2024	National	All students	273.833456	8.0
1	2024	Alabama	All students	261.769896	8.0
2	2024	Alaska	All students	263.962653	8.0
3	2024	Arizona	All students	269.656164	8.0
4	2024	Arkansas	All students	266.194834	8.0

Connecting NAEP Records to the State Dimension

To integrate NAEP educational outcomes into the warehouse model, I connect NAEP jurisdictions to the state dimension table.

# Merge NAEP records with the state dimension table.
naep_warehouse_df = naep_with_year_key.merge(
    df_states_dim[
        ["state_key", "state_name"]
    ],
    left_on="jurisdiction",
    right_on="state_name",
    how="left"
)

# Display NAEP records connected to warehouse state keys.
naep_warehouse_df.head()

	reporting_year	jurisdiction	student_group	average_scale_score	year_key	state_key	state_name
0	2024	National	All students	273.833456	8.0	NaN	NaN
1	2024	Alabama	All students	261.769896	8.0	NaN	NaN
2	2024	Alaska	All students	263.962653	8.0	NaN	NaN
3	2024	Arizona	All students	269.656164	8.0	NaN	NaN
4	2024	Arkansas	All students	266.194834	8.0	NaN	NaN

Standardizing NAEP Jurisdiction Names for Warehouse Joins

Before joining NAEP records to the state dimension, I standardize jurisdiction names so they match the uppercase format used in the warehouse dimension table.

# Create a standardized uppercase jurisdiction field for warehouse joins.
naep_with_year_key["state_name"] = naep_with_year_key["jurisdiction"].str.upper()

# Merge NAEP records with the state dimension table using standardized state names.
naep_warehouse_df = naep_with_year_key.merge(
    df_states_dim[
        ["state_key", "state_name"]
    ],
    on="state_name",
    how="left"
)

# Display NAEP records connected to warehouse state keys.
naep_warehouse_df.head()

	reporting_year	jurisdiction	student_group	average_scale_score	year_key	state_name	state_key
0	2024	National	All students	273.833456	8.0	NATIONAL	NaN
1	2024	Alabama	All students	261.769896	8.0	ALABAMA	1.0
2	2024	Alaska	All students	263.962653	8.0	ALASKA	2.0
3	2024	Arizona	All students	269.656164	8.0	ARIZONA	4.0
4	2024	Arkansas	All students	266.194834	8.0	ARKANSAS	5.0

Creating the NAEP Math Performance Fact Table

After connecting NAEP records to the year and state dimensions, I create a warehouse-ready fact table for Grade 8 mathematics performance.

This table stores state-level NAEP math scores by reporting year and connects to the warehouse using surrogate keys.

# Create a NAEP math performance fact table using warehouse surrogate keys.
fact_naep_math_performance = naep_warehouse_df[
    [
        "state_key",
        "year_key",
        "jurisdiction",
        "student_group",
        "average_scale_score"
    ]
].copy()

# Remove records that did not match the state dimension.
# This excludes national aggregate rows and other non-state records.
fact_naep_math_performance = fact_naep_math_performance.dropna(
    subset=[
        "state_key",
        "year_key",
        "average_scale_score"
    ]
)

# Convert surrogate keys to integers after removing missing values.
fact_naep_math_performance["state_key"] = fact_naep_math_performance["state_key"].astype(int)
fact_naep_math_performance["year_key"] = fact_naep_math_performance["year_key"].astype(int)

# Display the NAEP math performance fact table.
fact_naep_math_performance.head()

	state_key	year_key	jurisdiction	student_group	average_scale_score
1	1	8	Alabama	All students	261.769896
2	2	8	Alaska	All students	263.962653
3	4	8	Arizona	All students	269.656164
4	5	8	Arkansas	All students	266.194834
5	7	8	California	All students	268.779818

Loading the NAEP Math Performance Fact Table into PostgreSQL

After creating the NAEP math performance fact table, I load it into PostgreSQL so it can be queried with SQL and used for longitudinal educational outcome analysis.

# Load the NAEP math performance fact table into PostgreSQL.
fact_naep_math_performance.to_sql(
    name="fact_naep_math_performance",
    con=engine,
    schema="public",
    if_exists="replace",
    index=False
)

# Confirm successful PostgreSQL table creation.
print("NAEP math performance fact table loaded successfully into PostgreSQL.")

NAEP math performance fact table loaded successfully into PostgreSQL.

Verifying the NAEP Math Performance Fact Table in PostgreSQL

To confirm the NAEP math performance fact table loaded correctly, I query PostgreSQL and preview the stored educational outcome records.

# Query PostgreSQL to preview the NAEP math performance fact table.
naep_math_preview_query = """
SELECT *
FROM public.fact_naep_math_performance
LIMIT 10;
"""

# Display the NAEP math performance records.
pd.read_sql(naep_math_preview_query, engine)

	state_key	year_key	jurisdiction	student_group	average_scale_score
0	1	8	Alabama	All students	261.769896296059
1	2	8	Alaska	All students	263.962652757225
2	4	8	Arizona	All students	269.656164092934
3	5	8	Arkansas	All students	266.194833728906
4	7	8	California	All students	268.779817602202
5	8	8	Colorado	All students	278.107689530909
6	9	8	Connecticut	All students	276.674248483695
7	10	8	Delaware	All students	263.057137020663
8	11	8	District of Columbia	All students	261.538791377155
9	12	8	Florida	All students	267.242907869627

Analyzing Top States by NAEP Grade 8 Math Performance

To begin analyzing educational outcomes, I query PostgreSQL to identify the top 10 states by Grade 8 NAEP mathematics average scale score.

# SQL query to identify the top 10 states by NAEP Grade 8 math performance.
top_naep_math_states_query = read_sql_file("naep_top_states_2024.sql")
# Display the top 10 states by NAEP Grade 8 math score.
pd.read_sql(top_naep_math_states_query, engine)

	state_name	reporting_year	average_scale_score
0	MASSACHUSETTS	2024	283.465558985794
1	WISCONSIN	2024	282.649633523366
2	MINNESOTA	2024	282.087420612381
3	UTAH	2024	281.79013940125
4	NEW JERSEY	2024	281.684697993032
5	SOUTH DAKOTA	2024	281.101137171665
6	NEBRASKA	2024	279.8834957478
7	NORTH DAKOTA	2024	279.765445425712
8	NEW HAMPSHIRE	2024	279.7446422881
9	MONTANA	2024	279.082095551632

Visualizing Top NAEP Grade 8 Math Performance States

To better interpret educational outcome performance, I create a visualization showing the top 10 states by 2024 NAEP Grade 8 mathematics average scale score.

# Store the SQL query result in a DataFrame.
top_naep_math_states_df = pd.read_sql(
    top_naep_math_states_query,
    engine
)

# Convert scores to numeric values and round for readability.
top_naep_math_states_df["average_scale_score"] = pd.to_numeric(
    top_naep_math_states_df["average_scale_score"]
).round(2)

# Sort values so the highest-performing states appear at the top.
top_naep_math_states_df = top_naep_math_states_df.sort_values(
    by="average_scale_score",
    ascending=True
)

# Set a focused baseline so differences between close NAEP scores are visible.
score_baseline = 260

# Create the chart.
plt.figure(figsize=(14, 7))

plt.barh(
    top_naep_math_states_df["state_name"],
    top_naep_math_states_df["average_scale_score"] - score_baseline,
    left=score_baseline
)

# Add score labels at the end of each bar.
for index, value in enumerate(top_naep_math_states_df["average_scale_score"]):
    plt.text(
        value + 0.15,
        index,
        f"{value:.2f}",
        va="center"
    )

# Force the state names to display on the y-axis.
plt.yticks(
    range(len(top_naep_math_states_df)),
    top_naep_math_states_df["state_name"]
)

# Focus the x-axis range for better comparison visibility.
plt.xlim(278, 285)

# Add chart title and labels.
plt.title("Top 10 States by 2024 NAEP Grade 8 Math Scores")
plt.xlabel("Average Scale Score")
plt.ylabel("State")

# Add enough left margin so state names do not get clipped.
plt.subplots_adjust(left=0.28)

# Display chart.
plt.show()

Analyzing Lowest States by NAEP Grade 8 Math Performance

To better understand educational performance variation across states, I also analyze the lowest-performing states by 2024 NAEP Grade 8 mathematics average scale score.

# SQL query to identify the lowest 10 states by NAEP Grade 8 math performance.
lowest_naep_math_states_query = read_sql_file("naep_lowest_states_2024.sql")
# Display the lowest-performing states by NAEP Grade 8 math score.
pd.read_sql(lowest_naep_math_states_query, engine)

	state_name	reporting_year	average_scale_score
0	PUERTO RICO	2024	216.087040632243
1	NEW MEXICO	2024	256.210854081077
2	WEST VIRGINIA	2024	260.769078185284
3	DISTRICT OF COLUMBIA	2024	261.538791377155
4	ALABAMA	2024	261.769896296059
5	DELAWARE	2024	263.057137020663
6	ALASKA	2024	263.962652757225
7	OKLAHOMA	2024	264.454031154883
8	NEVADA	2024	265.177161678326
9	ARKANSAS	2024	266.194833728906

Visualizing Lowest NAEP Grade 8 Math Performance States

To better understand lower-performing educational outcome trends, I create a visualization showing the lowest-performing states by 2024 NAEP Grade 8 mathematics average scale score. Puerto Rico appears in the NAEP source data but is excluded from this state-level visualization because the chart focuses on U.S. states only.

# Store the SQL query result in a DataFrame.
lowest_naep_math_states_df = pd.read_sql(
    lowest_naep_math_states_query,
    engine
)

# Convert scores to numeric values and round for readability.
lowest_naep_math_states_df["average_scale_score"] = pd.to_numeric(
    lowest_naep_math_states_df["average_scale_score"]
).round(2)

# Remove Puerto Rico so the visualization scale remains readable.
lowest_naep_math_states_df = lowest_naep_math_states_df[
    lowest_naep_math_states_df["state_name"] != "PUERTO RICO"
]

# Sort values for visualization.
lowest_naep_math_states_df = lowest_naep_math_states_df.sort_values(
    by="average_scale_score",
    ascending=True
)

# Create the chart.
plt.figure(figsize=(14, 7))

plt.barh(
    lowest_naep_math_states_df["state_name"],
    lowest_naep_math_states_df["average_scale_score"]
)

# Add score labels.
for index, value in enumerate(lowest_naep_math_states_df["average_scale_score"]):
    plt.text(
        value + 0.1,
        index,
        f"{value:.2f}",
        va="center"
    )

# Focus the x-axis range.
plt.xlim(255, 267)

# Add chart labels and title.
plt.title("Lowest Performing States by 2024 NAEP Grade 8 Math Scores")
plt.xlabel("Average Scale Score")
plt.ylabel("State")

# Improve spacing so labels fit properly.
plt.subplots_adjust(left=0.28)

# Display chart.
plt.show()

Analyzing NAEP Grade 8 Math Trends Over Time

To begin longitudinal analysis, I query PostgreSQL to calculate the national trend in NAEP Grade 8 mathematics average scale scores across reporting years.

# SQL query to analyze the national NAEP Grade 8 math trend over time.
naep_national_trend_query = read_sql_file("naep_national_trend.sql")
# Display the national NAEP math performance trend.
pd.read_sql(naep_national_trend_query, engine)

	reporting_year	average_scale_score
0	2009	298.854346618482
1	2011	298.512431105164
2	2013	300.568235011147
3	2015	296.908571436848
4	2017	297.0422252108
5	2019	294.474574539442
6	2022	283.587656613142
7	2024	283.465558985794

Visualizing NAEP Grade 8 Math Trends Over Time

To better understand longitudinal educational performance patterns, I create a time-series visualization showing Massachusetts NAEP Grade 8 mathematics performance trends across reporting years.

# Store the SQL query result in a DataFrame.
naep_trend_df = pd.read_sql(
    naep_national_trend_query,
    engine
)

# Convert values to numeric types.
naep_trend_df["reporting_year"] = pd.to_numeric(
    naep_trend_df["reporting_year"]
)

naep_trend_df["average_scale_score"] = pd.to_numeric(
    naep_trend_df["average_scale_score"]
).round(2)

# Create the visualization.
plt.figure(figsize=(12, 6))

plt.plot(
    naep_trend_df["reporting_year"],
    naep_trend_df["average_scale_score"],
    marker="o",
    linewidth=3
)

# Add labels to each data point.
for x, y in zip(
    naep_trend_df["reporting_year"],
    naep_trend_df["average_scale_score"]
):
    plt.text(
        x,
        y + 0.4,
        f"{y:.2f}",
        ha="center"
    )

# Add chart title and labels.
plt.title("Massachusetts NAEP Grade 8 Math Trends Over Time")
plt.xlabel("Reporting Year")
plt.ylabel("Average Scale Score")

# Improve readability.
plt.grid(True)

# Display chart.
plt.show()

Educational Trend Interpretation

The Massachusetts NAEP Grade 8 mathematics trend analysis shows relatively stable high performance from 2009 through 2019, followed by a substantial decline beginning in 2022.

Several potential contributing factors may explain this pattern:

Post-pandemic educational disruption
Remote learning transition challenges
Unequal access to educational technology resources
Student engagement and learning loss
Changes in instructional delivery models

This longitudinal trend demonstrates how educational outcome analysis can be integrated into a data warehouse environment to support policy evaluation, educational research, and technology impact assessment initiatives.

Comparing Longitudinal NAEP Trends Across Multiple States

To better understand variation in educational outcomes across states, I compare long-term NAEP Grade 8 mathematics trends for several high-performing states.

# SQL query to compare longitudinal trends across multiple states.
multi_state_trend_query = read_sql_file("naep_multi_state_trend.sql")
# Display the multi-state trend analysis dataset.
pd.read_sql(multi_state_trend_query, engine)

	state_name	reporting_year	average_scale_score
0	MASSACHUSETTS	2009	298.854346618482
1	MASSACHUSETTS	2011	298.512431105164
2	MASSACHUSETTS	2013	300.568235011147
3	MASSACHUSETTS	2015	296.908571436848
4	MASSACHUSETTS	2017	297.0422252108
5	MASSACHUSETTS	2019	294.474574539442
6	MASSACHUSETTS	2022	283.587656613142
7	MASSACHUSETTS	2024	283.465558985794
8	MINNESOTA	2009	294.443320106986
9	MINNESOTA	2011	294.946422810345
10	MINNESOTA	2013	294.592972603507
11	MINNESOTA	2015	294.147777973073
12	MINNESOTA	2017	293.962375743927
13	MINNESOTA	2019	290.785141202164
14	MINNESOTA	2022	280.066020971965
15	MINNESOTA	2024	282.087420612381
16	NEW JERSEY	2009	292.657227109585
17	NEW JERSEY	2011	294.138771366004
18	NEW JERSEY	2013	296.053351131027
19	NEW JERSEY	2015	293.36593817877
20	NEW JERSEY	2017	291.698552399307
21	NEW JERSEY	2019	291.822771925628
22	NEW JERSEY	2022	280.886098306047
23	NEW JERSEY	2024	281.684697993032
24	UTAH	2009	284.068251870423
25	UTAH	2011	283.308332086503
26	UTAH	2013	284.331486246838
27	UTAH	2015	286.116911188659
28	UTAH	2017	286.816559110458
29	UTAH	2019	284.930389419443
30	UTAH	2022	282.211027508016
31	UTAH	2024	281.79013940125

Visualizing Multi-State NAEP Grade 8 Math Trends

To compare educational performance patterns over time, I create a line chart showing NAEP Grade 8 mathematics trends across multiple high-performing states.

# Store the multi-state SQL query result in a DataFrame.
multi_state_trend_df = pd.read_sql(
    multi_state_trend_query,
    engine
)

# Convert columns to numeric values for plotting.
multi_state_trend_df["reporting_year"] = pd.to_numeric(
    multi_state_trend_df["reporting_year"]
)

multi_state_trend_df["average_scale_score"] = pd.to_numeric(
    multi_state_trend_df["average_scale_score"]
).round(2)

# Create the multi-state trend visualization.
plt.figure(figsize=(12, 6))

for state_name in multi_state_trend_df["state_name"].unique():
    state_data = multi_state_trend_df[
        multi_state_trend_df["state_name"] == state_name
    ]

    plt.plot(
        state_data["reporting_year"],
        state_data["average_scale_score"],
        marker="o",
        linewidth=2,
        label=state_name
    )

# Add chart title and labels.
plt.title("NAEP Grade 8 Math Trends Across Selected States")
plt.xlabel("Reporting Year")
plt.ylabel("Average Scale Score")

# Add legend and grid.
plt.legend()
plt.grid(True)

# Improve spacing.
plt.tight_layout()

# Display chart.
plt.show()

Comparative Educational Trend Interpretation

The multi-state NAEP trend analysis reveals several important educational performance patterns across high-performing states.

Key observations include:

Massachusetts consistently maintains the highest overall mathematics performance levels.
Multiple states experienced substantial score declines beginning after 2019.
The post-pandemic educational disruption appears visible across nearly all analyzed states.
Some states demonstrate stronger recovery stabilization between 2022 and 2024.
Longitudinal educational outcome analysis provides valuable insight into statewide instructional resilience and educational system performance.

This comparative analysis demonstrates how dimensional warehouse modeling and longitudinal analytics can support large-scale educational research and policy evaluation initiatives.

Analyzing Post-Pandemic NAEP Math Score Changes

To quantify the post-pandemic change in educational outcomes, I compare each selected state’s 2019 NAEP Grade 8 mathematics score against its 2024 score.

# SQL query to compare 2019 and 2024 NAEP Grade 8 math scores.
post_pandemic_change_query = read_sql_file("naep_post_pandemic_delta.sql")
# Display the post-pandemic score change results.
pd.read_sql(post_pandemic_change_query, engine)

	state_name	score_2019	score_2024	score_change
0	MASSACHUSETTS	294.474575	283.465559	-11.009016
1	NEW JERSEY	291.822772	281.684698	-10.138074
2	MINNESOTA	290.785141	282.087421	-8.697721
3	UTAH	284.930389	281.790139	-3.140250

Data Extraction — CRDC LEA Characteristics Dataset

To begin integrating education technology and district-level characteristics into the warehouse, I load the CRDC LEA Characteristics dataset.

This dataset will help support future analysis of district characteristics, technology access indicators, and educational equity patterns.

# Define the relative path to the CRDC LEA Characteristics dataset.
crdc_file_path = "../data/raw/crdc/LEA Characteristics.csv"

# Load the CRDC CSV file into a pandas DataFrame.
df_crdc = pd.read_csv(crdc_file_path)

# Display the first five records to confirm the dataset loaded correctly.
df_crdc.head()

C:\Users\User\AppData\Local\Temp\ipykernel_21316\130108529.py:5: DtypeWarning: Columns (0: LEAID) have mixed types. Specify dtype option on import or set low_memory=False.
  df_crdc = pd.read_csv(crdc_file_path)

	LEA_STATE	LEA_STATE_NAME	LEAID	LEA_NAME	LEA_ADDRESS	LEA_CITY	LEA_ZIP	CJJ	LEA_CRCOORD_SEX_IND	LEA_CRCOORD_RAC_IND	...	LEA_HBPOLICY_IND	LEA_HBPOLICYURL_IND	LEA_HBPOLICY_URL	LEA_ENR	LEA_PSENR_A3	LEA_PSENR_A4	LEA_PSENR_A5	LEA_ENR_NONLEAFAC	LEA_SCHOOLS	LEA_PS_IND
0	MA	MASSACHUSETTS	2506090	Hingham	220 Central Street	Hingham	2043	No	Yes	Yes	...	Yes	Yes	https://z2policy.ctspublish.com/masc/browse/hi...	3912	0	0	24	38	6	Yes
1	MI	MICHIGAN	2627210	Owosso Public Schools	645 Alger Ave	Owosso	48867	No	Yes	Yes	...	Yes	Yes	owosso.k12.mi.us	2994	34	41	0	0	7	Yes
2	TX	TEXAS	4800278	CROSSTIMBERS ACADEMY	236 HARMONY RD	WEATHERFORD	76087	No	Yes	Yes	...	Yes	Yes	https://ctacharter.com/	149	-9	-9	-9	0	1	No
3	OK	OKLAHOMA	4012390	GANS	204 N STACY	GANS	74936	No	Yes	Yes	...	Yes	Yes	www.gans.k12.ok.us	351	16	22	30	0	2	Yes
4	CA	CALIFORNIA	0619230	Junction Elementary	98821 Highway 96	Somes Bar	95568	No	Yes	Yes	...	Yes	No	-9	16	-9	-9	-9	0	1	No

5 rows × 40 columns

Initial Data Inspection — CRDC Dataset

In this step, I inspect the CRDC dataset to understand its size, structure, and available fields before selecting columns for warehouse integration.

# Display the number of rows and columns in the CRDC dataset.
crdc_row_count, crdc_column_count = df_crdc.shape

print(f"Number of CRDC rows: {crdc_row_count}")
print(f"Number of CRDC columns: {crdc_column_count}")

# Display all CRDC column names to identify fields useful for district characteristics and technology access analysis.
df_crdc.columns.tolist()

Number of CRDC rows: 17704
Number of CRDC columns: 40

['LEA_STATE',
 'LEA_STATE_NAME',
 'LEAID',
 'LEA_NAME',
 'LEA_ADDRESS',
 'LEA_CITY',
 'LEA_ZIP',
 'CJJ',
 'LEA_CRCOORD_SEX_IND',
 'LEA_CRCOORD_RAC_IND',
 'LEA_CRCOORD_DIS_IND',
 'LEA_CRCOORD_SEX_FN',
 'XLEA_CRCOORD_SEX_FN',
 'LEA_CRCOORD_SEX_LN',
 'XLEA_CRCOORD_SEX_LN',
 'LEA_CRCOORD_SEX_EM',
 'XLEA_CRCOORD_SEX_EM',
 'LEA_CRCOORD_RAC_FN',
 'XLEA_CRCOORD_RAC_FN',
 'LEA_CRCOORD_RAC_LN',
 'XLEA_CRCOORD_RAC_LN',
 'LEA_CRCOORD_RAC_EM',
 'XLEA_CRCOORD_RAC_EM',
 'LEA_CRCOORD_DIS_FN',
 'XLEA_CRCOORD_DIS_FN',
 'LEA_CRCOORD_DIS_LN',
 'XLEA_CRCOORD_DIS_LN',
 'LEA_CRCOORD_DIS_EM',
 'XLEA_CRCOORD_DIS_EM',
 'LEA_DESEGPLAN',
 'LEA_HBPOLICY_IND',
 'LEA_HBPOLICYURL_IND',
 'LEA_HBPOLICY_URL',
 'LEA_ENR',
 'LEA_PSENR_A3',
 'LEA_PSENR_A4',
 'LEA_PSENR_A5',
 'LEA_ENR_NONLEAFAC',
 'LEA_SCHOOLS',
 'LEA_PS_IND']

Selecting CRDC District Characteristic Columns

After inspecting the CRDC dataset, I select district-level fields that can support warehouse integration and future educational equity analysis.

# Select CRDC district characteristic columns for warehouse preparation.
crdc_columns = [
    "LEAID",
    "LEA_NAME",
    "LEA_STATE",
    "LEA_STATE_NAME",
    "LEA_CITY",
    "LEA_ENR",
    "LEA_SCHOOLS",
    "LEA_DESEGPLAN",
    "LEA_HBPOLICY_IND",
    "LEA_PS_IND"
]

# Create a district-focused CRDC DataFrame.
df_crdc_districts = df_crdc[crdc_columns].copy()

# Display the selected CRDC district characteristic fields.
df_crdc_districts.head()

	LEAID	LEA_NAME	LEA_STATE	LEA_STATE_NAME	LEA_CITY	LEA_ENR	LEA_SCHOOLS	LEA_DESEGPLAN	LEA_HBPOLICY_IND	LEA_PS_IND
0	2506090	Hingham	MA	MASSACHUSETTS	Hingham	3912	6	No	Yes	Yes
1	2627210	Owosso Public Schools	MI	MICHIGAN	Owosso	2994	7	No	Yes	Yes
2	4800278	CROSSTIMBERS ACADEMY	TX	TEXAS	WEATHERFORD	149	1	No	Yes	No
3	4012390	GANS	OK	OKLAHOMA	GANS	351	2	No	Yes	Yes
4	0619230	Junction Elementary	CA	CALIFORNIA	Somes Bar	16	1	No	Yes	No

Cleaning the CRDC District Characteristics Dataset

Before loading CRDC district characteristics into PostgreSQL, I standardize column names and remove duplicate district records.

# Create a copy of the selected CRDC district fields.
df_crdc_district_dim = df_crdc_districts.copy()

# Remove duplicate district records using the LEAID identifier.
df_crdc_district_dim = df_crdc_district_dim.drop_duplicates(subset=["LEAID"])

# Standardize column names for PostgreSQL compatibility.
df_crdc_district_dim.columns = [
    "district_id",
    "district_name",
    "state_code",
    "state_name",
    "city",
    "enrollment",
    "school_count",
    "desegregation_plan_indicator",
    "harassment_bullying_policy_indicator",
    "preschool_indicator"
]

# Reset the index after cleaning.
df_crdc_district_dim = df_crdc_district_dim.reset_index(drop=True)

# Display the cleaned CRDC district characteristics dataset.
df_crdc_district_dim.head()

	district_id	district_name	state_code	state_name	city	enrollment	school_count	desegregation_plan_indicator	harassment_bullying_policy_indicator	preschool_indicator
0	2506090	Hingham	MA	MASSACHUSETTS	Hingham	3912	6	No	Yes	Yes
1	2627210	Owosso Public Schools	MI	MICHIGAN	Owosso	2994	7	No	Yes	Yes
2	4800278	CROSSTIMBERS ACADEMY	TX	TEXAS	WEATHERFORD	149	1	No	Yes	No
3	4012390	GANS	OK	OKLAHOMA	GANS	351	2	No	Yes	Yes
4	0619230	Junction Elementary	CA	CALIFORNIA	Somes Bar	16	1	No	Yes	No

Loading CRDC District Characteristics into PostgreSQL

After cleaning the CRDC district characteristics dataset, I load it into PostgreSQL as a warehouse table for future district-level enrichment analysis.

# Load the cleaned CRDC district characteristics table into PostgreSQL.
df_crdc_district_dim.to_sql(
    name="dim_crdc_district_characteristics",
    con=engine,
    schema="public",
    if_exists="replace",
    index=False
)

# Confirm successful PostgreSQL table creation.
print("CRDC district characteristics table loaded successfully into PostgreSQL.")

CRDC district characteristics table loaded successfully into PostgreSQL.

Verifying the CRDC District Characteristics Table in PostgreSQL

To confirm the CRDC district characteristics table loaded correctly, I query PostgreSQL and preview the stored records.

# Query PostgreSQL to preview the CRDC district characteristics table.
crdc_preview_query = """
SELECT *
FROM public.dim_crdc_district_characteristics
LIMIT 10;
"""

# Display the CRDC district characteristics records.
pd.read_sql(crdc_preview_query, engine)

	district_id	district_name	state_code	state_name	city	enrollment	school_count	desegregation_plan_indicator	harassment_bullying_policy_indicator	preschool_indicator
0	2506090	Hingham	MA	MASSACHUSETTS	Hingham	3912	6	No	Yes	Yes
1	2627210	Owosso Public Schools	MI	MICHIGAN	Owosso	2994	7	No	Yes	Yes
2	4800278	CROSSTIMBERS ACADEMY	TX	TEXAS	WEATHERFORD	149	1	No	Yes	No
3	4012390	GANS	OK	OKLAHOMA	GANS	351	2	No	Yes	Yes
4	0619230	Junction Elementary	CA	CALIFORNIA	Somes Bar	16	1	No	Yes	No
5	2913290	GREEN RIDGE R-VIII	MO	MISSOURI	GREEN RIDGE	350	2	No	Yes	No
6	3000004	Mountain View Elem	MT	MONTANA	Cut Bank	11	1	No	Yes	No
7	0805940	Otis School District No. R-3	CO	COLORADO	OTIS	211	2	No	Yes	Yes
8	0602174	iLead Agua Dulce District	CA	CALIFORNIA	Agua Dulce	320	1	No	Yes	No
9	0624600	Merced City Elementary	CA	CALIFORNIA	Merced	10922	19	No	Yes	Yes

Validating the CRDC District Characteristics Dataset

Before using the CRDC district characteristics data for analysis, I perform validation checks to confirm completeness and uniqueness.

# Count missing district identifiers.
crdc_missing_district_ids = df_crdc_district_dim["district_id"].isnull().sum()

# Count missing state names.
crdc_missing_state_names = df_crdc_district_dim["state_name"].isnull().sum()

# Count missing enrollment values.
crdc_missing_enrollment = df_crdc_district_dim["enrollment"].isnull().sum()

# Count duplicate district identifiers.
crdc_duplicate_district_ids = df_crdc_district_dim["district_id"].duplicated().sum()

# Display validation results.
crdc_validation_results = pd.DataFrame({
    "validation_check": [
        "Missing district IDs",
        "Missing state names",
        "Missing enrollment values",
        "Duplicate district IDs"
    ],
    "result": [
        crdc_missing_district_ids,
        crdc_missing_state_names,
        crdc_missing_enrollment,
        crdc_duplicate_district_ids
    ]
})

crdc_validation_results

	validation_check	result
0	Missing district IDs	0
1	Missing state names	0
2	Missing enrollment values	0
3	Duplicate district IDs	0

Analyzing CRDC District Enrollment by State

To begin using the CRDC district characteristics table analytically, I summarize total district enrollment by state.

# SQL query to summarize total CRDC enrollment by state.
crdc_enrollment_by_state_query = read_sql_file("crdc_enrollment_by_state.sql")
# Display the top states by total CRDC district enrollment.
pd.read_sql(crdc_enrollment_by_state_query, engine)

	state_name	total_enrollment
0	CALIFORNIA	5958775.0
1	TEXAS	5419285.0
2	FLORIDA	2961487.0
3	NEW YORK	2745258.0
4	ILLINOIS	1882620.0
5	GEORGIA	1751403.0
6	OHIO	1717635.0
7	PENNSYLVANIA	1707999.0
8	NORTH CAROLINA	1537968.0
9	MICHIGAN	1441417.0

Visualizing CRDC District Enrollment by State

To better understand district enrollment distribution, I create a visualization showing the states with the highest total CRDC district enrollment.

# Store the CRDC enrollment query results in a DataFrame.
crdc_enrollment_df = pd.read_sql(
    crdc_enrollment_by_state_query,
    engine
)

# Convert enrollment values to numeric.
crdc_enrollment_df["total_enrollment"] = pd.to_numeric(
    crdc_enrollment_df["total_enrollment"]
)

# Sort values for visualization.
crdc_enrollment_df = crdc_enrollment_df.sort_values(
    by="total_enrollment",
    ascending=True
)

# Create the chart.
plt.figure(figsize=(12, 6))

plt.barh(
    crdc_enrollment_df["state_name"],
    crdc_enrollment_df["total_enrollment"]
)

# Add chart title and labels.
plt.title("Top States by Total CRDC District Enrollment")
plt.xlabel("Total Enrollment")
plt.ylabel("State")

# Improve spacing.
plt.tight_layout()

# Display chart.
plt.show()

Preparing CRDC and NAEP State-Level Comparison Data

To begin connecting district characteristics with educational outcomes, I aggregate CRDC enrollment data to the state level and prepare it for comparison with NAEP state performance results.

# SQL query to aggregate CRDC district enrollment to the state level.
crdc_state_enrollment_query = """
SELECT
    state_name,
    SUM(CAST(enrollment AS NUMERIC)) AS total_enrollment
FROM public.dim_crdc_district_characteristics
GROUP BY state_name
ORDER BY state_name;
"""

# Display state-level CRDC enrollment totals.
pd.read_sql(crdc_state_enrollment_query, engine)

	state_name	total_enrollment
0	ALABAMA	738221.0
1	ALASKA	129730.0
2	ARIZONA	1127982.0
3	ARKANSAS	490007.0
4	CALIFORNIA	5958775.0
5	COLORADO	886633.0
6	CONNECTICUT	510407.0
7	DELAWARE	140631.0
8	DISTRICT OF COLUMBIA	89300.0
9	FLORIDA	2961487.0
10	GEORGIA	1751403.0
11	HAWAII	173178.0
12	IDAHO	315523.0
13	ILLINOIS	1882620.0
14	INDIANA	1038088.0
15	IOWA	510553.0
16	KANSAS	481676.0
17	KENTUCKY	653132.0
18	LOUISIANA	664811.0
19	MAINE	164334.0
20	MARYLAND	883833.0
21	MASSACHUSETTS	919041.0
22	MICHIGAN	1441417.0
23	MINNESOTA	890878.0
24	MISSISSIPPI	485450.0
25	MISSOURI	899203.0
26	MONTANA	160320.0
27	NEBRASKA	325764.0
28	NEVADA	486753.0
29	NEW HAMPSHIRE	171005.0
30	NEW JERSEY	1377021.0
31	NEW MEXICO	309714.0
32	NEW YORK	2745258.0
33	NORTH CAROLINA	1537968.0
34	NORTH DAKOTA	116594.0
35	OHIO	1717635.0
36	OKLAHOMA	699604.0
37	OREGON	553921.0
38	PENNSYLVANIA	1707999.0
39	PUERTO RICO	259535.0
40	RHODE ISLAND	181359.0
41	SOUTH CAROLINA	742636.0
42	SOUTH DAKOTA	141475.0
43	TENNESSEE	989550.0
44	TEXAS	5419285.0
45	UTAH	705970.0
46	VERMONT	82602.0
47	VIRGINIA	1291606.0
48	WASHINGTON	1091100.0
49	WEST VIRGINIA	248994.0
50	WISCONSIN	829306.0
51	WYOMING	92493.0

Joining CRDC Enrollment with 2024 NAEP Math Performance

To begin comparing district characteristics with educational outcomes, I join state-level CRDC enrollment totals with 2024 NAEP Grade 8 mathematics performance scores.

# SQL query to join CRDC state enrollment totals with 2024 NAEP math performance.
crdc_naep_comparison_query = read_sql_file("crdc_naep_comparison.sql")
# Display the joined CRDC and NAEP comparison dataset.
pd.read_sql(crdc_naep_comparison_query, engine)

	state_name	total_enrollment	average_scale_score
0	MASSACHUSETTS	919041.0	283.465559
1	WISCONSIN	829306.0	282.649634
2	MINNESOTA	890878.0	282.087421
3	UTAH	705970.0	281.790139
4	NEW JERSEY	1377021.0	281.684698
5	SOUTH DAKOTA	141475.0	281.101137
6	NEBRASKA	325764.0	279.883496
7	NORTH DAKOTA	116594.0	279.765445
8	NEW HAMPSHIRE	171005.0	279.744642
9	MONTANA	160320.0	279.082096
10	OHIO	1717635.0	278.812721
11	WYOMING	92493.0	278.315324
12	INDIANA	1038088.0	278.185064
13	COLORADO	886633.0	278.107690
14	IDAHO	315523.0	278.104479
15	ILLINOIS	1882620.0	277.416366
16	CONNECTICUT	510407.0	276.674248
17	PENNSYLVANIA	1707999.0	276.245915
18	TENNESSEE	989550.0	275.982988
19	NORTH CAROLINA	1537968.0	275.821518
20	VERMONT	82602.0	275.617661
21	VIRGINIA	1291606.0	274.988690
22	IOWA	510553.0	274.823931
23	KANSAS	481676.0	274.136728
24	WASHINGTON	1091100.0	273.665174
25	MAINE	164334.0	272.621883
26	NEW YORK	2745258.0	270.851724
27	KENTUCKY	653132.0	270.813300
28	MISSOURI	899203.0	270.363197
29	HAWAII	173178.0	270.041103
30	MICHIGAN	1441417.0	269.978774
31	RHODE ISLAND	181359.0	269.826789
32	ARIZONA	1127982.0	269.656164
33	TEXAS	5419285.0	269.393041
34	MISSISSIPPI	485450.0	269.060485
35	CALIFORNIA	5958775.0	268.779818
36	GEORGIA	1751403.0	268.655805
37	MARYLAND	883833.0	268.213760
38	SOUTH CAROLINA	742636.0	268.006371
39	OREGON	553921.0	267.940476
40	FLORIDA	2961487.0	267.242908
41	LOUISIANA	664811.0	266.783264
42	ARKANSAS	490007.0	266.194834
43	NEVADA	486753.0	265.177162
44	OKLAHOMA	699604.0	264.454031
45	ALASKA	129730.0	263.962653
46	DELAWARE	140631.0	263.057137
47	ALABAMA	738221.0	261.769896
48	DISTRICT OF COLUMBIA	89300.0	261.538791
49	WEST VIRGINIA	248994.0	260.769078
50	NEW MEXICO	309714.0	256.210854
51	PUERTO RICO	259535.0	216.087041

Visualizing CRDC Enrollment and NAEP Math Performance

To begin exploring relationships between district characteristics and educational outcomes, I create a scatterplot comparing total CRDC district enrollment with 2024 NAEP Grade 8 mathematics performance.

# Store the joined CRDC and NAEP comparison query result in a DataFrame.
crdc_naep_comparison_df = pd.read_sql(
    crdc_naep_comparison_query,
    engine
)

# Convert numeric fields.
crdc_naep_comparison_df["total_enrollment"] = pd.to_numeric(
    crdc_naep_comparison_df["total_enrollment"]
)

crdc_naep_comparison_df["average_scale_score"] = pd.to_numeric(
    crdc_naep_comparison_df["average_scale_score"]
).round(2)

# Select the top 10 states by enrollment.
top_enrollment_states = crdc_naep_comparison_df.nlargest(
    10,
    "total_enrollment"
)

# Sort values for visualization.
top_enrollment_states = top_enrollment_states.sort_values(
    by="total_enrollment",
    ascending=True
)

# Create the chart.
plt.figure(figsize=(12, 6))

bars = plt.barh(
    top_enrollment_states["state_name"],
    top_enrollment_states["total_enrollment"]
)

# Add NAEP score labels to bars.
for bar, score in zip(
    bars,
    top_enrollment_states["average_scale_score"]
):
    plt.text(
        bar.get_width(),
        bar.get_y() + bar.get_height() / 2,
        f" NAEP: {score}",
        va="center"
    )

# Add chart title and labels.
plt.title("Top States by Enrollment with 2024 NAEP Math Scores")
plt.xlabel("Total State Enrollment")
plt.ylabel("State")

# Add additional chart spacing for labels.
plt.xlim(
    0,
    top_enrollment_states["total_enrollment"].max() * 1.15
)

# Improve spacing.
plt.tight_layout()

# Display chart.
plt.show()

CRDC and NAEP Comparison Interpretation

This comparison connects CRDC district enrollment totals with 2024 NAEP Grade 8 mathematics performance scores.

The purpose of this step is not to prove a causal relationship, but to begin exploring whether large enrollment systems show different performance patterns than smaller state systems.

Key observations from this comparison include:

Larger enrollment states do not automatically produce higher NAEP mathematics scores.
Some high-enrollment states perform near the middle of the NAEP score range.
Smaller or mid-sized states may outperform larger systems on average scale scores.
Enrollment size alone is not enough to explain academic performance outcomes.
Future CRDC technology-access indicators will provide stronger explanatory variables for technology impact analysis.

This step establishes the first bridge between district characteristics and educational outcomes inside the warehouse.

Analyzing Enrollment Size and Educational Performance

Now I want to compare state enrollment totals against 2024 NAEP Grade 8 math scores. Two changes from the obvious approach matter here.

First, I use pd.qcut instead of pd.cut for the size buckets. With states like California and Texas at the top of the distribution, equal-width bins would put roughly two states in “Large” and everything else in “Small” — the means wouldn’t be comparable. Quantile-based bins put a roughly equal count of states in each bucket, which makes the group averages meaningful.

Second, I compute a Spearman rank correlation alongside the bucket summary. The buckets give a quick visual; the correlation gives an honest single number for whether enrollment scale and NAEP scores actually move together.

from scipy.stats import spearmanr

# Working copy of the joined CRDC enrollment + NAEP 2024 dataset.
enrollment_performance_analysis_df = crdc_naep_comparison_df.copy()

# Quantile-based size buckets: each tercile holds roughly the same number of states.
enrollment_performance_analysis_df["enrollment_tercile"] = pd.qcut(
    enrollment_performance_analysis_df["total_enrollment"],
    q=3,
    labels=["Small (bottom third)", "Medium (middle third)", "Large (top third)"]
)

# Mean score plus state count per bucket — count matters; it tells you how
# unbalanced the buckets actually are.
enrollment_performance_summary = (
    enrollment_performance_analysis_df
    .groupby("enrollment_tercile", observed=True)
    .agg(
        states_in_bucket=("state_name", "count"),
        mean_naep_score=("average_scale_score", "mean"),
    )
    .reset_index()
)

# Spearman rank correlation between enrollment and NAEP score.
# Rank-based, so robust to California/Texas being extreme outliers.
rho, p_value = spearmanr(
    enrollment_performance_analysis_df["total_enrollment"],
    enrollment_performance_analysis_df["average_scale_score"],
)

print(enrollment_performance_summary.to_string(index=False))
print()
print(f"Spearman rank correlation: rho = {rho:.3f}, p-value = {p_value:.3f}")

   enrollment_tercile  states_in_bucket  mean_naep_score
 Small (bottom third)                18       268.881667
Medium (middle third)                17       271.464118
    Large (top third)                17       274.167059

Spearman rank correlation: rho = 0.104, p-value = 0.462

Visualizing Enrollment Size and Educational Performance

A scatter plot is the right shape for this question. Each point is one state; the x-axis is total enrollment, the y-axis is the 2024 NAEP Grade 8 math score. If there’s a relationship, you see it in the cloud.

fig, ax = plt.subplots(figsize=(11, 6))

ax.scatter(
    enrollment_performance_analysis_df["total_enrollment"],
    enrollment_performance_analysis_df["average_scale_score"],
    s=40,
)

# Label the most extreme points so the chart isn't anonymous.
extremes = pd.concat([
    enrollment_performance_analysis_df.nlargest(3, "total_enrollment"),
    enrollment_performance_analysis_df.nsmallest(3, "total_enrollment"),
    enrollment_performance_analysis_df.nlargest(2, "average_scale_score"),
    enrollment_performance_analysis_df.nsmallest(2, "average_scale_score"),
]).drop_duplicates(subset=["state_name"])

for _, row in extremes.iterrows():
    ax.annotate(
        row["state_name"].title(),
        (row["total_enrollment"], row["average_scale_score"]),
        xytext=(5, 5), textcoords="offset points", fontsize=9,
    )

ax.set_title(f"State Enrollment vs. 2024 NAEP Grade 8 Math Score (Spearman ρ = {rho:.2f})")
ax.set_xlabel("Total State Enrollment (CRDC)")
ax.set_ylabel("2024 NAEP Grade 8 Math — Average Scale Score")
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Enrollment Size and Educational Performance Interpretation

The Spearman correlation is close to zero, which is the honest answer: across U.S. states in 2024, total enrollment and NAEP Grade 8 math performance are not meaningfully related. Massachusetts is mid-sized and tops the chart; Texas and California are huge and sit near the middle of the score distribution; the highest-scoring small state isn’t dramatically higher than the highest-scoring large state.

A few caveats worth being explicit about:

Aggregate state-level data can’t answer a district-level question. Two states with the same average can have very different distributions underneath.
Total enrollment is a proxy for “system size,” not for any specific causal factor. Funding, demographics, instructional approach, and dozens of other things vary across states.
The bucket means are informative but the buckets are small (~17 states each). Don’t read too much into small differences between bucket means.

The takeaway for this project is more methodological than substantive: enrollment scale alone isn’t useful as an explanatory variable for academic performance, so the more interesting question is whether technology participation — distance education enrollment specifically — shows a relationship. That’s what the next section explores.

Project Summary and Key Findings

This project demonstrates the development of an end-to-end educational technology impact analysis workflow using publicly available education and technology-related datasets.

The primary objective is to evaluate educational outcome trends and explore how district-level characteristics, enrollment patterns, and technology-enabled learning participation may relate to student performance.

To support this analysis, I:

Collected educational and technology-related datasets from multiple public sources
Built a PostgreSQL dimensional warehouse environment
Designed ETL workflows for data integration and transformation
Connected district, state, year, educational outcome, and distance education datasets
Performed SQL-based analytical reporting
Created visualizations to communicate educational trends and comparisons
Evaluated post-pandemic educational performance patterns
Integrated CRDC district characteristics and distance education participation metrics into the warehouse environment

Key findings from the current analysis include:

Educational performance varies significantly across states.
Post-pandemic NAEP mathematics scores declined across many educational systems.
Larger enrollment systems do not necessarily produce stronger educational outcomes.
Distance education participation varies widely across states and districts.
Higher distance education participation alone does not automatically correspond to stronger NAEP mathematics outcomes.
Technology adoption appears most meaningful when evaluated alongside implementation quality, access, instructional context, enrollment scale, and educational equity.

This project demonstrates practical experience in data engineering, analytics engineering, PostgreSQL development, ETL pipeline design, SQL analytics, data visualization, and research-oriented reporting workflows.

Loading the CRDC Distance Education Dataset

To evaluate technology adoption and digital learning participation, I load the CRDC Distance Education dataset into the analytics workflow.

# Load the CRDC Distance Education dataset.
distance_education_path = "../data/raw/crdc/Distance Education.csv"

df_distance_education = pd.read_csv(
    distance_education_path,
    encoding="latin1",
    low_memory=False
)

# Display dataset dimensions.
print("Number of rows:", df_distance_education.shape[0])
print("Number of columns:", df_distance_education.shape[1])

# Display column names.
df_distance_education.columns.tolist()

Number of rows: 17604
Number of columns: 29

['LEA_STATE',
 'LEA_STATE_NAME',
 'LEAID',
 'LEA_NAME',
 'LEA_ADDRESS',
 'LEA_CITY',
 'LEA_ZIP',
 'CJJ',
 'LEA_DISTED_IND',
 'LEA_DISTEDENR_HI_M',
 'LEA_DISTEDENR_HI_F',
 'LEA_DISTEDENR_AM_M',
 'LEA_DISTEDENR_AM_F',
 'LEA_DISTEDENR_AS_M',
 'LEA_DISTEDENR_AS_F',
 'LEA_DISTEDENR_HP_M',
 'LEA_DISTEDENR_HP_F',
 'LEA_DISTEDENR_BL_M',
 'LEA_DISTEDENR_BL_F',
 'LEA_DISTEDENR_WH_M',
 'LEA_DISTEDENR_WH_F',
 'LEA_DISTEDENR_TR_M',
 'LEA_DISTEDENR_TR_F',
 'TOT_DISTEDENR_M',
 'TOT_DISTEDENR_F',
 'LEA_DISTEDENR_LEP_M',
 'LEA_DISTEDENR_LEP_F',
 'LEA_DISTEDENR_IDEA_M',
 'LEA_DISTEDENR_IDEA_F']

Cleaning and Standardizing the Distance Education Dataset

Before integrating the distance education dataset into the warehouse workflow, I standardize column names and prepare the dataset for analytical processing.

# Create a cleaned copy of the distance education dataset.
df_distance_education_clean = df_distance_education.copy()

# Standardize column names.
df_distance_education_clean.columns = (
    df_distance_education_clean.columns
    .str.lower()
    .str.strip()
)

# Rename key columns for readability.
df_distance_education_clean = df_distance_education_clean.rename(columns={
    "lea_state_name": "state_name",
    "leaid": "district_id",
    "lea_name": "district_name",
    "lea_disted_ind": "distance_education_indicator",
    "tot_distedenr_m": "male_distance_education_enrollment",
    "tot_distedenr_f": "female_distance_education_enrollment"
})

# Display cleaned column names.
df_distance_education_clean.columns.tolist()

['lea_state',
 'state_name',
 'district_id',
 'district_name',
 'lea_address',
 'lea_city',
 'lea_zip',
 'cjj',
 'distance_education_indicator',
 'lea_distedenr_hi_m',
 'lea_distedenr_hi_f',
 'lea_distedenr_am_m',
 'lea_distedenr_am_f',
 'lea_distedenr_as_m',
 'lea_distedenr_as_f',
 'lea_distedenr_hp_m',
 'lea_distedenr_hp_f',
 'lea_distedenr_bl_m',
 'lea_distedenr_bl_f',
 'lea_distedenr_wh_m',
 'lea_distedenr_wh_f',
 'lea_distedenr_tr_m',
 'lea_distedenr_tr_f',
 'male_distance_education_enrollment',
 'female_distance_education_enrollment',
 'lea_distedenr_lep_m',
 'lea_distedenr_lep_f',
 'lea_distedenr_idea_m',
 'lea_distedenr_idea_f']

Analyzing Distance Education Adoption by State

To begin evaluating technology adoption patterns, I summarize the number of districts reporting distance education participation by state.

# SQL-style aggregation using pandas.
distance_education_by_state = (
    df_distance_education_clean
    .groupby("state_name")["distance_education_indicator"]
    .apply(lambda x: (x == "Yes").sum())
    .reset_index(name="districts_using_distance_education")
)

# Sort results.
distance_education_by_state = (
    distance_education_by_state
    .sort_values(
        by="districts_using_distance_education",
        ascending=False
    )
)

# Display top states using distance education.
distance_education_by_state.head(10)

	state_name	districts_using_distance_education
44	TEXAS	506
22	MICHIGAN	364
38	PENNSYLVANIA	315
15	IOWA	275
35	OHIO	275
32	NEW YORK	272
3	ARKANSAS	247
4	CALIFORNIA	239
25	MISSOURI	229
50	WISCONSIN	220

Visualizing Distance Education Adoption by State

To better understand technology adoption patterns, I create a visualization showing the states with the highest number of districts participating in distance education.

# Select the top 10 states using distance education.
top_distance_education_states = (
    distance_education_by_state
    .head(10)
    .sort_values(
        by="districts_using_distance_education",
        ascending=True
    )
)

# Create the chart.
plt.figure(figsize=(12, 6))

plt.barh(
    top_distance_education_states["state_name"],
    top_distance_education_states["districts_using_distance_education"]
)

# Add chart title and labels.
plt.title("Top States by District Distance Education Adoption")
plt.xlabel("Districts Using Distance Education")
plt.ylabel("State")

# Improve spacing.
plt.tight_layout()

# Display chart.
plt.show()

Creating the Distance Education Adoption Fact Table

Before aggregating, a data-quality note: the CRDC distance-education file uses -9 as a reserved code for “missing or suppressed value,” not as a real student count. Roughly 65% of the districts in this file (11,384 out of 17,604) report -9 for both the male and female totals. If those codes are treated as numbers and summed, the resulting state totals are nonsense — several states end up with negative student counts. I replace the sentinel codes with NaN so they’re excluded from the sum rather than corrupting it.

# Pull the columns we need from the cleaned distance-ed dataset.
fact_distance_education = df_distance_education_clean[
    [
        "district_id",
        "state_name",
        "district_name",
        "distance_education_indicator",
        "male_distance_education_enrollment",
        "female_distance_education_enrollment"
    ]
].copy()

# Coerce to numeric. errors="coerce" turns non-numeric strings into NaN.
for col in ["male_distance_education_enrollment", "female_distance_education_enrollment"]:
    fact_distance_education[col] = pd.to_numeric(
        fact_distance_education[col], errors="coerce"
    )

# Replace CRDC sentinel codes for suppressed/missing data with NaN.
# Negative values are not valid student counts in this file.
sentinel_codes = [-3, -5, -6, -8, -9, -11]
for col in ["male_distance_education_enrollment", "female_distance_education_enrollment"]:
    fact_distance_education.loc[
        fact_distance_education[col].isin(sentinel_codes), col
    ] = np.nan

# Total enrollment per district. With sentinels nulled out, the sum treats
# missing values as missing instead of as -9.
fact_distance_education["total_distance_education_enrollment"] = (
    fact_distance_education["male_distance_education_enrollment"]
    .add(fact_distance_education["female_distance_education_enrollment"], fill_value=0)
)

# How much of the file was affected? Useful to show explicitly.
total_districts = len(fact_distance_education)
suppressed_districts = (
    fact_distance_education["male_distance_education_enrollment"].isna()
    & fact_distance_education["female_distance_education_enrollment"].isna()
).sum()
print(f"Districts in file: {total_districts:,}")
print(f"Districts with both M and F suppressed (no usable count): {suppressed_districts:,} "
      f"({suppressed_districts / total_districts:.1%})")

fact_distance_education.head()

Districts in file: 17,604
Districts with both M and F suppressed (no usable count): 11,384 (64.7%)

	district_id	state_name	district_name	distance_education_indicator	male_distance_education_enrollment	female_distance_education_enrollment	total_distance_education_enrollment
0	0100002	ALABAMA	Alabama Youth Services	No	NaN	NaN	NaN
1	0100005	ALABAMA	Albertville City	Yes	4.0	14.0	18.0
2	0100006	ALABAMA	Marshall County	Yes	64.0	75.0	139.0
3	0100007	ALABAMA	Hoover City	Yes	218.0	340.0	558.0
4	0100008	ALABAMA	Madison City	Yes	89.0	154.0	243.0

Loading the Distance Education Fact Table into PostgreSQL

After creating the distance education fact table, I load it into PostgreSQL so technology adoption metrics can be queried with SQL and connected to the broader education warehouse.

# Load the distance education fact table into PostgreSQL.
fact_distance_education.to_sql(
    name="fact_distance_education_adoption",
    con=engine,
    schema="public",
    if_exists="replace",
    index=False
)

# Confirm successful PostgreSQL table creation.
print("Distance education adoption fact table loaded successfully into PostgreSQL.")

Distance education adoption fact table loaded successfully into PostgreSQL.

Verifying the Distance Education Fact Table in PostgreSQL

To confirm the distance education adoption fact table loaded correctly, I query PostgreSQL and preview the stored technology adoption records.

# Query PostgreSQL to preview the distance education adoption fact table.
distance_education_preview_query = """
SELECT *
FROM public.fact_distance_education_adoption
LIMIT 10;
"""

# Display distance education adoption records.
pd.read_sql(distance_education_preview_query, engine)

	district_id	state_name	district_name	distance_education_indicator	male_distance_education_enrollment	female_distance_education_enrollment	total_distance_education_enrollment
0	0100002	ALABAMA	Alabama Youth Services	No	NaN	NaN	NaN
1	0100005	ALABAMA	Albertville City	Yes	4.0	14.0	18.0
2	0100006	ALABAMA	Marshall County	Yes	64.0	75.0	139.0
3	0100007	ALABAMA	Hoover City	Yes	218.0	340.0	558.0
4	0100008	ALABAMA	Madison City	Yes	89.0	154.0	243.0
5	0100009	ALABAMA	Al Inst Deaf And Blind	No	NaN	NaN	NaN
6	0100011	ALABAMA	Leeds City	Yes	71.0	64.0	135.0
7	0100012	ALABAMA	Boaz City	Yes	13.0	32.0	45.0
8	0100013	ALABAMA	Trussville City	No	NaN	NaN	NaN
9	0100018	ALABAMA	Alabama School of Fine Arts	No	NaN	NaN	NaN

Analyzing Distance Education Enrollment by State

To evaluate the scale of technology-enabled learning participation, I summarize total distance education enrollment by state.

# SQL query to summarize distance education enrollment by state.
distance_education_enrollment_query = read_sql_file("distance_ed_top_states.sql")
# Display top states by distance education enrollment.
pd.read_sql(distance_education_enrollment_query, engine)

	state_name	total_distance_education_enrollment
0	TEXAS	103166.0
1	CALIFORNIA	98436.0
2	GEORGIA	72605.0
3	ARKANSAS	59193.0
4	MICHIGAN	51837.0
5	PENNSYLVANIA	48042.0
6	IOWA	43741.0
7	VIRGINIA	42440.0
8	NORTH CAROLINA	38877.0
9	OHIO	38492.0

Visualizing Distance Education Enrollment by State

To better understand technology-enabled learning participation, I create a visualization showing the states with the highest distance education enrollment totals.

# Store the distance education enrollment query results in a DataFrame.
distance_education_enrollment_df = pd.read_sql(
    distance_education_enrollment_query,
    engine
)

# Convert numeric values.
distance_education_enrollment_df[
    "total_distance_education_enrollment"
] = pd.to_numeric(
    distance_education_enrollment_df[
        "total_distance_education_enrollment"
    ]
)

# Sort values for visualization.
distance_education_enrollment_df = (
    distance_education_enrollment_df
    .sort_values(
        by="total_distance_education_enrollment",
        ascending=True
    )
)

# Create the chart.
plt.figure(figsize=(12, 6))

plt.barh(
    distance_education_enrollment_df["state_name"],
    distance_education_enrollment_df[
        "total_distance_education_enrollment"
    ]
)

# Add chart title and labels.
plt.title("Top States by Distance Education Enrollment")
plt.xlabel("Distance Education Enrollment")
plt.ylabel("State")

# Improve spacing.
plt.tight_layout()

# Display chart.
plt.show()

Connecting Distance Education Adoption with NAEP Math Performance

To begin evaluating the relationship between technology-enabled learning participation and student outcomes, I join state-level distance education enrollment with 2024 NAEP Grade 8 mathematics performance scores.

# SQL query joining distance education enrollment with 2024 NAEP math scores.
distance_education_naep_query = read_sql_file("distance_ed_naep_comparison.sql")
# Display the joined distance education and NAEP performance dataset.
pd.read_sql(distance_education_naep_query, engine)

	state_name	total_distance_education_enrollment	average_scale_score
0	TEXAS	103166.0	269.393041
1	CALIFORNIA	98436.0	268.779818
2	GEORGIA	72605.0	268.655805
3	ARKANSAS	59193.0	266.194834
4	MICHIGAN	51837.0	269.978774
5	PENNSYLVANIA	48042.0	276.245915
6	IOWA	43741.0	274.823931
7	VIRGINIA	42440.0	274.988690
8	NORTH CAROLINA	38877.0	275.821518
9	OHIO	38492.0	278.812721
10	MASSACHUSETTS	34140.0	283.465559
11	ARIZONA	34048.0	269.656164
12	NEVADA	30938.0	265.177162
13	ALABAMA	25343.0	261.769896
14	SOUTH CAROLINA	21815.0	268.006371
15	COLORADO	21727.0	278.107690
16	UTAH	21405.0	281.790139
17	INDIANA	20332.0	278.185064
18	NEW MEXICO	18179.0	256.210854
19	OREGON	18157.0	267.940476
20	LOUISIANA	14579.0	266.783264
21	WASHINGTON	14276.0	273.665174
22	FLORIDA	14121.0	267.242908
23	WISCONSIN	12322.0	282.649634
24	IDAHO	11994.0	278.104479
25	MISSOURI	10067.0	270.363197
26	ILLINOIS	9845.0	277.416366
27	MINNESOTA	9657.0	282.087421
28	NEW YORK	8990.0	270.851724
29	KENTUCKY	8462.0	270.813300
30	MARYLAND	6948.0	268.213760
31	OKLAHOMA	6893.0	264.454031
32	NEW JERSEY	6222.0	281.684698
33	ALASKA	6100.0	263.962653
34	TENNESSEE	5783.0	275.982988
35	NORTH DAKOTA	4195.0	279.765445
36	KANSAS	4096.0	274.136728
37	NEW HAMPSHIRE	3556.0	279.744642
38	MISSISSIPPI	3246.0	269.060485
39	NEBRASKA	3136.0	279.883496
40	WEST VIRGINIA	2750.0	260.769078
41	SOUTH DAKOTA	2739.0	281.101137
42	MONTANA	2510.0	279.082096
43	DELAWARE	1693.0	263.057137
44	WYOMING	1566.0	278.315324
45	MAINE	1395.0	272.621883
46	CONNECTICUT	1297.0	276.674248
47	PUERTO RICO	1190.0	216.087041
48	VERMONT	1051.0	275.617661
49	HAWAII	813.0	270.041103
50	RHODE ISLAND	745.0	269.826789
51	DISTRICT OF COLUMBIA	218.0	261.538791

Visualizing Distance Education Enrollment and NAEP Math Performance

Same approach as the enrollment-size analysis: scatter plot with a Spearman correlation. Each point is one state.

distance_education_naep_df = pd.read_sql(distance_education_naep_query, engine)

distance_education_naep_df["total_distance_education_enrollment"] = pd.to_numeric(
    distance_education_naep_df["total_distance_education_enrollment"]
)
distance_education_naep_df["average_scale_score"] = pd.to_numeric(
    distance_education_naep_df["average_scale_score"]
).round(2)

# Spearman correlation between distance-ed enrollment and 2024 NAEP score.
rho_de, p_de = spearmanr(
    distance_education_naep_df["total_distance_education_enrollment"],
    distance_education_naep_df["average_scale_score"],
)

fig, ax = plt.subplots(figsize=(11, 6))
ax.scatter(
    distance_education_naep_df["total_distance_education_enrollment"],
    distance_education_naep_df["average_scale_score"],
    s=40,
)

# Annotate the most extreme states on each axis so the chart isn't anonymous.
de_extremes = pd.concat([
    distance_education_naep_df.nlargest(4, "total_distance_education_enrollment"),
    distance_education_naep_df.nlargest(2, "average_scale_score"),
    distance_education_naep_df.nsmallest(2, "average_scale_score"),
]).drop_duplicates(subset=["state_name"])

for _, row in de_extremes.iterrows():
    ax.annotate(
        row["state_name"].title(),
        (row["total_distance_education_enrollment"], row["average_scale_score"]),
        xytext=(5, 5), textcoords="offset points", fontsize=9,
    )

ax.set_title(
    f"Distance Education Enrollment vs. 2024 NAEP Math Score (Spearman ρ = {rho_de:.2f})"
)
ax.set_xlabel("State Distance Education Enrollment (CRDC, sentinel codes excluded)")
ax.set_ylabel("2024 NAEP Grade 8 Math — Average Scale Score")
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Spearman: rho = {rho_de:.3f}, p-value = {p_de:.3f}")

Spearman: rho = -0.026, p-value = 0.856

Distance Education and NAEP Performance Interpretation

The headline number: Spearman ρ is near zero, p-value well above 0.05. There’s no meaningful rank correlation between state distance-education enrollment and 2024 NAEP Grade 8 math scores. Texas leads in raw distance-ed enrollment and sits below the median on NAEP; Massachusetts has middling distance-ed enrollment and tops the score chart. The relationship just isn’t there at the state aggregate level.

A few things worth being explicit about so this isn’t overclaimed:

This is correlation, not causation. Even a strong correlation here wouldn’t tell you whether distance education helps, hurts, or has no effect — districts adopting distance ed at scale differ from non-adopters in many ways beyond their technology choices.
State-level aggregates hide everything interesting underneath. A state with 100,000 distance-ed students could be 100 districts with 1,000 each, or 5 districts with 20,000 each. Whether the program “worked” is a district-level question that disappears in the rollup.
2024 is one year. Distance education adoption surged 2020–2022 and has been receding since. A single-year snapshot can’t show that arc. Useful follow-up: trend fact_distance_education_adoption against NAEP year-over-year.
Distance-ed enrollment is not the same as technology access. A district can have universal device + broadband access and zero distance-ed enrollment. CRDC has fields closer to “access” (1:1 device programs, broadband infrastructure) that I haven’t pulled into the warehouse yet — that’s the natural next layer.

What this analysis does establish is the pipeline plumbing: every CRDC district can be joined to its state, its enrollment, its distance-ed participation, and its state’s NAEP scores in one SQL query. Once technology-access fields are added, the same join shape extends to answer the more specific question.

Final Technology Impact Analysis Conclusion

This project built an end-to-end data engineering and analytics workflow to ask a focused question: does technology-enabled learning participation in U.S. public schools associate with measurable differences in academic performance? The short answer, from the data integrated here, is no — not at the state aggregate level, and not in a way that survives basic statistical scrutiny.

The longer answer is that the question is harder than it looks. The most informative outcome of this project isn’t the correlation itself; it’s the warehouse and the pipeline that make the next, harder version of the question tractable. Once district-level technology access measures (1:1 device programs, broadband, instructional technology investment) are layered into the same fact-table structure, the same join shape — district → state → year → outcome — can answer a question that state-level distance-ed enrollment can’t.

What this iteration delivered:

Four federal datasets (CCD operational data, NAEP Grade 8 math, CRDC district characteristics, CRDC distance education) extracted, cleaned, and loaded into PostgreSQL
A dimensional warehouse (dim_states, dim_districts, dim_years, plus four fact tables) that supports state-level, district-level, and longitudinal analysis
A documented data-quality fix for the CRDC -9 sentinel-code issue — 65% of districts in the distance-ed file had suppressed values that produced negative state totals in earlier iterations; the corrected aggregation drops California’s total from understated-negative territory to ~98K students and Texas to ~103K
Honest statistical reporting — Spearman correlations with p-values rather than eyeballed bar charts, and scatter plots that don’t hide the underlying distribution
A reproducible report rendered with Quarto that ties code, results, and interpretation together in one place

What this iteration didn’t deliver, and which I’d want to add next:

District-level technology access measures from CRDC, not just distance-ed participation
Year-over-year trend modeling for the post-pandemic period using fact_distance_education_adoption over time
Tests on the ETL transforms — particularly around the sentinel-code handling, since that’s the bug class that bit me here

The skills exercised across the workflow: dimensional warehouse design in PostgreSQL, SQL analytics (CTEs, conditional aggregation, multi-source joins), data quality investigation and remediation, statistical reporting with appropriate caveats, and reproducible reporting with Quarto.

Live Tableau Dashboards

The warehouse and analysis in this report also power three interactive dashboards published on Tableau Public. They’re built on top of the same fact and dimension tables — the static charts here are a snapshot; the dashboards let you filter, hover, and explore the data yourself.

NAEP Grade 8 Mathematics — National & State Performance — A color-coded U.S. map of 2024 scores, the national trend line from 2009 through 2024, multi-state comparisons, and the post-pandemic delta by state.
Distance Education Adoption — Reach and Relationship to Performance — Top 10 states by distance-education enrollment, the distance-ed vs. NAEP scatter that anchors this report’s headline finding, and the share of districts that even report participating.
System Size & Operations — Public School Infrastructure Across States — Top 10 states by operational school count, average schools per district (a proxy for how consolidated each state’s system is), and total enrollment plotted against 2024 NAEP scores.