Education Technology Impact Analysis

Author

Naseer Aryobee

Project Overview

This project analyzes how education technology adoption may relate to public school performance across the United States.

The goal of this project is to build a portfolio-grade data engineering and analytics workflow that extracts, cleans, transforms, loads, and analyzes education-related datasets using Python, PostgreSQL, SQL, and Quarto.

I use this workflow to develop a reproducible analytics pipeline that can support long-term research into whether access to technology in schools is associated with measurable differences in educational outcomes.

Project Objectives

The main objectives of this project are to:

  • Extract raw public education, student outcome, and technology participation datasets from trusted national sources
  • Clean and standardize inconsistent public-sector data files
  • Build a PostgreSQL analytical warehouse using fact and dimension tables
  • Connect district, state, year, educational outcome, and technology adoption data through a dimensional model
  • Perform SQL-based analysis directly against PostgreSQL
  • Create visualizations and reporting outputs inside a reproducible Quarto HTML report
  • Evaluate how technology-enabled learning participation may relate to measurable student outcome trends

Data Sources

This project uses publicly available education and technology-related datasets, including:

Dataset Source Purpose
CCD Common Core of Data District and school operational information
NAEP National Assessment of Educational Progress State-level academic performance outcomes
CRDC Civil Rights Data Collection — LEA Characteristics District-level characteristics and enrollment information
CRDC Distance Education Technology-enabled learning participation and distance education enrollment indicators

The completed workflow integrates CCD district data, NAEP Grade 8 mathematics performance data, CRDC district characteristics, and CRDC distance education participation metrics. Together, these datasets support analysis of how technology adoption and district-level conditions may relate to student outcome patterns.

Tools and Technologies

This project uses:

Tool Purpose
Python Data extraction, cleaning, transformation, and validation
pandas Data manipulation and profiling
PostgreSQL Analytical warehouse storage
SQL Warehouse querying and analysis
Quarto Reproducible HTML reporting
matplotlib Data visualization
GitHub Version control and portfolio documentation

Research Direction

The central research question behind this project is:

Does increased access to technology-enabled learning in public schools appear to be associated with measurable differences in academic performance across states, districts, and reporting years?

This project builds an end-to-end data engineering and analytics workflow to explore that question using integrated public education datasets, PostgreSQL warehouse modeling, SQL analysis, and reproducible reporting.

Project Setup

Before any data work, I load the libraries used throughout the report and connect to the local PostgreSQL warehouse. Credentials are read from a .env file (see .env.example for the variables you need to define locally). The .env file itself is gitignored — it never leaves the machine.

import os
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from dotenv import load_dotenv, find_dotenv
from sqlalchemy import create_engine
from sqlalchemy.engine import URL

# Load credentials from the local .env file.
load_dotenv(find_dotenv(usecwd=True))

DB_USER     = os.getenv("DB_USER")
DB_PASSWORD = os.getenv("DB_PASSWORD")
DB_HOST     = os.getenv("DB_HOST")
DB_PORT     = os.getenv("DB_PORT")
DB_NAME     = os.getenv("DB_NAME")

# Fail fast if anything is missing — easier to debug than a cryptic error later.
required = {
    "DB_USER": DB_USER, "DB_PASSWORD": DB_PASSWORD,
    "DB_HOST": DB_HOST, "DB_PORT": DB_PORT, "DB_NAME": DB_NAME
}
missing = [k for k, v in required.items() if not v]
if missing:
    raise RuntimeError(f"Missing environment variables: {missing}")

# URL.create handles special characters in the password safely.
database_url = URL.create(
    drivername="postgresql+psycopg2",
    username=DB_USER,
    password=DB_PASSWORD,
    host=DB_HOST,
    port=int(DB_PORT),
    database=DB_NAME,
)
engine = create_engine(database_url)

# Analytical SQL queries live in sql/queries/ rather than being embedded
# inline as Python strings. This helper reads one of those files.
# The notebooks/ folder is one level under the repo root, so queries are
# at ../sql/queries/<filename>.sql relative to this .qmd.
SQL_QUERIES_DIR = Path("../sql/queries").resolve()

def read_sql_file(filename: str) -> str:
    """Read a SQL file from sql/queries/ and return its contents."""
    path = SQL_QUERIES_DIR / filename
    if not path.exists():
        raise FileNotFoundError(f"SQL file not found: {path}")
    return path.read_text(encoding="utf-8")

# Quick connection sanity check.
with engine.connect() as conn:
    print(f"Connected to PostgreSQL: {DB_NAME} @ {DB_HOST}:{DB_PORT}")
Connected to PostgreSQL: education_tech_impact @ localhost:5432

Data Extraction — CCD Dataset

In this step, I load the Common Core of Data (CCD) dataset into Python for initial inspection and preprocessing.

This dataset contains district-level information including enrollment, number of schools, and geographic details.

I use pandas to read the CSV file and examine its structure before loading it into PostgreSQL.

# Define the relative path to the CCD dataset.
# The raw data is stored locally and excluded from GitHub using .gitignore.
file_path = "../data/raw/ccd/ccd_lea_2024.csv"

# Load the CCD CSV file into a pandas DataFrame.
# This allows me to inspect the dataset before database loading.
df_ccd = pd.read_csv(file_path)

# Display the first five records to confirm the file loaded correctly.
df_ccd.head()
C:\Users\User\AppData\Local\Temp\ipykernel_21316\1148928295.py:7: DtypeWarning: Columns (0: MSTREET3) have mixed types. Specify dtype option on import or set low_memory=False.
  df_ccd = pd.read_csv(file_path)
SCHOOL_YEAR FIPST STATENAME ST LEA_NAME STATE_AGENCY_NO UNION ST_LEAID LEAID MSTREET1 ... G_11_OFFERED G_12_OFFERED G_13_OFFERED G_UG_OFFERED G_AE_OFFERED GSLO GSHI LEVEL IGOFFERED OPERATIONAL_SCHOOLS
0 2024-2025 1 ALABAMA AL Alabama Youth Services 1 NaN AL-210 100002 1000 Industrial School Road ... Yes Yes No No No KG 12 Other As reported 0
1 2024-2025 1 ALABAMA AL Albertville City 1 NaN AL-101 100005 8379 US Highway 431 ... Yes Yes No No No PK 12 Other As reported 7
2 2024-2025 1 ALABAMA AL Marshall County 1 NaN AL-048 100006 12380 US Highway 431 S ... Yes Yes No No No PK 12 Other As reported 17
3 2024-2025 1 ALABAMA AL Hoover City 1 NaN AL-158 100007 2810 Metropolitan Way ... Yes Yes No No No PK 12 Other As reported 18
4 2024-2025 1 ALABAMA AL Madison City 1 NaN AL-169 100008 211 Celtic Dr ... Yes Yes No No No PK 12 Other As reported 13

5 rows × 58 columns

Initial Data Inspection — CCD Dataset

In this step, I inspect the CCD dataset to understand its size, structure, and key columns before performing any cleaning or transformation.

This inspection helps confirm that the data loaded correctly and provides an initial understanding of the fields available for analysis.

# Display the number of rows and columns in the CCD dataset.
# This gives me a quick overview of the dataset size before cleaning.
row_count, column_count = df_ccd.shape

print(f"Number of rows: {row_count}")
print(f"Number of columns: {column_count}")

# Display all column names so I can identify fields needed for the district dimension table.
df_ccd.columns.tolist()
Number of rows: 19630
Number of columns: 58
['SCHOOL_YEAR',
 'FIPST',
 'STATENAME',
 'ST',
 'LEA_NAME',
 'STATE_AGENCY_NO',
 'UNION',
 'ST_LEAID',
 'LEAID',
 'MSTREET1',
 'MSTREET2',
 'MSTREET3',
 'MCITY',
 'MSTATE',
 'MZIP',
 'MZIP4',
 'LSTREET1',
 'LSTREET2',
 'LSTREET3',
 'LCITY',
 'LSTATE',
 'LZIP',
 'LZIP4',
 'PHONE',
 'WEBSITE',
 'SY_STATUS',
 'SY_STATUS_TEXT',
 'UPDATED_STATUS',
 'UPDATED_STATUS_TEXT',
 'EFFECTIVE_DATE',
 'LEA_TYPE',
 'LEA_TYPE_TEXT',
 'OUT_OF_STATE_FLAG',
 'CHARTER_LEA',
 'CHARTER_LEA_TEXT',
 'NOGRADES',
 'G_PK_OFFERED',
 'G_KG_OFFERED',
 'G_1_OFFERED',
 'G_2_OFFERED',
 'G_3_OFFERED',
 'G_4_OFFERED',
 'G_5_OFFERED',
 'G_6_OFFERED',
 'G_7_OFFERED',
 'G_8_OFFERED',
 'G_9_OFFERED',
 'G_10_OFFERED',
 'G_11_OFFERED',
 'G_12_OFFERED',
 'G_13_OFFERED',
 'G_UG_OFFERED',
 'G_AE_OFFERED',
 'GSLO',
 'GSHI',
 'LEVEL',
 'IGOFFERED',
 'OPERATIONAL_SCHOOLS']

Selecting District-Level Columns

In this step, I identify the key district-level fields that will be used to build the district dimension table in PostgreSQL.

The CCD dataset contains many administrative and metadata fields. For the initial data model, I focus on the core district identification and geographic columns needed for analysis.

# Select district-related columns needed for the dimension table.
district_columns = [
    "LEAID",
    "LEA_NAME",
    "ST",
    "STATENAME",
    "MCITY",
    "WEBSITE",
    "OPERATIONAL_SCHOOLS"
]

# Create a smaller district-focused DataFrame.
df_districts = df_ccd[district_columns]

# Display the first five district records.
df_districts.head()
LEAID LEA_NAME ST STATENAME MCITY WEBSITE OPERATIONAL_SCHOOLS
0 100002 Alabama Youth Services AL ALABAMA Mt Meigs http://www.dys.alabama.gov/school-district.html 0
1 100005 Albertville City AL ALABAMA Albertville http://www.albertk12.org 7
2 100006 Marshall County AL ALABAMA Guntersville http://www.marshallk12.org 17
3 100007 Hoover City AL ALABAMA Hoover http://www.hoovercityschools.net 18
4 100008 Madison City AL ALABAMA Madison http://www.madisoncity.k12.al.us 13

Cleaning the District Dimension Dataset

Before loading the district data into PostgreSQL, I clean the dataset to remove duplicate districts and standardize the column structure.

This step helps ensure the dimension table contains one unique record per school district.

# Create a copy of the district DataFrame to preserve the original dataset.
df_district_dim = df_districts.copy()

# Remove duplicate district records using the LEAID identifier.
df_district_dim = df_district_dim.drop_duplicates(subset=["LEAID"])

# Standardize column names for PostgreSQL compatibility.
df_district_dim.columns = [
    "district_id",
    "district_name",
    "state_code",
    "state_name",
    "city",
    "website",
    "operational_schools"
]

# Reset the index after cleaning so the DataFrame has a clean sequence.
df_district_dim = df_district_dim.reset_index(drop=True)

# Display the cleaned district dimension dataset.
df_district_dim.head()
district_id district_name state_code state_name city website operational_schools
0 100002 Alabama Youth Services AL ALABAMA Mt Meigs http://www.dys.alabama.gov/school-district.html 0
1 100005 Albertville City AL ALABAMA Albertville http://www.albertk12.org 7
2 100006 Marshall County AL ALABAMA Guntersville http://www.marshallk12.org 17
3 100007 Hoover City AL ALABAMA Hoover http://www.hoovercityschools.net 18
4 100008 Madison City AL ALABAMA Madison http://www.madisoncity.k12.al.us 13

Loading the District Dimension Table into PostgreSQL

After cleaning the district dataset, I load the data into PostgreSQL as a dimension table.

This step creates the foundation for future analytics and fact tables that will be connected through district identifiers.

# Load the cleaned district dimension dataset into PostgreSQL.
# if_exists="replace" allows this workflow to be rerun during development
# without creating duplicate tables or duplicate records.
df_district_dim.to_sql(
    name="dim_districts",
    con=engine,
    schema="public",
    if_exists="replace",
    index=False
)

# Confirm successful PostgreSQL table creation.
print("District dimension table loaded successfully into PostgreSQL.")
District dimension table loaded successfully into PostgreSQL.

Verifying the PostgreSQL Target Database

Before continuing, I verify which PostgreSQL database the workflow is connected to and confirm where the table was created.

# Query PostgreSQL to confirm the active database, schema, and user.
connection_check_query = """
SELECT
    current_database() AS active_database,
    current_schema() AS active_schema,
    current_user AS active_user;
"""

# Display the PostgreSQL connection details.
pd.read_sql(connection_check_query, engine)
active_database active_schema active_user
0 education_tech_impact public postgres

Verifying the District Dimension Table in PostgreSQL

After loading the district dimension table, I verify that the table exists inside PostgreSQL.

# Query PostgreSQL to confirm the dimension table exists.
table_check_query = """
SELECT
    table_schema,
    table_name
FROM information_schema.tables
WHERE table_schema = 'public'
  AND table_name = 'dim_districts';
"""

# Display matching PostgreSQL tables.
pd.read_sql(table_check_query, engine)
table_schema table_name
0 public dim_districts

Counting Records in the District Dimension Table

To confirm the table was loaded correctly, I query PostgreSQL and count the total number of district records stored in the dimension table.

# Count records inside the PostgreSQL dimension table.
district_count_query = """
SELECT COUNT(*) AS total_district_records
FROM public.dim_districts;
"""

# Display the total number of district records.
pd.read_sql(district_count_query, engine)
total_district_records
0 19630

Previewing Records from the District Dimension Table

As a final validation step, I preview the first few records directly from the PostgreSQL table.

# Preview records directly from the PostgreSQL dimension table.
district_preview_query = """
SELECT
    district_id,
    district_name,
    state_code,
    state_name,
    city,
    operational_schools
FROM public.dim_districts
LIMIT 5;
"""

# Display the first five records from PostgreSQL.
pd.read_sql(district_preview_query, engine)
district_id district_name state_code state_name city operational_schools
0 100002 Alabama Youth Services AL ALABAMA Mt Meigs 0
1 100005 Albertville City AL ALABAMA Albertville 7
2 100006 Marshall County AL ALABAMA Guntersville 17
3 100007 Hoover City AL ALABAMA Hoover 18
4 100008 Madison City AL ALABAMA Madison 13

Creating the State Dimension Table

To support state-level analysis, I create a separate state dimension table from the cleaned district dataset.

This table stores one unique record per state and allows future fact tables to connect district-level data to state-level summaries.

# Create the state dimension table from the cleaned district dataset.
df_state_dim = (
    df_district_dim[["state_code", "state_name"]]
    .drop_duplicates()
    .sort_values(by="state_code")
    .reset_index(drop=True)
)

# Add a numeric state key for PostgreSQL modeling.
df_state_dim.insert(0, "state_key", range(1, len(df_state_dim) + 1))

# Display the state dimension table.
df_state_dim.head()
state_key state_code state_name
0 1 AK ALASKA
1 2 AL ALABAMA
2 3 AR ARKANSAS
3 4 AS AMERICAN SAMOA
4 5 AZ ARIZONA

Loading the State Dimension Table into PostgreSQL

After creating the state dimension dataset, I load the table into PostgreSQL for future state-level analytics and reporting.

# Load the state dimension table into PostgreSQL.
df_state_dim.to_sql(
    name="dim_states",
    con=engine,
    schema="public",
    if_exists="replace",
    index=False
)

# Confirm successful PostgreSQL table creation.
print("State dimension table loaded successfully into PostgreSQL.")
State dimension table loaded successfully into PostgreSQL.

Verifying the State Dimension Table in PostgreSQL

To confirm the state dimension table loaded successfully, I query PostgreSQL and count the total number of state records.

state_count_query = """
SELECT COUNT(*) AS total_state_records
FROM public.dim_states;
"""

pd.read_sql(state_count_query, engine)
total_state_records
0 57

Creating the District School Counts Fact Table

To support analytical reporting, I create the first fact table using district-level school count data.

This table connects each district to its state and stores the number of operational schools reported in the CCD dataset.

# Create a fact table for district-level school counts.
fact_district_school_counts = df_district_dim[
    [
        "district_id",
        "state_code",
        "operational_schools"
    ]
].copy()

# Rename the school count column to make the metric purpose clear.
fact_district_school_counts = fact_district_school_counts.rename(
    columns={"operational_schools": "operational_school_count"}
)

# Display the first five records from the fact table.
fact_district_school_counts.head()
district_id state_code operational_school_count
0 100002 AL 0
1 100005 AL 7
2 100006 AL 17
3 100007 AL 18
4 100008 AL 13

Loading the District School Counts Fact Table into PostgreSQL

After creating the district school counts fact table, I load it into PostgreSQL so it can be queried with SQL and used for future dashboard analysis.

# Load the district school counts fact table into PostgreSQL.
fact_district_school_counts.to_sql(
    name="fact_district_school_counts",
    con=engine,
    schema="public",
    if_exists="replace",
    index=False
)

# Confirm successful PostgreSQL table creation.
print("District school counts fact table loaded successfully into PostgreSQL.")
District school counts fact table loaded successfully into PostgreSQL.

Analyzing States with the Highest Number of Operational Schools

To begin the analytical portion of the project, I query PostgreSQL to identify which states have the highest total number of operational schools.

This query joins the state dimension table with the district school counts fact table and aggregates the results at the state level.

# SQL query to identify states with the highest operational school counts.
top_states_query = read_sql_file("top_states_by_operational_schools.sql")
# Execute the SQL query and display the results.
pd.read_sql(top_states_query, engine)
state_name total_operational_schools
0 CALIFORNIA 10303.0
1 TEXAS 9192.0
2 NEW YORK 4810.0
3 ILLINOIS 4398.0
4 FLORIDA 4186.0
5 OHIO 3543.0
6 MICHIGAN 3468.0
7 PENNSYLVANIA 2932.0
8 MINNESOTA 2741.0
9 NORTH CAROLINA 2705.0

Visualizing the Top 10 States by Operational Schools

To make the SQL analysis easier to interpret, I create a bar chart showing the top 10 states by total operational school count.

This visualization helps translate the PostgreSQL query results into a clear portfolio-ready analytical output.

# Store the SQL query result in a DataFrame for visualization.
top_states_df = pd.read_sql(top_states_query, engine)

# Create a horizontal bar chart for readability.
plt.figure(figsize=(10, 6))
plt.barh(
    top_states_df["state_name"],
    top_states_df["total_operational_schools"]
)

# Invert the y-axis so the state with the highest count appears at the top.
plt.gca().invert_yaxis()

# Add chart labels and title.
plt.title("Top 10 States by Operational School Count")
plt.xlabel("Total Operational Schools")
plt.ylabel("State")

# Improve spacing so labels do not overlap.
plt.subplots_adjust(left=0.28)

# Display the chart in the Quarto HTML report.
plt.show()

Analyzing Average Operational Schools per District by State

To compare state-level district structures, I calculate the average number of operational schools per district for each state.

This helps show which states tend to have larger or smaller district-level school systems.

# SQL query to calculate the average number of operational schools per district by state.
avg_schools_per_district_query = read_sql_file("avg_schools_per_district_by_state.sql")
# Execute the SQL query and display the results.
pd.read_sql(avg_schools_per_district_query, engine)
state_name avg_operational_schools_per_district
0 PUERTO RICO 862.00
1 HAWAII 296.00
2 MARYLAND 56.32
3 FLORIDA 49.83
4 NEVADA 36.75
5 NORTHERN MARIANAS 35.00
6 AMERICAN SAMOA 28.00
7 SOUTH CAROLINA 13.30
8 TENNESSEE 12.84
9 VIRGINIA 10.17

Visualizing Average Operational Schools per District by State

To better interpret the state-level district averages, I create a second visualization showing the average number of operational schools per district.

# Store the SQL query result in a DataFrame for visualization.
avg_schools_df = pd.read_sql(avg_schools_per_district_query, engine)

# Create a horizontal bar chart.
plt.figure(figsize=(10, 6))
plt.barh(
    avg_schools_df["state_name"],
    avg_schools_df["avg_operational_schools_per_district"]
)

# Display the highest averages at the top.
plt.gca().invert_yaxis()

# Add chart labels and title.
plt.title("Average Operational Schools per District by State")
plt.xlabel("Average Operational Schools")
plt.ylabel("State")

# Improve chart spacing.
plt.subplots_adjust(left=0.28)

# Display the visualization.
plt.show()

Tableau Dashboard Development Plan

The core data engineering and analytical workflow is now complete. The remaining presentation layer for this project is Tableau dashboard development.

The Tableau dashboard will use the cleaned warehouse outputs and focus on communicating the project’s main findings clearly for recruiters, hiring managers, and interview discussions.

  • NAEP Grade 8 mathematics performance by state
  • Post-pandemic score changes between 2019 and 2024
  • Distance education adoption and participation by state
  • Enrollment size compared with educational performance
  • Technology-enabled learning participation compared with student outcome measures

This final dashboard layer will summarize the technical workflow in an executive-style visual format while the Quarto report remains the reproducible data engineering documentation.

Current PostgreSQL Warehouse Schema

At this stage of the project, the PostgreSQL warehouse contains dimension and fact tables supporting operational education data, student outcome analysis, district characteristics, and distance education technology adoption metrics.

Table Name Description
dim_districts Stores district-level identification and geographic information
dim_states Stores state-level identifiers and surrogate state keys
dim_years Stores reporting years for longitudinal analysis
dim_crdc_district_characteristics Stores CRDC district characteristics, enrollment, and district-level context
Table Name Description
fact_district_school_counts Stores operational school counts by district and state
fact_district_school_counts_enhanced Stores school counts using surrogate state keys for enhanced warehouse modeling
fact_naep_math_performance Stores NAEP Grade 8 mathematics performance outcomes by state and reporting year
fact_distance_education_adoption Stores CRDC distance education adoption and participation metrics by district

This warehouse structure supports multi-source educational analytics, longitudinal student outcome analysis, distance education adoption reporting, and Tableau dashboard development.

Data Model Relationships

The warehouse model connects operational, outcome, district characteristic, and technology adoption data through shared state, district, and year identifiers.

Fact / Analytical Table Related Dimension / Dataset Join Key
fact_district_school_counts dim_districts district_id
fact_district_school_counts_enhanced dim_states state_key
fact_naep_math_performance dim_states state_key
fact_naep_math_performance dim_years year_key
dim_crdc_district_characteristics fact_distance_education_adoption district_id
fact_distance_education_adoption NAEP state-level analytical outputs state_name

This design allows technology adoption metrics, district characteristics, enrollment scale, and student outcome measures to be analyzed together in a consistent warehouse-driven workflow.

Data Quality Validation Checks

Before using the warehouse tables for analysis, I perform several basic data quality checks to validate the integrity of the transformed data.

# Count missing district identifiers.
missing_district_ids = df_district_dim["district_id"].isnull().sum()

# Count missing state codes.
missing_state_codes = df_district_dim["state_code"].isnull().sum()

# Count duplicate district identifiers.
duplicate_district_ids = df_district_dim["district_id"].duplicated().sum()

# Display validation results.
validation_results = pd.DataFrame({
    "validation_check": [
        "Missing district IDs",
        "Missing state codes",
        "Duplicate district IDs"
    ],
    "result": [
        missing_district_ids,
        missing_state_codes,
        duplicate_district_ids
    ]
})

validation_results
validation_check result
0 Missing district IDs 0
1 Missing state codes 0
2 Duplicate district IDs 0

Creating the Year Dimension Table

To support longitudinal and multi-year analysis, I create a year dimension table.

This table connects NAEP reporting years and future multi-year education datasets through a consistent warehouse reporting year structure.

# Define the reporting years used in the project.
analysis_years = [
    2009,
    2011,
    2013,
    2015,
    2017,
    2019,
    2022,
    2024
]

# Create the year dimension table.
df_year_dim = pd.DataFrame({
    "year_key": range(1, len(analysis_years) + 1),
    "reporting_year": analysis_years
})

# Display the year dimension table.
df_year_dim
year_key reporting_year
0 1 2009
1 2 2011
2 3 2013
3 4 2015
4 5 2017
5 6 2019
6 7 2022
7 8 2024

Loading the Year Dimension Table into PostgreSQL

After creating the year dimension table, I load it into PostgreSQL so future CCD, NAEP, and CRDC datasets can connect to a consistent reporting year structure.

# Load the year dimension table into PostgreSQL.
df_year_dim.to_sql(
    name="dim_years",
    con=engine,
    schema="public",
    if_exists="replace",
    index=False
)

# Confirm successful PostgreSQL table creation.
print("Year dimension table loaded successfully into PostgreSQL.")
Year dimension table loaded successfully into PostgreSQL.

Verifying the Year Dimension Table in PostgreSQL

To confirm the year dimension table loaded correctly, I query PostgreSQL and display the stored reporting years.

# Query PostgreSQL to display the year dimension table.
year_dimension_query = """
SELECT *
FROM public.dim_years
ORDER BY reporting_year;
"""

# Display the year dimension records.
pd.read_sql(year_dimension_query, engine)
year_key reporting_year
0 1 2009
1 2 2011
2 3 2013
3 4 2015
4 5 2017
5 6 2019
6 7 2022
7 8 2024

Adding a Surrogate Key to the State Dimension

To improve warehouse design and prepare for future multi-dataset integration, I create a surrogate key for the state dimension table.

Creating the State Dimension Table with Surrogate Keys

To support future warehouse joins and multi-dataset integration, I create a dedicated state dimension table with surrogate integer keys.

# Create a unique state dimension table from the district dimension dataset.
df_states_dim = (
    df_district_dim[
        ["state_code", "state_name"]
    ]
    .drop_duplicates()
    .sort_values("state_name")
    .reset_index(drop=True)
)

# Add a surrogate integer key for each state.
df_states_dim.insert(
    0,
    "state_key",
    range(1, len(df_states_dim) + 1)
)

# Display the state dimension table.
df_states_dim.head()
state_key state_code state_name
0 1 AL ALABAMA
1 2 AK ALASKA
2 3 AS AMERICAN SAMOA
3 4 AZ ARIZONA
4 5 AR ARKANSAS

Loading the Enhanced State Dimension into PostgreSQL

After adding surrogate keys to the state dimension table, I load the enhanced warehouse dimension into PostgreSQL.

# Load the enhanced state dimension table into PostgreSQL.
df_states_dim.to_sql(
    name="dim_states",
    con=engine,
    schema="public",
    if_exists="replace",
    index=False
)

# Confirm successful PostgreSQL table creation.
print("Enhanced state dimension table loaded successfully into PostgreSQL.")
Enhanced state dimension table loaded successfully into PostgreSQL.

Verifying the Enhanced State Dimension in PostgreSQL

To confirm the enhanced state dimension table loaded correctly, I query PostgreSQL and display the stored surrogate keys and state records.

# Query PostgreSQL to display the enhanced state dimension table.
state_dimension_query = """
SELECT *
FROM public.dim_states
ORDER BY state_key;
"""

# Display the PostgreSQL state dimension table.
pd.read_sql(state_dimension_query, engine)
state_key state_code state_name
0 1 AL ALABAMA
1 2 AK ALASKA
2 3 AS AMERICAN SAMOA
3 4 AZ ARIZONA
4 5 AR ARKANSAS
5 6 BI BUREAU OF INDIAN EDUCATION
6 7 CA CALIFORNIA
7 8 CO COLORADO
8 9 CT CONNECTICUT
9 10 DE DELAWARE
10 11 DC DISTRICT OF COLUMBIA
11 12 FL FLORIDA
12 13 GA GEORGIA
13 14 GU GUAM
14 15 HI HAWAII
15 16 ID IDAHO
16 17 IL ILLINOIS
17 18 IN INDIANA
18 19 IA IOWA
19 20 KS KANSAS
20 21 KY KENTUCKY
21 22 LA LOUISIANA
22 23 ME MAINE
23 24 MD MARYLAND
24 25 MA MASSACHUSETTS
25 26 MI MICHIGAN
26 27 MN MINNESOTA
27 28 MS MISSISSIPPI
28 29 MO MISSOURI
29 30 MT MONTANA
30 31 NE NEBRASKA
31 32 NV NEVADA
32 33 NH NEW HAMPSHIRE
33 34 NJ NEW JERSEY
34 35 NM NEW MEXICO
35 36 NY NEW YORK
36 37 NC NORTH CAROLINA
37 38 ND NORTH DAKOTA
38 39 MP NORTHERN MARIANAS
39 40 OH OHIO
40 41 OK OKLAHOMA
41 42 OR OREGON
42 43 PA PENNSYLVANIA
43 44 PR PUERTO RICO
44 45 RI RHODE ISLAND
45 46 SC SOUTH CAROLINA
46 47 SD SOUTH DAKOTA
47 48 TN TENNESSEE
48 49 TX TEXAS
49 50 VI U.S. VIRGIN ISLANDS
50 51 UT UTAH
51 52 VT VERMONT
52 53 VA VIRGINIA
53 54 WA WASHINGTON
54 55 WV WEST VIRGINIA
55 56 WI WISCONSIN
56 57 WY WYOMING

Creating an Enhanced Fact Table with Surrogate Keys

To improve warehouse normalization and prepare for future multi-dataset integration, I create an enhanced fact table that connects to the state dimension using surrogate keys.

# Merge the existing district school counts fact table with the enhanced state dimension.
df_fact_enhanced = fact_district_school_counts.merge(
    df_states_dim[
        ["state_key", "state_code"]
    ],
    on="state_code",
    how="left"
)

# Reorder columns for warehouse modeling.
df_fact_enhanced = df_fact_enhanced[
    [
        "district_id",
        "state_key",
        "state_code",
        "operational_school_count"
    ]
]

# Display the enhanced fact table.
df_fact_enhanced.head()
district_id state_key state_code operational_school_count
0 100002 1 AL 0
1 100005 1 AL 7
2 100006 1 AL 17
3 100007 1 AL 18
4 100008 1 AL 13

Loading the Enhanced District School Counts Fact Table into PostgreSQL

After adding the state surrogate key to the fact table, I load the enhanced fact table into PostgreSQL for improved warehouse-style analytics.

# Load the enhanced district school counts fact table into PostgreSQL.
df_fact_enhanced.to_sql(
    name="fact_district_school_counts_enhanced",
    con=engine,
    schema="public",
    if_exists="replace",
    index=False
)

# Confirm successful PostgreSQL table creation.
print("Enhanced district school counts fact table loaded successfully into PostgreSQL.")
Enhanced district school counts fact table loaded successfully into PostgreSQL.

Verifying the Enhanced Fact Table in PostgreSQL

To confirm the enhanced fact table loaded correctly, I query PostgreSQL and display the warehouse-ready records with surrogate state keys.

# Query PostgreSQL to display the enhanced fact table.
enhanced_fact_query = """
SELECT *
FROM public.fact_district_school_counts_enhanced
LIMIT 10;
"""

# Display the enhanced PostgreSQL fact table.
pd.read_sql(enhanced_fact_query, engine)
district_id state_key state_code operational_school_count
0 100002 1 AL 0
1 100005 1 AL 7
2 100006 1 AL 17
3 100007 1 AL 18
4 100008 1 AL 13
5 100009 1 AL 5
6 100010 1 AL 0
7 100011 1 AL 4
8 100012 1 AL 5
9 100013 1 AL 5

Analyzing Operational Schools Using the Enhanced Warehouse Model

To validate the enhanced warehouse model, I run an analytical SQL query using the enhanced fact table and the state dimension table.

This confirms that the surrogate-key relationship supports state-level aggregation correctly.

# SQL query using the enhanced fact table and state surrogate key relationship.
enhanced_state_analysis_query = read_sql_file("state_school_counts_with_surrogate_keys.sql")
# Display the top states using the enhanced warehouse model.
pd.read_sql(enhanced_state_analysis_query, engine)
state_name total_operational_schools
0 CALIFORNIA 10303.0
1 TEXAS 9192.0
2 NEW YORK 4810.0
3 ILLINOIS 4398.0
4 FLORIDA 4186.0
5 OHIO 3543.0
6 MICHIGAN 3468.0
7 PENNSYLVANIA 2932.0
8 MINNESOTA 2741.0
9 NORTH CAROLINA 2705.0

Visualizing Operational Schools Using the Enhanced Warehouse Model

To make the enhanced warehouse analysis easier to interpret, I create a visualization using the surrogate-key fact table and state dimension relationship.

# Store the enhanced SQL analysis result in a DataFrame.
enhanced_state_analysis_df = pd.read_sql(enhanced_state_analysis_query, engine)

# Create a horizontal bar chart.
plt.figure(figsize=(10, 6))
plt.barh(
    enhanced_state_analysis_df["state_name"],
    enhanced_state_analysis_df["total_operational_schools"]
)

# Display the highest values at the top.
plt.gca().invert_yaxis()

# Add chart labels and title.
plt.title("Top 10 States by Operational Schools - Enhanced Warehouse Model")
plt.xlabel("Total Operational Schools")
plt.ylabel("State")

# Improve chart spacing.
plt.subplots_adjust(left=0.28)

# Display the visualization.
plt.show()

Exploring the NAEP Dataset Structure

Before integrating NAEP data into the warehouse, I first inspect the structure of the raw NAEP files.

This helps identify available columns, reporting formats, and educational metrics across years.

# Load the NAEP dataset using the correct header row.
naep_clean_df = pd.read_excel(
    "../data/raw/naep/naep_math_grade8_state_2009_2024.Xls",
    header=8
)

# Display the cleaned NAEP dataset.
naep_clean_df.head()
Year Jurisdiction All students Average scale score
0 2024 National All students 273.833456
1 2024 Alabama All students 261.769896
2 2024 Alaska All students 263.962653
3 2024 Arizona All students 269.656164
4 2024 Arkansas All students 266.194834

Cleaning and Standardizing the NAEP Dataset

To prepare the NAEP dataset for warehouse integration, I standardize column names and select the core analytical fields.

# Standardize NAEP column names.
naep_clean_df.columns = [
    "reporting_year",
    "jurisdiction",
    "student_group",
    "average_scale_score"
]

# Keep only the required analytical columns.
naep_clean_df = naep_clean_df[
    [
        "reporting_year",
        "jurisdiction",
        "student_group",
        "average_scale_score"
    ]
]

# Display the cleaned NAEP dataset.
naep_clean_df.head()
reporting_year jurisdiction student_group average_scale_score
0 2024 National All students 273.833456
1 2024 Alabama All students 261.769896
2 2024 Alaska All students 263.962653
3 2024 Arizona All students 269.656164
4 2024 Arkansas All students 266.194834

Validating the NAEP Dataset

Before integrating NAEP data into the warehouse, I perform several validation checks to confirm data quality and completeness.

# Count missing values in the NAEP dataset.
naep_missing_values = naep_clean_df.isnull().sum()

# Count total NAEP records.
naep_total_records = len(naep_clean_df)

# Display validation results.
naep_validation_results = pd.DataFrame({
    "validation_check": [
        "Total NAEP records",
        "Missing reporting years",
        "Missing jurisdictions",
        "Missing student groups",
        "Missing average scale scores"
    ],
    "result": [
        naep_total_records,
        naep_missing_values["reporting_year"],
        naep_missing_values["jurisdiction"],
        naep_missing_values["student_group"],
        naep_missing_values["average_scale_score"]
    ]
})

# Display NAEP validation results.
naep_validation_results
validation_check result
0 Total NAEP records 490
1 Missing reporting years 1
2 Missing jurisdictions 4
3 Missing student groups 4
4 Missing average scale scores 4

Removing Invalid NAEP Records

To improve warehouse data quality, I remove incomplete NAEP records containing missing analytical values.

# Remove rows with missing critical analytical values.
naep_clean_df = naep_clean_df.dropna(
    subset=[
        "reporting_year",
        "jurisdiction",
        "student_group",
        "average_scale_score"
    ]
)

# Reset the index after cleaning.
naep_clean_df = naep_clean_df.reset_index(drop=True)

# Display the cleaned NAEP dataset.
naep_clean_df.head()
reporting_year jurisdiction student_group average_scale_score
0 2024 National All students 273.833456
1 2024 Alabama All students 261.769896
2 2024 Alaska All students 263.962653
3 2024 Arizona All students 269.656164
4 2024 Arkansas All students 266.194834

Connecting NAEP Records to the Year Dimension

To prepare NAEP data for warehouse integration, I connect each NAEP record to the year dimension using the reporting year.

# Merge NAEP records with the year dimension to add year_key.
naep_with_year_key = naep_clean_df.merge(
    df_year_dim,
    on="reporting_year",
    how="left"
)

# Display NAEP records with year dimension keys.
naep_with_year_key.head()
reporting_year jurisdiction student_group average_scale_score year_key
0 2024 National All students 273.833456 8.0
1 2024 Alabama All students 261.769896 8.0
2 2024 Alaska All students 263.962653 8.0
3 2024 Arizona All students 269.656164 8.0
4 2024 Arkansas All students 266.194834 8.0

Connecting NAEP Records to the State Dimension

To integrate NAEP educational outcomes into the warehouse model, I connect NAEP jurisdictions to the state dimension table.

# Merge NAEP records with the state dimension table.
naep_warehouse_df = naep_with_year_key.merge(
    df_states_dim[
        ["state_key", "state_name"]
    ],
    left_on="jurisdiction",
    right_on="state_name",
    how="left"
)

# Display NAEP records connected to warehouse state keys.
naep_warehouse_df.head()
reporting_year jurisdiction student_group average_scale_score year_key state_key state_name
0 2024 National All students 273.833456 8.0 NaN NaN
1 2024 Alabama All students 261.769896 8.0 NaN NaN
2 2024 Alaska All students 263.962653 8.0 NaN NaN
3 2024 Arizona All students 269.656164 8.0 NaN NaN
4 2024 Arkansas All students 266.194834 8.0 NaN NaN

Standardizing NAEP Jurisdiction Names for Warehouse Joins

Before joining NAEP records to the state dimension, I standardize jurisdiction names so they match the uppercase format used in the warehouse dimension table.

# Create a standardized uppercase jurisdiction field for warehouse joins.
naep_with_year_key["state_name"] = naep_with_year_key["jurisdiction"].str.upper()

# Merge NAEP records with the state dimension table using standardized state names.
naep_warehouse_df = naep_with_year_key.merge(
    df_states_dim[
        ["state_key", "state_name"]
    ],
    on="state_name",
    how="left"
)

# Display NAEP records connected to warehouse state keys.
naep_warehouse_df.head()
reporting_year jurisdiction student_group average_scale_score year_key state_name state_key
0 2024 National All students 273.833456 8.0 NATIONAL NaN
1 2024 Alabama All students 261.769896 8.0 ALABAMA 1.0
2 2024 Alaska All students 263.962653 8.0 ALASKA 2.0
3 2024 Arizona All students 269.656164 8.0 ARIZONA 4.0
4 2024 Arkansas All students 266.194834 8.0 ARKANSAS 5.0

Creating the NAEP Math Performance Fact Table

After connecting NAEP records to the year and state dimensions, I create a warehouse-ready fact table for Grade 8 mathematics performance.

This table stores state-level NAEP math scores by reporting year and connects to the warehouse using surrogate keys.

# Create a NAEP math performance fact table using warehouse surrogate keys.
fact_naep_math_performance = naep_warehouse_df[
    [
        "state_key",
        "year_key",
        "jurisdiction",
        "student_group",
        "average_scale_score"
    ]
].copy()

# Remove records that did not match the state dimension.
# This excludes national aggregate rows and other non-state records.
fact_naep_math_performance = fact_naep_math_performance.dropna(
    subset=[
        "state_key",
        "year_key",
        "average_scale_score"
    ]
)

# Convert surrogate keys to integers after removing missing values.
fact_naep_math_performance["state_key"] = fact_naep_math_performance["state_key"].astype(int)
fact_naep_math_performance["year_key"] = fact_naep_math_performance["year_key"].astype(int)

# Display the NAEP math performance fact table.
fact_naep_math_performance.head()
state_key year_key jurisdiction student_group average_scale_score
1 1 8 Alabama All students 261.769896
2 2 8 Alaska All students 263.962653
3 4 8 Arizona All students 269.656164
4 5 8 Arkansas All students 266.194834
5 7 8 California All students 268.779818

Loading the NAEP Math Performance Fact Table into PostgreSQL

After creating the NAEP math performance fact table, I load it into PostgreSQL so it can be queried with SQL and used for longitudinal educational outcome analysis.

# Load the NAEP math performance fact table into PostgreSQL.
fact_naep_math_performance.to_sql(
    name="fact_naep_math_performance",
    con=engine,
    schema="public",
    if_exists="replace",
    index=False
)

# Confirm successful PostgreSQL table creation.
print("NAEP math performance fact table loaded successfully into PostgreSQL.")
NAEP math performance fact table loaded successfully into PostgreSQL.

Verifying the NAEP Math Performance Fact Table in PostgreSQL

To confirm the NAEP math performance fact table loaded correctly, I query PostgreSQL and preview the stored educational outcome records.

# Query PostgreSQL to preview the NAEP math performance fact table.
naep_math_preview_query = """
SELECT *
FROM public.fact_naep_math_performance
LIMIT 10;
"""

# Display the NAEP math performance records.
pd.read_sql(naep_math_preview_query, engine)
state_key year_key jurisdiction student_group average_scale_score
0 1 8 Alabama All students 261.769896296059
1 2 8 Alaska All students 263.962652757225
2 4 8 Arizona All students 269.656164092934
3 5 8 Arkansas All students 266.194833728906
4 7 8 California All students 268.779817602202
5 8 8 Colorado All students 278.107689530909
6 9 8 Connecticut All students 276.674248483695
7 10 8 Delaware All students 263.057137020663
8 11 8 District of Columbia All students 261.538791377155
9 12 8 Florida All students 267.242907869627

Analyzing Top States by NAEP Grade 8 Math Performance

To begin analyzing educational outcomes, I query PostgreSQL to identify the top 10 states by Grade 8 NAEP mathematics average scale score.

# SQL query to identify the top 10 states by NAEP Grade 8 math performance.
top_naep_math_states_query = read_sql_file("naep_top_states_2024.sql")
# Display the top 10 states by NAEP Grade 8 math score.
pd.read_sql(top_naep_math_states_query, engine)
state_name reporting_year average_scale_score
0 MASSACHUSETTS 2024 283.465558985794
1 WISCONSIN 2024 282.649633523366
2 MINNESOTA 2024 282.087420612381
3 UTAH 2024 281.79013940125
4 NEW JERSEY 2024 281.684697993032
5 SOUTH DAKOTA 2024 281.101137171665
6 NEBRASKA 2024 279.8834957478
7 NORTH DAKOTA 2024 279.765445425712
8 NEW HAMPSHIRE 2024 279.7446422881
9 MONTANA 2024 279.082095551632

Visualizing Top NAEP Grade 8 Math Performance States

To better interpret educational outcome performance, I create a visualization showing the top 10 states by 2024 NAEP Grade 8 mathematics average scale score.

# Store the SQL query result in a DataFrame.
top_naep_math_states_df = pd.read_sql(
    top_naep_math_states_query,
    engine
)

# Convert scores to numeric values and round for readability.
top_naep_math_states_df["average_scale_score"] = pd.to_numeric(
    top_naep_math_states_df["average_scale_score"]
).round(2)

# Sort values so the highest-performing states appear at the top.
top_naep_math_states_df = top_naep_math_states_df.sort_values(
    by="average_scale_score",
    ascending=True
)

# Set a focused baseline so differences between close NAEP scores are visible.
score_baseline = 260

# Create the chart.
plt.figure(figsize=(14, 7))

plt.barh(
    top_naep_math_states_df["state_name"],
    top_naep_math_states_df["average_scale_score"] - score_baseline,
    left=score_baseline
)

# Add score labels at the end of each bar.
for index, value in enumerate(top_naep_math_states_df["average_scale_score"]):
    plt.text(
        value + 0.15,
        index,
        f"{value:.2f}",
        va="center"
    )

# Force the state names to display on the y-axis.
plt.yticks(
    range(len(top_naep_math_states_df)),
    top_naep_math_states_df["state_name"]
)

# Focus the x-axis range for better comparison visibility.
plt.xlim(278, 285)

# Add chart title and labels.
plt.title("Top 10 States by 2024 NAEP Grade 8 Math Scores")
plt.xlabel("Average Scale Score")
plt.ylabel("State")

# Add enough left margin so state names do not get clipped.
plt.subplots_adjust(left=0.28)

# Display chart.
plt.show()

Analyzing Lowest States by NAEP Grade 8 Math Performance

To better understand educational performance variation across states, I also analyze the lowest-performing states by 2024 NAEP Grade 8 mathematics average scale score.

# SQL query to identify the lowest 10 states by NAEP Grade 8 math performance.
lowest_naep_math_states_query = read_sql_file("naep_lowest_states_2024.sql")
# Display the lowest-performing states by NAEP Grade 8 math score.
pd.read_sql(lowest_naep_math_states_query, engine)
state_name reporting_year average_scale_score
0 PUERTO RICO 2024 216.087040632243
1 NEW MEXICO 2024 256.210854081077
2 WEST VIRGINIA 2024 260.769078185284
3 DISTRICT OF COLUMBIA 2024 261.538791377155
4 ALABAMA 2024 261.769896296059
5 DELAWARE 2024 263.057137020663
6 ALASKA 2024 263.962652757225
7 OKLAHOMA 2024 264.454031154883
8 NEVADA 2024 265.177161678326
9 ARKANSAS 2024 266.194833728906

Visualizing Lowest NAEP Grade 8 Math Performance States

To better understand lower-performing educational outcome trends, I create a visualization showing the lowest-performing states by 2024 NAEP Grade 8 mathematics average scale score. Puerto Rico appears in the NAEP source data but is excluded from this state-level visualization because the chart focuses on U.S. states only.

# Store the SQL query result in a DataFrame.
lowest_naep_math_states_df = pd.read_sql(
    lowest_naep_math_states_query,
    engine
)

# Convert scores to numeric values and round for readability.
lowest_naep_math_states_df["average_scale_score"] = pd.to_numeric(
    lowest_naep_math_states_df["average_scale_score"]
).round(2)

# Remove Puerto Rico so the visualization scale remains readable.
lowest_naep_math_states_df = lowest_naep_math_states_df[
    lowest_naep_math_states_df["state_name"] != "PUERTO RICO"
]

# Sort values for visualization.
lowest_naep_math_states_df = lowest_naep_math_states_df.sort_values(
    by="average_scale_score",
    ascending=True
)

# Create the chart.
plt.figure(figsize=(14, 7))

plt.barh(
    lowest_naep_math_states_df["state_name"],
    lowest_naep_math_states_df["average_scale_score"]
)

# Add score labels.
for index, value in enumerate(lowest_naep_math_states_df["average_scale_score"]):
    plt.text(
        value + 0.1,
        index,
        f"{value:.2f}",
        va="center"
    )

# Focus the x-axis range.
plt.xlim(255, 267)

# Add chart labels and title.
plt.title("Lowest Performing States by 2024 NAEP Grade 8 Math Scores")
plt.xlabel("Average Scale Score")
plt.ylabel("State")

# Improve spacing so labels fit properly.
plt.subplots_adjust(left=0.28)

# Display chart.
plt.show()

Analyzing Post-Pandemic NAEP Math Score Changes

To quantify the post-pandemic change in educational outcomes, I compare each selected state’s 2019 NAEP Grade 8 mathematics score against its 2024 score.

# SQL query to compare 2019 and 2024 NAEP Grade 8 math scores.
post_pandemic_change_query = read_sql_file("naep_post_pandemic_delta.sql")
# Display the post-pandemic score change results.
pd.read_sql(post_pandemic_change_query, engine)
state_name score_2019 score_2024 score_change
0 MASSACHUSETTS 294.474575 283.465559 -11.009016
1 NEW JERSEY 291.822772 281.684698 -10.138074
2 MINNESOTA 290.785141 282.087421 -8.697721
3 UTAH 284.930389 281.790139 -3.140250

Data Extraction — CRDC LEA Characteristics Dataset

To begin integrating education technology and district-level characteristics into the warehouse, I load the CRDC LEA Characteristics dataset.

This dataset will help support future analysis of district characteristics, technology access indicators, and educational equity patterns.

# Define the relative path to the CRDC LEA Characteristics dataset.
crdc_file_path = "../data/raw/crdc/LEA Characteristics.csv"

# Load the CRDC CSV file into a pandas DataFrame.
df_crdc = pd.read_csv(crdc_file_path)

# Display the first five records to confirm the dataset loaded correctly.
df_crdc.head()
C:\Users\User\AppData\Local\Temp\ipykernel_21316\130108529.py:5: DtypeWarning: Columns (0: LEAID) have mixed types. Specify dtype option on import or set low_memory=False.
  df_crdc = pd.read_csv(crdc_file_path)
LEA_STATE LEA_STATE_NAME LEAID LEA_NAME LEA_ADDRESS LEA_CITY LEA_ZIP CJJ LEA_CRCOORD_SEX_IND LEA_CRCOORD_RAC_IND ... LEA_HBPOLICY_IND LEA_HBPOLICYURL_IND LEA_HBPOLICY_URL LEA_ENR LEA_PSENR_A3 LEA_PSENR_A4 LEA_PSENR_A5 LEA_ENR_NONLEAFAC LEA_SCHOOLS LEA_PS_IND
0 MA MASSACHUSETTS 2506090 Hingham 220 Central Street Hingham 2043 No Yes Yes ... Yes Yes https://z2policy.ctspublish.com/masc/browse/hi... 3912 0 0 24 38 6 Yes
1 MI MICHIGAN 2627210 Owosso Public Schools 645 Alger Ave Owosso 48867 No Yes Yes ... Yes Yes owosso.k12.mi.us 2994 34 41 0 0 7 Yes
2 TX TEXAS 4800278 CROSSTIMBERS ACADEMY 236 HARMONY RD WEATHERFORD 76087 No Yes Yes ... Yes Yes https://ctacharter.com/ 149 -9 -9 -9 0 1 No
3 OK OKLAHOMA 4012390 GANS 204 N STACY GANS 74936 No Yes Yes ... Yes Yes www.gans.k12.ok.us 351 16 22 30 0 2 Yes
4 CA CALIFORNIA 0619230 Junction Elementary 98821 Highway 96 Somes Bar 95568 No Yes Yes ... Yes No -9 16 -9 -9 -9 0 1 No

5 rows × 40 columns

Initial Data Inspection — CRDC Dataset

In this step, I inspect the CRDC dataset to understand its size, structure, and available fields before selecting columns for warehouse integration.

# Display the number of rows and columns in the CRDC dataset.
crdc_row_count, crdc_column_count = df_crdc.shape

print(f"Number of CRDC rows: {crdc_row_count}")
print(f"Number of CRDC columns: {crdc_column_count}")

# Display all CRDC column names to identify fields useful for district characteristics and technology access analysis.
df_crdc.columns.tolist()
Number of CRDC rows: 17704
Number of CRDC columns: 40
['LEA_STATE',
 'LEA_STATE_NAME',
 'LEAID',
 'LEA_NAME',
 'LEA_ADDRESS',
 'LEA_CITY',
 'LEA_ZIP',
 'CJJ',
 'LEA_CRCOORD_SEX_IND',
 'LEA_CRCOORD_RAC_IND',
 'LEA_CRCOORD_DIS_IND',
 'LEA_CRCOORD_SEX_FN',
 'XLEA_CRCOORD_SEX_FN',
 'LEA_CRCOORD_SEX_LN',
 'XLEA_CRCOORD_SEX_LN',
 'LEA_CRCOORD_SEX_EM',
 'XLEA_CRCOORD_SEX_EM',
 'LEA_CRCOORD_RAC_FN',
 'XLEA_CRCOORD_RAC_FN',
 'LEA_CRCOORD_RAC_LN',
 'XLEA_CRCOORD_RAC_LN',
 'LEA_CRCOORD_RAC_EM',
 'XLEA_CRCOORD_RAC_EM',
 'LEA_CRCOORD_DIS_FN',
 'XLEA_CRCOORD_DIS_FN',
 'LEA_CRCOORD_DIS_LN',
 'XLEA_CRCOORD_DIS_LN',
 'LEA_CRCOORD_DIS_EM',
 'XLEA_CRCOORD_DIS_EM',
 'LEA_DESEGPLAN',
 'LEA_HBPOLICY_IND',
 'LEA_HBPOLICYURL_IND',
 'LEA_HBPOLICY_URL',
 'LEA_ENR',
 'LEA_PSENR_A3',
 'LEA_PSENR_A4',
 'LEA_PSENR_A5',
 'LEA_ENR_NONLEAFAC',
 'LEA_SCHOOLS',
 'LEA_PS_IND']

Selecting CRDC District Characteristic Columns

After inspecting the CRDC dataset, I select district-level fields that can support warehouse integration and future educational equity analysis.

# Select CRDC district characteristic columns for warehouse preparation.
crdc_columns = [
    "LEAID",
    "LEA_NAME",
    "LEA_STATE",
    "LEA_STATE_NAME",
    "LEA_CITY",
    "LEA_ENR",
    "LEA_SCHOOLS",
    "LEA_DESEGPLAN",
    "LEA_HBPOLICY_IND",
    "LEA_PS_IND"
]

# Create a district-focused CRDC DataFrame.
df_crdc_districts = df_crdc[crdc_columns].copy()

# Display the selected CRDC district characteristic fields.
df_crdc_districts.head()
LEAID LEA_NAME LEA_STATE LEA_STATE_NAME LEA_CITY LEA_ENR LEA_SCHOOLS LEA_DESEGPLAN LEA_HBPOLICY_IND LEA_PS_IND
0 2506090 Hingham MA MASSACHUSETTS Hingham 3912 6 No Yes Yes
1 2627210 Owosso Public Schools MI MICHIGAN Owosso 2994 7 No Yes Yes
2 4800278 CROSSTIMBERS ACADEMY TX TEXAS WEATHERFORD 149 1 No Yes No
3 4012390 GANS OK OKLAHOMA GANS 351 2 No Yes Yes
4 0619230 Junction Elementary CA CALIFORNIA Somes Bar 16 1 No Yes No

Cleaning the CRDC District Characteristics Dataset

Before loading CRDC district characteristics into PostgreSQL, I standardize column names and remove duplicate district records.

# Create a copy of the selected CRDC district fields.
df_crdc_district_dim = df_crdc_districts.copy()

# Remove duplicate district records using the LEAID identifier.
df_crdc_district_dim = df_crdc_district_dim.drop_duplicates(subset=["LEAID"])

# Standardize column names for PostgreSQL compatibility.
df_crdc_district_dim.columns = [
    "district_id",
    "district_name",
    "state_code",
    "state_name",
    "city",
    "enrollment",
    "school_count",
    "desegregation_plan_indicator",
    "harassment_bullying_policy_indicator",
    "preschool_indicator"
]

# Reset the index after cleaning.
df_crdc_district_dim = df_crdc_district_dim.reset_index(drop=True)

# Display the cleaned CRDC district characteristics dataset.
df_crdc_district_dim.head()
district_id district_name state_code state_name city enrollment school_count desegregation_plan_indicator harassment_bullying_policy_indicator preschool_indicator
0 2506090 Hingham MA MASSACHUSETTS Hingham 3912 6 No Yes Yes
1 2627210 Owosso Public Schools MI MICHIGAN Owosso 2994 7 No Yes Yes
2 4800278 CROSSTIMBERS ACADEMY TX TEXAS WEATHERFORD 149 1 No Yes No
3 4012390 GANS OK OKLAHOMA GANS 351 2 No Yes Yes
4 0619230 Junction Elementary CA CALIFORNIA Somes Bar 16 1 No Yes No

Loading CRDC District Characteristics into PostgreSQL

After cleaning the CRDC district characteristics dataset, I load it into PostgreSQL as a warehouse table for future district-level enrichment analysis.

# Load the cleaned CRDC district characteristics table into PostgreSQL.
df_crdc_district_dim.to_sql(
    name="dim_crdc_district_characteristics",
    con=engine,
    schema="public",
    if_exists="replace",
    index=False
)

# Confirm successful PostgreSQL table creation.
print("CRDC district characteristics table loaded successfully into PostgreSQL.")
CRDC district characteristics table loaded successfully into PostgreSQL.

Verifying the CRDC District Characteristics Table in PostgreSQL

To confirm the CRDC district characteristics table loaded correctly, I query PostgreSQL and preview the stored records.

# Query PostgreSQL to preview the CRDC district characteristics table.
crdc_preview_query = """
SELECT *
FROM public.dim_crdc_district_characteristics
LIMIT 10;
"""

# Display the CRDC district characteristics records.
pd.read_sql(crdc_preview_query, engine)
district_id district_name state_code state_name city enrollment school_count desegregation_plan_indicator harassment_bullying_policy_indicator preschool_indicator
0 2506090 Hingham MA MASSACHUSETTS Hingham 3912 6 No Yes Yes
1 2627210 Owosso Public Schools MI MICHIGAN Owosso 2994 7 No Yes Yes
2 4800278 CROSSTIMBERS ACADEMY TX TEXAS WEATHERFORD 149 1 No Yes No
3 4012390 GANS OK OKLAHOMA GANS 351 2 No Yes Yes
4 0619230 Junction Elementary CA CALIFORNIA Somes Bar 16 1 No Yes No
5 2913290 GREEN RIDGE R-VIII MO MISSOURI GREEN RIDGE 350 2 No Yes No
6 3000004 Mountain View Elem MT MONTANA Cut Bank 11 1 No Yes No
7 0805940 Otis School District No. R-3 CO COLORADO OTIS 211 2 No Yes Yes
8 0602174 iLead Agua Dulce District CA CALIFORNIA Agua Dulce 320 1 No Yes No
9 0624600 Merced City Elementary CA CALIFORNIA Merced 10922 19 No Yes Yes

Validating the CRDC District Characteristics Dataset

Before using the CRDC district characteristics data for analysis, I perform validation checks to confirm completeness and uniqueness.

# Count missing district identifiers.
crdc_missing_district_ids = df_crdc_district_dim["district_id"].isnull().sum()

# Count missing state names.
crdc_missing_state_names = df_crdc_district_dim["state_name"].isnull().sum()

# Count missing enrollment values.
crdc_missing_enrollment = df_crdc_district_dim["enrollment"].isnull().sum()

# Count duplicate district identifiers.
crdc_duplicate_district_ids = df_crdc_district_dim["district_id"].duplicated().sum()

# Display validation results.
crdc_validation_results = pd.DataFrame({
    "validation_check": [
        "Missing district IDs",
        "Missing state names",
        "Missing enrollment values",
        "Duplicate district IDs"
    ],
    "result": [
        crdc_missing_district_ids,
        crdc_missing_state_names,
        crdc_missing_enrollment,
        crdc_duplicate_district_ids
    ]
})

crdc_validation_results
validation_check result
0 Missing district IDs 0
1 Missing state names 0
2 Missing enrollment values 0
3 Duplicate district IDs 0

Analyzing CRDC District Enrollment by State

To begin using the CRDC district characteristics table analytically, I summarize total district enrollment by state.

# SQL query to summarize total CRDC enrollment by state.
crdc_enrollment_by_state_query = read_sql_file("crdc_enrollment_by_state.sql")
# Display the top states by total CRDC district enrollment.
pd.read_sql(crdc_enrollment_by_state_query, engine)
state_name total_enrollment
0 CALIFORNIA 5958775.0
1 TEXAS 5419285.0
2 FLORIDA 2961487.0
3 NEW YORK 2745258.0
4 ILLINOIS 1882620.0
5 GEORGIA 1751403.0
6 OHIO 1717635.0
7 PENNSYLVANIA 1707999.0
8 NORTH CAROLINA 1537968.0
9 MICHIGAN 1441417.0

Visualizing CRDC District Enrollment by State

To better understand district enrollment distribution, I create a visualization showing the states with the highest total CRDC district enrollment.

# Store the CRDC enrollment query results in a DataFrame.
crdc_enrollment_df = pd.read_sql(
    crdc_enrollment_by_state_query,
    engine
)

# Convert enrollment values to numeric.
crdc_enrollment_df["total_enrollment"] = pd.to_numeric(
    crdc_enrollment_df["total_enrollment"]
)

# Sort values for visualization.
crdc_enrollment_df = crdc_enrollment_df.sort_values(
    by="total_enrollment",
    ascending=True
)

# Create the chart.
plt.figure(figsize=(12, 6))

plt.barh(
    crdc_enrollment_df["state_name"],
    crdc_enrollment_df["total_enrollment"]
)

# Add chart title and labels.
plt.title("Top States by Total CRDC District Enrollment")
plt.xlabel("Total Enrollment")
plt.ylabel("State")

# Improve spacing.
plt.tight_layout()

# Display chart.
plt.show()

Preparing CRDC and NAEP State-Level Comparison Data

To begin connecting district characteristics with educational outcomes, I aggregate CRDC enrollment data to the state level and prepare it for comparison with NAEP state performance results.

# SQL query to aggregate CRDC district enrollment to the state level.
crdc_state_enrollment_query = """
SELECT
    state_name,
    SUM(CAST(enrollment AS NUMERIC)) AS total_enrollment
FROM public.dim_crdc_district_characteristics
GROUP BY state_name
ORDER BY state_name;
"""

# Display state-level CRDC enrollment totals.
pd.read_sql(crdc_state_enrollment_query, engine)
state_name total_enrollment
0 ALABAMA 738221.0
1 ALASKA 129730.0
2 ARIZONA 1127982.0
3 ARKANSAS 490007.0
4 CALIFORNIA 5958775.0
5 COLORADO 886633.0
6 CONNECTICUT 510407.0
7 DELAWARE 140631.0
8 DISTRICT OF COLUMBIA 89300.0
9 FLORIDA 2961487.0
10 GEORGIA 1751403.0
11 HAWAII 173178.0
12 IDAHO 315523.0
13 ILLINOIS 1882620.0
14 INDIANA 1038088.0
15 IOWA 510553.0
16 KANSAS 481676.0
17 KENTUCKY 653132.0
18 LOUISIANA 664811.0
19 MAINE 164334.0
20 MARYLAND 883833.0
21 MASSACHUSETTS 919041.0
22 MICHIGAN 1441417.0
23 MINNESOTA 890878.0
24 MISSISSIPPI 485450.0
25 MISSOURI 899203.0
26 MONTANA 160320.0
27 NEBRASKA 325764.0
28 NEVADA 486753.0
29 NEW HAMPSHIRE 171005.0
30 NEW JERSEY 1377021.0
31 NEW MEXICO 309714.0
32 NEW YORK 2745258.0
33 NORTH CAROLINA 1537968.0
34 NORTH DAKOTA 116594.0
35 OHIO 1717635.0
36 OKLAHOMA 699604.0
37 OREGON 553921.0
38 PENNSYLVANIA 1707999.0
39 PUERTO RICO 259535.0
40 RHODE ISLAND 181359.0
41 SOUTH CAROLINA 742636.0
42 SOUTH DAKOTA 141475.0
43 TENNESSEE 989550.0
44 TEXAS 5419285.0
45 UTAH 705970.0
46 VERMONT 82602.0
47 VIRGINIA 1291606.0
48 WASHINGTON 1091100.0
49 WEST VIRGINIA 248994.0
50 WISCONSIN 829306.0
51 WYOMING 92493.0

Joining CRDC Enrollment with 2024 NAEP Math Performance

To begin comparing district characteristics with educational outcomes, I join state-level CRDC enrollment totals with 2024 NAEP Grade 8 mathematics performance scores.

# SQL query to join CRDC state enrollment totals with 2024 NAEP math performance.
crdc_naep_comparison_query = read_sql_file("crdc_naep_comparison.sql")
# Display the joined CRDC and NAEP comparison dataset.
pd.read_sql(crdc_naep_comparison_query, engine)
state_name total_enrollment average_scale_score
0 MASSACHUSETTS 919041.0 283.465559
1 WISCONSIN 829306.0 282.649634
2 MINNESOTA 890878.0 282.087421
3 UTAH 705970.0 281.790139
4 NEW JERSEY 1377021.0 281.684698
5 SOUTH DAKOTA 141475.0 281.101137
6 NEBRASKA 325764.0 279.883496
7 NORTH DAKOTA 116594.0 279.765445
8 NEW HAMPSHIRE 171005.0 279.744642
9 MONTANA 160320.0 279.082096
10 OHIO 1717635.0 278.812721
11 WYOMING 92493.0 278.315324
12 INDIANA 1038088.0 278.185064
13 COLORADO 886633.0 278.107690
14 IDAHO 315523.0 278.104479
15 ILLINOIS 1882620.0 277.416366
16 CONNECTICUT 510407.0 276.674248
17 PENNSYLVANIA 1707999.0 276.245915
18 TENNESSEE 989550.0 275.982988
19 NORTH CAROLINA 1537968.0 275.821518
20 VERMONT 82602.0 275.617661
21 VIRGINIA 1291606.0 274.988690
22 IOWA 510553.0 274.823931
23 KANSAS 481676.0 274.136728
24 WASHINGTON 1091100.0 273.665174
25 MAINE 164334.0 272.621883
26 NEW YORK 2745258.0 270.851724
27 KENTUCKY 653132.0 270.813300
28 MISSOURI 899203.0 270.363197
29 HAWAII 173178.0 270.041103
30 MICHIGAN 1441417.0 269.978774
31 RHODE ISLAND 181359.0 269.826789
32 ARIZONA 1127982.0 269.656164
33 TEXAS 5419285.0 269.393041
34 MISSISSIPPI 485450.0 269.060485
35 CALIFORNIA 5958775.0 268.779818
36 GEORGIA 1751403.0 268.655805
37 MARYLAND 883833.0 268.213760
38 SOUTH CAROLINA 742636.0 268.006371
39 OREGON 553921.0 267.940476
40 FLORIDA 2961487.0 267.242908
41 LOUISIANA 664811.0 266.783264
42 ARKANSAS 490007.0 266.194834
43 NEVADA 486753.0 265.177162
44 OKLAHOMA 699604.0 264.454031
45 ALASKA 129730.0 263.962653
46 DELAWARE 140631.0 263.057137
47 ALABAMA 738221.0 261.769896
48 DISTRICT OF COLUMBIA 89300.0 261.538791
49 WEST VIRGINIA 248994.0 260.769078
50 NEW MEXICO 309714.0 256.210854
51 PUERTO RICO 259535.0 216.087041

Visualizing CRDC Enrollment and NAEP Math Performance

To begin exploring relationships between district characteristics and educational outcomes, I create a scatterplot comparing total CRDC district enrollment with 2024 NAEP Grade 8 mathematics performance.

# Store the joined CRDC and NAEP comparison query result in a DataFrame.
crdc_naep_comparison_df = pd.read_sql(
    crdc_naep_comparison_query,
    engine
)

# Convert numeric fields.
crdc_naep_comparison_df["total_enrollment"] = pd.to_numeric(
    crdc_naep_comparison_df["total_enrollment"]
)

crdc_naep_comparison_df["average_scale_score"] = pd.to_numeric(
    crdc_naep_comparison_df["average_scale_score"]
).round(2)

# Select the top 10 states by enrollment.
top_enrollment_states = crdc_naep_comparison_df.nlargest(
    10,
    "total_enrollment"
)

# Sort values for visualization.
top_enrollment_states = top_enrollment_states.sort_values(
    by="total_enrollment",
    ascending=True
)

# Create the chart.
plt.figure(figsize=(12, 6))

bars = plt.barh(
    top_enrollment_states["state_name"],
    top_enrollment_states["total_enrollment"]
)

# Add NAEP score labels to bars.
for bar, score in zip(
    bars,
    top_enrollment_states["average_scale_score"]
):
    plt.text(
        bar.get_width(),
        bar.get_y() + bar.get_height() / 2,
        f" NAEP: {score}",
        va="center"
    )

# Add chart title and labels.
plt.title("Top States by Enrollment with 2024 NAEP Math Scores")
plt.xlabel("Total State Enrollment")
plt.ylabel("State")

# Add additional chart spacing for labels.
plt.xlim(
    0,
    top_enrollment_states["total_enrollment"].max() * 1.15
)

# Improve spacing.
plt.tight_layout()

# Display chart.
plt.show()

CRDC and NAEP Comparison Interpretation

This comparison connects CRDC district enrollment totals with 2024 NAEP Grade 8 mathematics performance scores.

The purpose of this step is not to prove a causal relationship, but to begin exploring whether large enrollment systems show different performance patterns than smaller state systems.

Key observations from this comparison include:

  • Larger enrollment states do not automatically produce higher NAEP mathematics scores.
  • Some high-enrollment states perform near the middle of the NAEP score range.
  • Smaller or mid-sized states may outperform larger systems on average scale scores.
  • Enrollment size alone is not enough to explain academic performance outcomes.
  • Future CRDC technology-access indicators will provide stronger explanatory variables for technology impact analysis.

This step establishes the first bridge between district characteristics and educational outcomes inside the warehouse.

Analyzing Enrollment Size and Educational Performance

Now I want to compare state enrollment totals against 2024 NAEP Grade 8 math scores. Two changes from the obvious approach matter here.

First, I use pd.qcut instead of pd.cut for the size buckets. With states like California and Texas at the top of the distribution, equal-width bins would put roughly two states in “Large” and everything else in “Small” — the means wouldn’t be comparable. Quantile-based bins put a roughly equal count of states in each bucket, which makes the group averages meaningful.

Second, I compute a Spearman rank correlation alongside the bucket summary. The buckets give a quick visual; the correlation gives an honest single number for whether enrollment scale and NAEP scores actually move together.

from scipy.stats import spearmanr

# Working copy of the joined CRDC enrollment + NAEP 2024 dataset.
enrollment_performance_analysis_df = crdc_naep_comparison_df.copy()

# Quantile-based size buckets: each tercile holds roughly the same number of states.
enrollment_performance_analysis_df["enrollment_tercile"] = pd.qcut(
    enrollment_performance_analysis_df["total_enrollment"],
    q=3,
    labels=["Small (bottom third)", "Medium (middle third)", "Large (top third)"]
)

# Mean score plus state count per bucket — count matters; it tells you how
# unbalanced the buckets actually are.
enrollment_performance_summary = (
    enrollment_performance_analysis_df
    .groupby("enrollment_tercile", observed=True)
    .agg(
        states_in_bucket=("state_name", "count"),
        mean_naep_score=("average_scale_score", "mean"),
    )
    .reset_index()
)

# Spearman rank correlation between enrollment and NAEP score.
# Rank-based, so robust to California/Texas being extreme outliers.
rho, p_value = spearmanr(
    enrollment_performance_analysis_df["total_enrollment"],
    enrollment_performance_analysis_df["average_scale_score"],
)

print(enrollment_performance_summary.to_string(index=False))
print()
print(f"Spearman rank correlation: rho = {rho:.3f}, p-value = {p_value:.3f}")
   enrollment_tercile  states_in_bucket  mean_naep_score
 Small (bottom third)                18       268.881667
Medium (middle third)                17       271.464118
    Large (top third)                17       274.167059

Spearman rank correlation: rho = 0.104, p-value = 0.462

Visualizing Enrollment Size and Educational Performance

A scatter plot is the right shape for this question. Each point is one state; the x-axis is total enrollment, the y-axis is the 2024 NAEP Grade 8 math score. If there’s a relationship, you see it in the cloud.

fig, ax = plt.subplots(figsize=(11, 6))

ax.scatter(
    enrollment_performance_analysis_df["total_enrollment"],
    enrollment_performance_analysis_df["average_scale_score"],
    s=40,
)

# Label the most extreme points so the chart isn't anonymous.
extremes = pd.concat([
    enrollment_performance_analysis_df.nlargest(3, "total_enrollment"),
    enrollment_performance_analysis_df.nsmallest(3, "total_enrollment"),
    enrollment_performance_analysis_df.nlargest(2, "average_scale_score"),
    enrollment_performance_analysis_df.nsmallest(2, "average_scale_score"),
]).drop_duplicates(subset=["state_name"])

for _, row in extremes.iterrows():
    ax.annotate(
        row["state_name"].title(),
        (row["total_enrollment"], row["average_scale_score"]),
        xytext=(5, 5), textcoords="offset points", fontsize=9,
    )

ax.set_title(f"State Enrollment vs. 2024 NAEP Grade 8 Math Score (Spearman ρ = {rho:.2f})")
ax.set_xlabel("Total State Enrollment (CRDC)")
ax.set_ylabel("2024 NAEP Grade 8 Math — Average Scale Score")
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Enrollment Size and Educational Performance Interpretation

The Spearman correlation is close to zero, which is the honest answer: across U.S. states in 2024, total enrollment and NAEP Grade 8 math performance are not meaningfully related. Massachusetts is mid-sized and tops the chart; Texas and California are huge and sit near the middle of the score distribution; the highest-scoring small state isn’t dramatically higher than the highest-scoring large state.

A few caveats worth being explicit about:

  • Aggregate state-level data can’t answer a district-level question. Two states with the same average can have very different distributions underneath.
  • Total enrollment is a proxy for “system size,” not for any specific causal factor. Funding, demographics, instructional approach, and dozens of other things vary across states.
  • The bucket means are informative but the buckets are small (~17 states each). Don’t read too much into small differences between bucket means.

The takeaway for this project is more methodological than substantive: enrollment scale alone isn’t useful as an explanatory variable for academic performance, so the more interesting question is whether technology participation — distance education enrollment specifically — shows a relationship. That’s what the next section explores.

Project Summary and Key Findings

This project demonstrates the development of an end-to-end educational technology impact analysis workflow using publicly available education and technology-related datasets.

The primary objective is to evaluate educational outcome trends and explore how district-level characteristics, enrollment patterns, and technology-enabled learning participation may relate to student performance.

To support this analysis, I:

  • Collected educational and technology-related datasets from multiple public sources
  • Built a PostgreSQL dimensional warehouse environment
  • Designed ETL workflows for data integration and transformation
  • Connected district, state, year, educational outcome, and distance education datasets
  • Performed SQL-based analytical reporting
  • Created visualizations to communicate educational trends and comparisons
  • Evaluated post-pandemic educational performance patterns
  • Integrated CRDC district characteristics and distance education participation metrics into the warehouse environment

Key findings from the current analysis include:

  • Educational performance varies significantly across states.
  • Post-pandemic NAEP mathematics scores declined across many educational systems.
  • Larger enrollment systems do not necessarily produce stronger educational outcomes.
  • Distance education participation varies widely across states and districts.
  • Higher distance education participation alone does not automatically correspond to stronger NAEP mathematics outcomes.
  • Technology adoption appears most meaningful when evaluated alongside implementation quality, access, instructional context, enrollment scale, and educational equity.

This project demonstrates practical experience in data engineering, analytics engineering, PostgreSQL development, ETL pipeline design, SQL analytics, data visualization, and research-oriented reporting workflows.

Loading the CRDC Distance Education Dataset

To evaluate technology adoption and digital learning participation, I load the CRDC Distance Education dataset into the analytics workflow.

# Load the CRDC Distance Education dataset.
distance_education_path = "../data/raw/crdc/Distance Education.csv"

df_distance_education = pd.read_csv(
    distance_education_path,
    encoding="latin1",
    low_memory=False
)

# Display dataset dimensions.
print("Number of rows:", df_distance_education.shape[0])
print("Number of columns:", df_distance_education.shape[1])

# Display column names.
df_distance_education.columns.tolist()
Number of rows: 17604
Number of columns: 29
['LEA_STATE',
 'LEA_STATE_NAME',
 'LEAID',
 'LEA_NAME',
 'LEA_ADDRESS',
 'LEA_CITY',
 'LEA_ZIP',
 'CJJ',
 'LEA_DISTED_IND',
 'LEA_DISTEDENR_HI_M',
 'LEA_DISTEDENR_HI_F',
 'LEA_DISTEDENR_AM_M',
 'LEA_DISTEDENR_AM_F',
 'LEA_DISTEDENR_AS_M',
 'LEA_DISTEDENR_AS_F',
 'LEA_DISTEDENR_HP_M',
 'LEA_DISTEDENR_HP_F',
 'LEA_DISTEDENR_BL_M',
 'LEA_DISTEDENR_BL_F',
 'LEA_DISTEDENR_WH_M',
 'LEA_DISTEDENR_WH_F',
 'LEA_DISTEDENR_TR_M',
 'LEA_DISTEDENR_TR_F',
 'TOT_DISTEDENR_M',
 'TOT_DISTEDENR_F',
 'LEA_DISTEDENR_LEP_M',
 'LEA_DISTEDENR_LEP_F',
 'LEA_DISTEDENR_IDEA_M',
 'LEA_DISTEDENR_IDEA_F']

Cleaning and Standardizing the Distance Education Dataset

Before integrating the distance education dataset into the warehouse workflow, I standardize column names and prepare the dataset for analytical processing.

# Create a cleaned copy of the distance education dataset.
df_distance_education_clean = df_distance_education.copy()

# Standardize column names.
df_distance_education_clean.columns = (
    df_distance_education_clean.columns
    .str.lower()
    .str.strip()
)

# Rename key columns for readability.
df_distance_education_clean = df_distance_education_clean.rename(columns={
    "lea_state_name": "state_name",
    "leaid": "district_id",
    "lea_name": "district_name",
    "lea_disted_ind": "distance_education_indicator",
    "tot_distedenr_m": "male_distance_education_enrollment",
    "tot_distedenr_f": "female_distance_education_enrollment"
})

# Display cleaned column names.
df_distance_education_clean.columns.tolist()
['lea_state',
 'state_name',
 'district_id',
 'district_name',
 'lea_address',
 'lea_city',
 'lea_zip',
 'cjj',
 'distance_education_indicator',
 'lea_distedenr_hi_m',
 'lea_distedenr_hi_f',
 'lea_distedenr_am_m',
 'lea_distedenr_am_f',
 'lea_distedenr_as_m',
 'lea_distedenr_as_f',
 'lea_distedenr_hp_m',
 'lea_distedenr_hp_f',
 'lea_distedenr_bl_m',
 'lea_distedenr_bl_f',
 'lea_distedenr_wh_m',
 'lea_distedenr_wh_f',
 'lea_distedenr_tr_m',
 'lea_distedenr_tr_f',
 'male_distance_education_enrollment',
 'female_distance_education_enrollment',
 'lea_distedenr_lep_m',
 'lea_distedenr_lep_f',
 'lea_distedenr_idea_m',
 'lea_distedenr_idea_f']

Analyzing Distance Education Adoption by State

To begin evaluating technology adoption patterns, I summarize the number of districts reporting distance education participation by state.

# SQL-style aggregation using pandas.
distance_education_by_state = (
    df_distance_education_clean
    .groupby("state_name")["distance_education_indicator"]
    .apply(lambda x: (x == "Yes").sum())
    .reset_index(name="districts_using_distance_education")
)

# Sort results.
distance_education_by_state = (
    distance_education_by_state
    .sort_values(
        by="districts_using_distance_education",
        ascending=False
    )
)

# Display top states using distance education.
distance_education_by_state.head(10)
state_name districts_using_distance_education
44 TEXAS 506
22 MICHIGAN 364
38 PENNSYLVANIA 315
15 IOWA 275
35 OHIO 275
32 NEW YORK 272
3 ARKANSAS 247
4 CALIFORNIA 239
25 MISSOURI 229
50 WISCONSIN 220

Visualizing Distance Education Adoption by State

To better understand technology adoption patterns, I create a visualization showing the states with the highest number of districts participating in distance education.

# Select the top 10 states using distance education.
top_distance_education_states = (
    distance_education_by_state
    .head(10)
    .sort_values(
        by="districts_using_distance_education",
        ascending=True
    )
)

# Create the chart.
plt.figure(figsize=(12, 6))

plt.barh(
    top_distance_education_states["state_name"],
    top_distance_education_states["districts_using_distance_education"]
)

# Add chart title and labels.
plt.title("Top States by District Distance Education Adoption")
plt.xlabel("Districts Using Distance Education")
plt.ylabel("State")

# Improve spacing.
plt.tight_layout()

# Display chart.
plt.show()

Creating the Distance Education Adoption Fact Table

Before aggregating, a data-quality note: the CRDC distance-education file uses -9 as a reserved code for “missing or suppressed value,” not as a real student count. Roughly 65% of the districts in this file (11,384 out of 17,604) report -9 for both the male and female totals. If those codes are treated as numbers and summed, the resulting state totals are nonsense — several states end up with negative student counts. I replace the sentinel codes with NaN so they’re excluded from the sum rather than corrupting it.

# Pull the columns we need from the cleaned distance-ed dataset.
fact_distance_education = df_distance_education_clean[
    [
        "district_id",
        "state_name",
        "district_name",
        "distance_education_indicator",
        "male_distance_education_enrollment",
        "female_distance_education_enrollment"
    ]
].copy()

# Coerce to numeric. errors="coerce" turns non-numeric strings into NaN.
for col in ["male_distance_education_enrollment", "female_distance_education_enrollment"]:
    fact_distance_education[col] = pd.to_numeric(
        fact_distance_education[col], errors="coerce"
    )

# Replace CRDC sentinel codes for suppressed/missing data with NaN.
# Negative values are not valid student counts in this file.
sentinel_codes = [-3, -5, -6, -8, -9, -11]
for col in ["male_distance_education_enrollment", "female_distance_education_enrollment"]:
    fact_distance_education.loc[
        fact_distance_education[col].isin(sentinel_codes), col
    ] = np.nan

# Total enrollment per district. With sentinels nulled out, the sum treats
# missing values as missing instead of as -9.
fact_distance_education["total_distance_education_enrollment"] = (
    fact_distance_education["male_distance_education_enrollment"]
    .add(fact_distance_education["female_distance_education_enrollment"], fill_value=0)
)

# How much of the file was affected? Useful to show explicitly.
total_districts = len(fact_distance_education)
suppressed_districts = (
    fact_distance_education["male_distance_education_enrollment"].isna()
    & fact_distance_education["female_distance_education_enrollment"].isna()
).sum()
print(f"Districts in file: {total_districts:,}")
print(f"Districts with both M and F suppressed (no usable count): {suppressed_districts:,} "
      f"({suppressed_districts / total_districts:.1%})")

fact_distance_education.head()
Districts in file: 17,604
Districts with both M and F suppressed (no usable count): 11,384 (64.7%)
district_id state_name district_name distance_education_indicator male_distance_education_enrollment female_distance_education_enrollment total_distance_education_enrollment
0 0100002 ALABAMA Alabama Youth Services No NaN NaN NaN
1 0100005 ALABAMA Albertville City Yes 4.0 14.0 18.0
2 0100006 ALABAMA Marshall County Yes 64.0 75.0 139.0
3 0100007 ALABAMA Hoover City Yes 218.0 340.0 558.0
4 0100008 ALABAMA Madison City Yes 89.0 154.0 243.0

Loading the Distance Education Fact Table into PostgreSQL

After creating the distance education fact table, I load it into PostgreSQL so technology adoption metrics can be queried with SQL and connected to the broader education warehouse.

# Load the distance education fact table into PostgreSQL.
fact_distance_education.to_sql(
    name="fact_distance_education_adoption",
    con=engine,
    schema="public",
    if_exists="replace",
    index=False
)

# Confirm successful PostgreSQL table creation.
print("Distance education adoption fact table loaded successfully into PostgreSQL.")
Distance education adoption fact table loaded successfully into PostgreSQL.

Verifying the Distance Education Fact Table in PostgreSQL

To confirm the distance education adoption fact table loaded correctly, I query PostgreSQL and preview the stored technology adoption records.

# Query PostgreSQL to preview the distance education adoption fact table.
distance_education_preview_query = """
SELECT *
FROM public.fact_distance_education_adoption
LIMIT 10;
"""

# Display distance education adoption records.
pd.read_sql(distance_education_preview_query, engine)
district_id state_name district_name distance_education_indicator male_distance_education_enrollment female_distance_education_enrollment total_distance_education_enrollment
0 0100002 ALABAMA Alabama Youth Services No NaN NaN NaN
1 0100005 ALABAMA Albertville City Yes 4.0 14.0 18.0
2 0100006 ALABAMA Marshall County Yes 64.0 75.0 139.0
3 0100007 ALABAMA Hoover City Yes 218.0 340.0 558.0
4 0100008 ALABAMA Madison City Yes 89.0 154.0 243.0
5 0100009 ALABAMA Al Inst Deaf And Blind No NaN NaN NaN
6 0100011 ALABAMA Leeds City Yes 71.0 64.0 135.0
7 0100012 ALABAMA Boaz City Yes 13.0 32.0 45.0
8 0100013 ALABAMA Trussville City No NaN NaN NaN
9 0100018 ALABAMA Alabama School of Fine Arts No NaN NaN NaN

Analyzing Distance Education Enrollment by State

To evaluate the scale of technology-enabled learning participation, I summarize total distance education enrollment by state.

# SQL query to summarize distance education enrollment by state.
distance_education_enrollment_query = read_sql_file("distance_ed_top_states.sql")
# Display top states by distance education enrollment.
pd.read_sql(distance_education_enrollment_query, engine)
state_name total_distance_education_enrollment
0 TEXAS 103166.0
1 CALIFORNIA 98436.0
2 GEORGIA 72605.0
3 ARKANSAS 59193.0
4 MICHIGAN 51837.0
5 PENNSYLVANIA 48042.0
6 IOWA 43741.0
7 VIRGINIA 42440.0
8 NORTH CAROLINA 38877.0
9 OHIO 38492.0

Visualizing Distance Education Enrollment by State

To better understand technology-enabled learning participation, I create a visualization showing the states with the highest distance education enrollment totals.

# Store the distance education enrollment query results in a DataFrame.
distance_education_enrollment_df = pd.read_sql(
    distance_education_enrollment_query,
    engine
)

# Convert numeric values.
distance_education_enrollment_df[
    "total_distance_education_enrollment"
] = pd.to_numeric(
    distance_education_enrollment_df[
        "total_distance_education_enrollment"
    ]
)

# Sort values for visualization.
distance_education_enrollment_df = (
    distance_education_enrollment_df
    .sort_values(
        by="total_distance_education_enrollment",
        ascending=True
    )
)

# Create the chart.
plt.figure(figsize=(12, 6))

plt.barh(
    distance_education_enrollment_df["state_name"],
    distance_education_enrollment_df[
        "total_distance_education_enrollment"
    ]
)

# Add chart title and labels.
plt.title("Top States by Distance Education Enrollment")
plt.xlabel("Distance Education Enrollment")
plt.ylabel("State")

# Improve spacing.
plt.tight_layout()

# Display chart.
plt.show()

Connecting Distance Education Adoption with NAEP Math Performance

To begin evaluating the relationship between technology-enabled learning participation and student outcomes, I join state-level distance education enrollment with 2024 NAEP Grade 8 mathematics performance scores.

# SQL query joining distance education enrollment with 2024 NAEP math scores.
distance_education_naep_query = read_sql_file("distance_ed_naep_comparison.sql")
# Display the joined distance education and NAEP performance dataset.
pd.read_sql(distance_education_naep_query, engine)
state_name total_distance_education_enrollment average_scale_score
0 TEXAS 103166.0 269.393041
1 CALIFORNIA 98436.0 268.779818
2 GEORGIA 72605.0 268.655805
3 ARKANSAS 59193.0 266.194834
4 MICHIGAN 51837.0 269.978774
5 PENNSYLVANIA 48042.0 276.245915
6 IOWA 43741.0 274.823931
7 VIRGINIA 42440.0 274.988690
8 NORTH CAROLINA 38877.0 275.821518
9 OHIO 38492.0 278.812721
10 MASSACHUSETTS 34140.0 283.465559
11 ARIZONA 34048.0 269.656164
12 NEVADA 30938.0 265.177162
13 ALABAMA 25343.0 261.769896
14 SOUTH CAROLINA 21815.0 268.006371
15 COLORADO 21727.0 278.107690
16 UTAH 21405.0 281.790139
17 INDIANA 20332.0 278.185064
18 NEW MEXICO 18179.0 256.210854
19 OREGON 18157.0 267.940476
20 LOUISIANA 14579.0 266.783264
21 WASHINGTON 14276.0 273.665174
22 FLORIDA 14121.0 267.242908
23 WISCONSIN 12322.0 282.649634
24 IDAHO 11994.0 278.104479
25 MISSOURI 10067.0 270.363197
26 ILLINOIS 9845.0 277.416366
27 MINNESOTA 9657.0 282.087421
28 NEW YORK 8990.0 270.851724
29 KENTUCKY 8462.0 270.813300
30 MARYLAND 6948.0 268.213760
31 OKLAHOMA 6893.0 264.454031
32 NEW JERSEY 6222.0 281.684698
33 ALASKA 6100.0 263.962653
34 TENNESSEE 5783.0 275.982988
35 NORTH DAKOTA 4195.0 279.765445
36 KANSAS 4096.0 274.136728
37 NEW HAMPSHIRE 3556.0 279.744642
38 MISSISSIPPI 3246.0 269.060485
39 NEBRASKA 3136.0 279.883496
40 WEST VIRGINIA 2750.0 260.769078
41 SOUTH DAKOTA 2739.0 281.101137
42 MONTANA 2510.0 279.082096
43 DELAWARE 1693.0 263.057137
44 WYOMING 1566.0 278.315324
45 MAINE 1395.0 272.621883
46 CONNECTICUT 1297.0 276.674248
47 PUERTO RICO 1190.0 216.087041
48 VERMONT 1051.0 275.617661
49 HAWAII 813.0 270.041103
50 RHODE ISLAND 745.0 269.826789
51 DISTRICT OF COLUMBIA 218.0 261.538791

Visualizing Distance Education Enrollment and NAEP Math Performance

Same approach as the enrollment-size analysis: scatter plot with a Spearman correlation. Each point is one state.

distance_education_naep_df = pd.read_sql(distance_education_naep_query, engine)

distance_education_naep_df["total_distance_education_enrollment"] = pd.to_numeric(
    distance_education_naep_df["total_distance_education_enrollment"]
)
distance_education_naep_df["average_scale_score"] = pd.to_numeric(
    distance_education_naep_df["average_scale_score"]
).round(2)

# Spearman correlation between distance-ed enrollment and 2024 NAEP score.
rho_de, p_de = spearmanr(
    distance_education_naep_df["total_distance_education_enrollment"],
    distance_education_naep_df["average_scale_score"],
)

fig, ax = plt.subplots(figsize=(11, 6))
ax.scatter(
    distance_education_naep_df["total_distance_education_enrollment"],
    distance_education_naep_df["average_scale_score"],
    s=40,
)

# Annotate the most extreme states on each axis so the chart isn't anonymous.
de_extremes = pd.concat([
    distance_education_naep_df.nlargest(4, "total_distance_education_enrollment"),
    distance_education_naep_df.nlargest(2, "average_scale_score"),
    distance_education_naep_df.nsmallest(2, "average_scale_score"),
]).drop_duplicates(subset=["state_name"])

for _, row in de_extremes.iterrows():
    ax.annotate(
        row["state_name"].title(),
        (row["total_distance_education_enrollment"], row["average_scale_score"]),
        xytext=(5, 5), textcoords="offset points", fontsize=9,
    )

ax.set_title(
    f"Distance Education Enrollment vs. 2024 NAEP Math Score (Spearman ρ = {rho_de:.2f})"
)
ax.set_xlabel("State Distance Education Enrollment (CRDC, sentinel codes excluded)")
ax.set_ylabel("2024 NAEP Grade 8 Math — Average Scale Score")
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Spearman: rho = {rho_de:.3f}, p-value = {p_de:.3f}")

Spearman: rho = -0.026, p-value = 0.856

Distance Education and NAEP Performance Interpretation

The headline number: Spearman ρ is near zero, p-value well above 0.05. There’s no meaningful rank correlation between state distance-education enrollment and 2024 NAEP Grade 8 math scores. Texas leads in raw distance-ed enrollment and sits below the median on NAEP; Massachusetts has middling distance-ed enrollment and tops the score chart. The relationship just isn’t there at the state aggregate level.

A few things worth being explicit about so this isn’t overclaimed:

  • This is correlation, not causation. Even a strong correlation here wouldn’t tell you whether distance education helps, hurts, or has no effect — districts adopting distance ed at scale differ from non-adopters in many ways beyond their technology choices.
  • State-level aggregates hide everything interesting underneath. A state with 100,000 distance-ed students could be 100 districts with 1,000 each, or 5 districts with 20,000 each. Whether the program “worked” is a district-level question that disappears in the rollup.
  • 2024 is one year. Distance education adoption surged 2020–2022 and has been receding since. A single-year snapshot can’t show that arc. Useful follow-up: trend fact_distance_education_adoption against NAEP year-over-year.
  • Distance-ed enrollment is not the same as technology access. A district can have universal device + broadband access and zero distance-ed enrollment. CRDC has fields closer to “access” (1:1 device programs, broadband infrastructure) that I haven’t pulled into the warehouse yet — that’s the natural next layer.

What this analysis does establish is the pipeline plumbing: every CRDC district can be joined to its state, its enrollment, its distance-ed participation, and its state’s NAEP scores in one SQL query. Once technology-access fields are added, the same join shape extends to answer the more specific question.

Final Technology Impact Analysis Conclusion

This project built an end-to-end data engineering and analytics workflow to ask a focused question: does technology-enabled learning participation in U.S. public schools associate with measurable differences in academic performance? The short answer, from the data integrated here, is no — not at the state aggregate level, and not in a way that survives basic statistical scrutiny.

The longer answer is that the question is harder than it looks. The most informative outcome of this project isn’t the correlation itself; it’s the warehouse and the pipeline that make the next, harder version of the question tractable. Once district-level technology access measures (1:1 device programs, broadband, instructional technology investment) are layered into the same fact-table structure, the same join shape — district → state → year → outcome — can answer a question that state-level distance-ed enrollment can’t.

What this iteration delivered:

  • Four federal datasets (CCD operational data, NAEP Grade 8 math, CRDC district characteristics, CRDC distance education) extracted, cleaned, and loaded into PostgreSQL
  • A dimensional warehouse (dim_states, dim_districts, dim_years, plus four fact tables) that supports state-level, district-level, and longitudinal analysis
  • A documented data-quality fix for the CRDC -9 sentinel-code issue — 65% of districts in the distance-ed file had suppressed values that produced negative state totals in earlier iterations; the corrected aggregation drops California’s total from understated-negative territory to ~98K students and Texas to ~103K
  • Honest statistical reporting — Spearman correlations with p-values rather than eyeballed bar charts, and scatter plots that don’t hide the underlying distribution
  • A reproducible report rendered with Quarto that ties code, results, and interpretation together in one place

What this iteration didn’t deliver, and which I’d want to add next:

  • District-level technology access measures from CRDC, not just distance-ed participation
  • Year-over-year trend modeling for the post-pandemic period using fact_distance_education_adoption over time
  • Tests on the ETL transforms — particularly around the sentinel-code handling, since that’s the bug class that bit me here

The skills exercised across the workflow: dimensional warehouse design in PostgreSQL, SQL analytics (CTEs, conditional aggregation, multi-source joins), data quality investigation and remediation, statistical reporting with appropriate caveats, and reproducible reporting with Quarto.

Live Tableau Dashboards

The warehouse and analysis in this report also power three interactive dashboards published on Tableau Public. They’re built on top of the same fact and dimension tables — the static charts here are a snapshot; the dashboards let you filter, hover, and explore the data yourself.