This project analyzes how education technology adoption may relate to public school performance across the United States.
The goal of this project is to build a portfolio-grade data engineering and analytics workflow that extracts, cleans, transforms, loads, and analyzes education-related datasets using Python, PostgreSQL, SQL, and Quarto.
I use this workflow to develop a reproducible analytics pipeline that can support long-term research into whether access to technology in schools is associated with measurable differences in educational outcomes.
Project Objectives
The main objectives of this project are to:
Extract raw public education, student outcome, and technology participation datasets from trusted national sources
Clean and standardize inconsistent public-sector data files
Build a PostgreSQL analytical warehouse using fact and dimension tables
Connect district, state, year, educational outcome, and technology adoption data through a dimensional model
Perform SQL-based analysis directly against PostgreSQL
Create visualizations and reporting outputs inside a reproducible Quarto HTML report
Evaluate how technology-enabled learning participation may relate to measurable student outcome trends
Data Sources
This project uses publicly available education and technology-related datasets, including:
Dataset
Source Purpose
CCD Common Core of Data
District and school operational information
NAEP National Assessment of Educational Progress
State-level academic performance outcomes
CRDC Civil Rights Data Collection — LEA Characteristics
District-level characteristics and enrollment information
CRDC Distance Education
Technology-enabled learning participation and distance education enrollment indicators
The completed workflow integrates CCD district data, NAEP Grade 8 mathematics performance data, CRDC district characteristics, and CRDC distance education participation metrics. Together, these datasets support analysis of how technology adoption and district-level conditions may relate to student outcome patterns.
Tools and Technologies
This project uses:
Tool
Purpose
Python
Data extraction, cleaning, transformation, and validation
pandas
Data manipulation and profiling
PostgreSQL
Analytical warehouse storage
SQL
Warehouse querying and analysis
Quarto
Reproducible HTML reporting
matplotlib
Data visualization
GitHub
Version control and portfolio documentation
Research Direction
The central research question behind this project is:
Does increased access to technology-enabled learning in public schools appear to be associated with measurable differences in academic performance across states, districts, and reporting years?
This project builds an end-to-end data engineering and analytics workflow to explore that question using integrated public education datasets, PostgreSQL warehouse modeling, SQL analysis, and reproducible reporting.
Project Setup
Before any data work, I load the libraries used throughout the report and connect to the local PostgreSQL warehouse. Credentials are read from a .env file (see .env.example for the variables you need to define locally). The .env file itself is gitignored — it never leaves the machine.
import osfrom pathlib import Pathimport numpy as npimport pandas as pdimport matplotlib.pyplot as pltfrom dotenv import load_dotenv, find_dotenvfrom sqlalchemy import create_enginefrom sqlalchemy.engine import URL# Load credentials from the local .env file.load_dotenv(find_dotenv(usecwd=True))DB_USER = os.getenv("DB_USER")DB_PASSWORD = os.getenv("DB_PASSWORD")DB_HOST = os.getenv("DB_HOST")DB_PORT = os.getenv("DB_PORT")DB_NAME = os.getenv("DB_NAME")# Fail fast if anything is missing — easier to debug than a cryptic error later.required = {"DB_USER": DB_USER, "DB_PASSWORD": DB_PASSWORD,"DB_HOST": DB_HOST, "DB_PORT": DB_PORT, "DB_NAME": DB_NAME}missing = [k for k, v in required.items() ifnot v]if missing:raiseRuntimeError(f"Missing environment variables: {missing}")# URL.create handles special characters in the password safely.database_url = URL.create( drivername="postgresql+psycopg2", username=DB_USER, password=DB_PASSWORD, host=DB_HOST, port=int(DB_PORT), database=DB_NAME,)engine = create_engine(database_url)# Analytical SQL queries live in sql/queries/ rather than being embedded# inline as Python strings. This helper reads one of those files.# The notebooks/ folder is one level under the repo root, so queries are# at ../sql/queries/<filename>.sql relative to this .qmd.SQL_QUERIES_DIR = Path("../sql/queries").resolve()def read_sql_file(filename: str) ->str:"""Read a SQL file from sql/queries/ and return its contents.""" path = SQL_QUERIES_DIR / filenameifnot path.exists():raiseFileNotFoundError(f"SQL file not found: {path}")return path.read_text(encoding="utf-8")# Quick connection sanity check.with engine.connect() as conn:print(f"Connected to PostgreSQL: {DB_NAME} @ {DB_HOST}:{DB_PORT}")
Connected to PostgreSQL: education_tech_impact @ localhost:5432
Data Extraction — CCD Dataset
In this step, I load the Common Core of Data (CCD) dataset into Python for initial inspection and preprocessing.
This dataset contains district-level information including enrollment, number of schools, and geographic details.
I use pandas to read the CSV file and examine its structure before loading it into PostgreSQL.
# Define the relative path to the CCD dataset.# The raw data is stored locally and excluded from GitHub using .gitignore.file_path ="../data/raw/ccd/ccd_lea_2024.csv"# Load the CCD CSV file into a pandas DataFrame.# This allows me to inspect the dataset before database loading.df_ccd = pd.read_csv(file_path)# Display the first five records to confirm the file loaded correctly.df_ccd.head()
C:\Users\User\AppData\Local\Temp\ipykernel_21316\1148928295.py:7: DtypeWarning: Columns (0: MSTREET3) have mixed types. Specify dtype option on import or set low_memory=False.
df_ccd = pd.read_csv(file_path)
SCHOOL_YEAR
FIPST
STATENAME
ST
LEA_NAME
STATE_AGENCY_NO
UNION
ST_LEAID
LEAID
MSTREET1
...
G_11_OFFERED
G_12_OFFERED
G_13_OFFERED
G_UG_OFFERED
G_AE_OFFERED
GSLO
GSHI
LEVEL
IGOFFERED
OPERATIONAL_SCHOOLS
0
2024-2025
1
ALABAMA
AL
Alabama Youth Services
1
NaN
AL-210
100002
1000 Industrial School Road
...
Yes
Yes
No
No
No
KG
12
Other
As reported
0
1
2024-2025
1
ALABAMA
AL
Albertville City
1
NaN
AL-101
100005
8379 US Highway 431
...
Yes
Yes
No
No
No
PK
12
Other
As reported
7
2
2024-2025
1
ALABAMA
AL
Marshall County
1
NaN
AL-048
100006
12380 US Highway 431 S
...
Yes
Yes
No
No
No
PK
12
Other
As reported
17
3
2024-2025
1
ALABAMA
AL
Hoover City
1
NaN
AL-158
100007
2810 Metropolitan Way
...
Yes
Yes
No
No
No
PK
12
Other
As reported
18
4
2024-2025
1
ALABAMA
AL
Madison City
1
NaN
AL-169
100008
211 Celtic Dr
...
Yes
Yes
No
No
No
PK
12
Other
As reported
13
5 rows × 58 columns
Initial Data Inspection — CCD Dataset
In this step, I inspect the CCD dataset to understand its size, structure, and key columns before performing any cleaning or transformation.
This inspection helps confirm that the data loaded correctly and provides an initial understanding of the fields available for analysis.
# Display the number of rows and columns in the CCD dataset.# This gives me a quick overview of the dataset size before cleaning.row_count, column_count = df_ccd.shapeprint(f"Number of rows: {row_count}")print(f"Number of columns: {column_count}")# Display all column names so I can identify fields needed for the district dimension table.df_ccd.columns.tolist()
In this step, I identify the key district-level fields that will be used to build the district dimension table in PostgreSQL.
The CCD dataset contains many administrative and metadata fields. For the initial data model, I focus on the core district identification and geographic columns needed for analysis.
# Select district-related columns needed for the dimension table.district_columns = ["LEAID","LEA_NAME","ST","STATENAME","MCITY","WEBSITE","OPERATIONAL_SCHOOLS"]# Create a smaller district-focused DataFrame.df_districts = df_ccd[district_columns]# Display the first five district records.df_districts.head()
LEAID
LEA_NAME
ST
STATENAME
MCITY
WEBSITE
OPERATIONAL_SCHOOLS
0
100002
Alabama Youth Services
AL
ALABAMA
Mt Meigs
http://www.dys.alabama.gov/school-district.html
0
1
100005
Albertville City
AL
ALABAMA
Albertville
http://www.albertk12.org
7
2
100006
Marshall County
AL
ALABAMA
Guntersville
http://www.marshallk12.org
17
3
100007
Hoover City
AL
ALABAMA
Hoover
http://www.hoovercityschools.net
18
4
100008
Madison City
AL
ALABAMA
Madison
http://www.madisoncity.k12.al.us
13
Cleaning the District Dimension Dataset
Before loading the district data into PostgreSQL, I clean the dataset to remove duplicate districts and standardize the column structure.
This step helps ensure the dimension table contains one unique record per school district.
# Create a copy of the district DataFrame to preserve the original dataset.df_district_dim = df_districts.copy()# Remove duplicate district records using the LEAID identifier.df_district_dim = df_district_dim.drop_duplicates(subset=["LEAID"])# Standardize column names for PostgreSQL compatibility.df_district_dim.columns = ["district_id","district_name","state_code","state_name","city","website","operational_schools"]# Reset the index after cleaning so the DataFrame has a clean sequence.df_district_dim = df_district_dim.reset_index(drop=True)# Display the cleaned district dimension dataset.df_district_dim.head()
district_id
district_name
state_code
state_name
city
website
operational_schools
0
100002
Alabama Youth Services
AL
ALABAMA
Mt Meigs
http://www.dys.alabama.gov/school-district.html
0
1
100005
Albertville City
AL
ALABAMA
Albertville
http://www.albertk12.org
7
2
100006
Marshall County
AL
ALABAMA
Guntersville
http://www.marshallk12.org
17
3
100007
Hoover City
AL
ALABAMA
Hoover
http://www.hoovercityschools.net
18
4
100008
Madison City
AL
ALABAMA
Madison
http://www.madisoncity.k12.al.us
13
Loading the District Dimension Table into PostgreSQL
After cleaning the district dataset, I load the data into PostgreSQL as a dimension table.
This step creates the foundation for future analytics and fact tables that will be connected through district identifiers.
# Load the cleaned district dimension dataset into PostgreSQL.# if_exists="replace" allows this workflow to be rerun during development# without creating duplicate tables or duplicate records.df_district_dim.to_sql( name="dim_districts", con=engine, schema="public", if_exists="replace", index=False)# Confirm successful PostgreSQL table creation.print("District dimension table loaded successfully into PostgreSQL.")
District dimension table loaded successfully into PostgreSQL.
Verifying the PostgreSQL Target Database
Before continuing, I verify which PostgreSQL database the workflow is connected to and confirm where the table was created.
# Query PostgreSQL to confirm the active database, schema, and user.connection_check_query ="""SELECT current_database() AS active_database, current_schema() AS active_schema, current_user AS active_user;"""# Display the PostgreSQL connection details.pd.read_sql(connection_check_query, engine)
active_database
active_schema
active_user
0
education_tech_impact
public
postgres
Verifying the District Dimension Table in PostgreSQL
After loading the district dimension table, I verify that the table exists inside PostgreSQL.
# Query PostgreSQL to confirm the dimension table exists.table_check_query ="""SELECT table_schema, table_nameFROM information_schema.tablesWHERE table_schema = 'public' AND table_name = 'dim_districts';"""# Display matching PostgreSQL tables.pd.read_sql(table_check_query, engine)
table_schema
table_name
0
public
dim_districts
Counting Records in the District Dimension Table
To confirm the table was loaded correctly, I query PostgreSQL and count the total number of district records stored in the dimension table.
# Count records inside the PostgreSQL dimension table.district_count_query ="""SELECT COUNT(*) AS total_district_recordsFROM public.dim_districts;"""# Display the total number of district records.pd.read_sql(district_count_query, engine)
total_district_records
0
19630
Previewing Records from the District Dimension Table
As a final validation step, I preview the first few records directly from the PostgreSQL table.
# Preview records directly from the PostgreSQL dimension table.district_preview_query ="""SELECT district_id, district_name, state_code, state_name, city, operational_schoolsFROM public.dim_districtsLIMIT 5;"""# Display the first five records from PostgreSQL.pd.read_sql(district_preview_query, engine)
district_id
district_name
state_code
state_name
city
operational_schools
0
100002
Alabama Youth Services
AL
ALABAMA
Mt Meigs
0
1
100005
Albertville City
AL
ALABAMA
Albertville
7
2
100006
Marshall County
AL
ALABAMA
Guntersville
17
3
100007
Hoover City
AL
ALABAMA
Hoover
18
4
100008
Madison City
AL
ALABAMA
Madison
13
Creating the State Dimension Table
To support state-level analysis, I create a separate state dimension table from the cleaned district dataset.
This table stores one unique record per state and allows future fact tables to connect district-level data to state-level summaries.
# Create the state dimension table from the cleaned district dataset.df_state_dim = ( df_district_dim[["state_code", "state_name"]] .drop_duplicates() .sort_values(by="state_code") .reset_index(drop=True))# Add a numeric state key for PostgreSQL modeling.df_state_dim.insert(0, "state_key", range(1, len(df_state_dim) +1))# Display the state dimension table.df_state_dim.head()
state_key
state_code
state_name
0
1
AK
ALASKA
1
2
AL
ALABAMA
2
3
AR
ARKANSAS
3
4
AS
AMERICAN SAMOA
4
5
AZ
ARIZONA
Loading the State Dimension Table into PostgreSQL
After creating the state dimension dataset, I load the table into PostgreSQL for future state-level analytics and reporting.
# Load the state dimension table into PostgreSQL.df_state_dim.to_sql( name="dim_states", con=engine, schema="public", if_exists="replace", index=False)# Confirm successful PostgreSQL table creation.print("State dimension table loaded successfully into PostgreSQL.")
State dimension table loaded successfully into PostgreSQL.
Verifying the State Dimension Table in PostgreSQL
To confirm the state dimension table loaded successfully, I query PostgreSQL and count the total number of state records.
state_count_query ="""SELECT COUNT(*) AS total_state_recordsFROM public.dim_states;"""pd.read_sql(state_count_query, engine)
total_state_records
0
57
Creating the District School Counts Fact Table
To support analytical reporting, I create the first fact table using district-level school count data.
This table connects each district to its state and stores the number of operational schools reported in the CCD dataset.
# Create a fact table for district-level school counts.fact_district_school_counts = df_district_dim[ ["district_id","state_code","operational_schools" ]].copy()# Rename the school count column to make the metric purpose clear.fact_district_school_counts = fact_district_school_counts.rename( columns={"operational_schools": "operational_school_count"})# Display the first five records from the fact table.fact_district_school_counts.head()
district_id
state_code
operational_school_count
0
100002
AL
0
1
100005
AL
7
2
100006
AL
17
3
100007
AL
18
4
100008
AL
13
Loading the District School Counts Fact Table into PostgreSQL
After creating the district school counts fact table, I load it into PostgreSQL so it can be queried with SQL and used for future dashboard analysis.
# Load the district school counts fact table into PostgreSQL.fact_district_school_counts.to_sql( name="fact_district_school_counts", con=engine, schema="public", if_exists="replace", index=False)# Confirm successful PostgreSQL table creation.print("District school counts fact table loaded successfully into PostgreSQL.")
District school counts fact table loaded successfully into PostgreSQL.
Analyzing States with the Highest Number of Operational Schools
To begin the analytical portion of the project, I query PostgreSQL to identify which states have the highest total number of operational schools.
This query joins the state dimension table with the district school counts fact table and aggregates the results at the state level.
# SQL query to identify states with the highest operational school counts.top_states_query = read_sql_file("top_states_by_operational_schools.sql")# Execute the SQL query and display the results.pd.read_sql(top_states_query, engine)
state_name
total_operational_schools
0
CALIFORNIA
10303.0
1
TEXAS
9192.0
2
NEW YORK
4810.0
3
ILLINOIS
4398.0
4
FLORIDA
4186.0
5
OHIO
3543.0
6
MICHIGAN
3468.0
7
PENNSYLVANIA
2932.0
8
MINNESOTA
2741.0
9
NORTH CAROLINA
2705.0
Visualizing the Top 10 States by Operational Schools
To make the SQL analysis easier to interpret, I create a bar chart showing the top 10 states by total operational school count.
This visualization helps translate the PostgreSQL query results into a clear portfolio-ready analytical output.
# Store the SQL query result in a DataFrame for visualization.top_states_df = pd.read_sql(top_states_query, engine)# Create a horizontal bar chart for readability.plt.figure(figsize=(10, 6))plt.barh( top_states_df["state_name"], top_states_df["total_operational_schools"])# Invert the y-axis so the state with the highest count appears at the top.plt.gca().invert_yaxis()# Add chart labels and title.plt.title("Top 10 States by Operational School Count")plt.xlabel("Total Operational Schools")plt.ylabel("State")# Improve spacing so labels do not overlap.plt.subplots_adjust(left=0.28)# Display the chart in the Quarto HTML report.plt.show()
Analyzing Average Operational Schools per District by State
To compare state-level district structures, I calculate the average number of operational schools per district for each state.
This helps show which states tend to have larger or smaller district-level school systems.
# SQL query to calculate the average number of operational schools per district by state.avg_schools_per_district_query = read_sql_file("avg_schools_per_district_by_state.sql")# Execute the SQL query and display the results.pd.read_sql(avg_schools_per_district_query, engine)
state_name
avg_operational_schools_per_district
0
PUERTO RICO
862.00
1
HAWAII
296.00
2
MARYLAND
56.32
3
FLORIDA
49.83
4
NEVADA
36.75
5
NORTHERN MARIANAS
35.00
6
AMERICAN SAMOA
28.00
7
SOUTH CAROLINA
13.30
8
TENNESSEE
12.84
9
VIRGINIA
10.17
Visualizing Average Operational Schools per District by State
To better interpret the state-level district averages, I create a second visualization showing the average number of operational schools per district.
# Store the SQL query result in a DataFrame for visualization.avg_schools_df = pd.read_sql(avg_schools_per_district_query, engine)# Create a horizontal bar chart.plt.figure(figsize=(10, 6))plt.barh( avg_schools_df["state_name"], avg_schools_df["avg_operational_schools_per_district"])# Display the highest averages at the top.plt.gca().invert_yaxis()# Add chart labels and title.plt.title("Average Operational Schools per District by State")plt.xlabel("Average Operational Schools")plt.ylabel("State")# Improve chart spacing.plt.subplots_adjust(left=0.28)# Display the visualization.plt.show()
Tableau Dashboard Development Plan
The core data engineering and analytical workflow is now complete. The remaining presentation layer for this project is Tableau dashboard development.
The Tableau dashboard will use the cleaned warehouse outputs and focus on communicating the project’s main findings clearly for recruiters, hiring managers, and interview discussions.
NAEP Grade 8 mathematics performance by state
Post-pandemic score changes between 2019 and 2024
Distance education adoption and participation by state
Enrollment size compared with educational performance
Technology-enabled learning participation compared with student outcome measures
This final dashboard layer will summarize the technical workflow in an executive-style visual format while the Quarto report remains the reproducible data engineering documentation.
Current PostgreSQL Warehouse Schema
At this stage of the project, the PostgreSQL warehouse contains dimension and fact tables supporting operational education data, student outcome analysis, district characteristics, and distance education technology adoption metrics.
Table Name
Description
dim_districts
Stores district-level identification and geographic information
dim_states
Stores state-level identifiers and surrogate state keys
dim_years
Stores reporting years for longitudinal analysis
dim_crdc_district_characteristics
Stores CRDC district characteristics, enrollment, and district-level context
Table Name
Description
fact_district_school_counts
Stores operational school counts by district and state
fact_district_school_counts_enhanced
Stores school counts using surrogate state keys for enhanced warehouse modeling
fact_naep_math_performance
Stores NAEP Grade 8 mathematics performance outcomes by state and reporting year
fact_distance_education_adoption
Stores CRDC distance education adoption and participation metrics by district
This warehouse structure supports multi-source educational analytics, longitudinal student outcome analysis, distance education adoption reporting, and Tableau dashboard development.
Data Model Relationships
The warehouse model connects operational, outcome, district characteristic, and technology adoption data through shared state, district, and year identifiers.
Fact / Analytical Table
Related Dimension / Dataset
Join Key
fact_district_school_counts
dim_districts
district_id
fact_district_school_counts_enhanced
dim_states
state_key
fact_naep_math_performance
dim_states
state_key
fact_naep_math_performance
dim_years
year_key
dim_crdc_district_characteristics
fact_distance_education_adoption
district_id
fact_distance_education_adoption
NAEP state-level analytical outputs
state_name
This design allows technology adoption metrics, district characteristics, enrollment scale, and student outcome measures to be analyzed together in a consistent warehouse-driven workflow.
Data Quality Validation Checks
Before using the warehouse tables for analysis, I perform several basic data quality checks to validate the integrity of the transformed data.
# Count missing district identifiers.missing_district_ids = df_district_dim["district_id"].isnull().sum()# Count missing state codes.missing_state_codes = df_district_dim["state_code"].isnull().sum()# Count duplicate district identifiers.duplicate_district_ids = df_district_dim["district_id"].duplicated().sum()# Display validation results.validation_results = pd.DataFrame({"validation_check": ["Missing district IDs","Missing state codes","Duplicate district IDs" ],"result": [ missing_district_ids, missing_state_codes, duplicate_district_ids ]})validation_results
validation_check
result
0
Missing district IDs
0
1
Missing state codes
0
2
Duplicate district IDs
0
Creating the Year Dimension Table
To support longitudinal and multi-year analysis, I create a year dimension table.
This table connects NAEP reporting years and future multi-year education datasets through a consistent warehouse reporting year structure.
# Define the reporting years used in the project.analysis_years = [2009,2011,2013,2015,2017,2019,2022,2024]# Create the year dimension table.df_year_dim = pd.DataFrame({"year_key": range(1, len(analysis_years) +1),"reporting_year": analysis_years})# Display the year dimension table.df_year_dim
year_key
reporting_year
0
1
2009
1
2
2011
2
3
2013
3
4
2015
4
5
2017
5
6
2019
6
7
2022
7
8
2024
Loading the Year Dimension Table into PostgreSQL
After creating the year dimension table, I load it into PostgreSQL so future CCD, NAEP, and CRDC datasets can connect to a consistent reporting year structure.
# Load the year dimension table into PostgreSQL.df_year_dim.to_sql( name="dim_years", con=engine, schema="public", if_exists="replace", index=False)# Confirm successful PostgreSQL table creation.print("Year dimension table loaded successfully into PostgreSQL.")
Year dimension table loaded successfully into PostgreSQL.
Verifying the Year Dimension Table in PostgreSQL
To confirm the year dimension table loaded correctly, I query PostgreSQL and display the stored reporting years.
# Query PostgreSQL to display the year dimension table.year_dimension_query ="""SELECT *FROM public.dim_yearsORDER BY reporting_year;"""# Display the year dimension records.pd.read_sql(year_dimension_query, engine)
year_key
reporting_year
0
1
2009
1
2
2011
2
3
2013
3
4
2015
4
5
2017
5
6
2019
6
7
2022
7
8
2024
Adding a Surrogate Key to the State Dimension
To improve warehouse design and prepare for future multi-dataset integration, I create a surrogate key for the state dimension table.
Creating the State Dimension Table with Surrogate Keys
To support future warehouse joins and multi-dataset integration, I create a dedicated state dimension table with surrogate integer keys.
# Create a unique state dimension table from the district dimension dataset.df_states_dim = ( df_district_dim[ ["state_code", "state_name"] ] .drop_duplicates() .sort_values("state_name") .reset_index(drop=True))# Add a surrogate integer key for each state.df_states_dim.insert(0,"state_key",range(1, len(df_states_dim) +1))# Display the state dimension table.df_states_dim.head()
state_key
state_code
state_name
0
1
AL
ALABAMA
1
2
AK
ALASKA
2
3
AS
AMERICAN SAMOA
3
4
AZ
ARIZONA
4
5
AR
ARKANSAS
Loading the Enhanced State Dimension into PostgreSQL
After adding surrogate keys to the state dimension table, I load the enhanced warehouse dimension into PostgreSQL.
# Load the enhanced state dimension table into PostgreSQL.df_states_dim.to_sql( name="dim_states", con=engine, schema="public", if_exists="replace", index=False)# Confirm successful PostgreSQL table creation.print("Enhanced state dimension table loaded successfully into PostgreSQL.")
Enhanced state dimension table loaded successfully into PostgreSQL.
Verifying the Enhanced State Dimension in PostgreSQL
To confirm the enhanced state dimension table loaded correctly, I query PostgreSQL and display the stored surrogate keys and state records.
# Query PostgreSQL to display the enhanced state dimension table.state_dimension_query ="""SELECT *FROM public.dim_statesORDER BY state_key;"""# Display the PostgreSQL state dimension table.pd.read_sql(state_dimension_query, engine)
state_key
state_code
state_name
0
1
AL
ALABAMA
1
2
AK
ALASKA
2
3
AS
AMERICAN SAMOA
3
4
AZ
ARIZONA
4
5
AR
ARKANSAS
5
6
BI
BUREAU OF INDIAN EDUCATION
6
7
CA
CALIFORNIA
7
8
CO
COLORADO
8
9
CT
CONNECTICUT
9
10
DE
DELAWARE
10
11
DC
DISTRICT OF COLUMBIA
11
12
FL
FLORIDA
12
13
GA
GEORGIA
13
14
GU
GUAM
14
15
HI
HAWAII
15
16
ID
IDAHO
16
17
IL
ILLINOIS
17
18
IN
INDIANA
18
19
IA
IOWA
19
20
KS
KANSAS
20
21
KY
KENTUCKY
21
22
LA
LOUISIANA
22
23
ME
MAINE
23
24
MD
MARYLAND
24
25
MA
MASSACHUSETTS
25
26
MI
MICHIGAN
26
27
MN
MINNESOTA
27
28
MS
MISSISSIPPI
28
29
MO
MISSOURI
29
30
MT
MONTANA
30
31
NE
NEBRASKA
31
32
NV
NEVADA
32
33
NH
NEW HAMPSHIRE
33
34
NJ
NEW JERSEY
34
35
NM
NEW MEXICO
35
36
NY
NEW YORK
36
37
NC
NORTH CAROLINA
37
38
ND
NORTH DAKOTA
38
39
MP
NORTHERN MARIANAS
39
40
OH
OHIO
40
41
OK
OKLAHOMA
41
42
OR
OREGON
42
43
PA
PENNSYLVANIA
43
44
PR
PUERTO RICO
44
45
RI
RHODE ISLAND
45
46
SC
SOUTH CAROLINA
46
47
SD
SOUTH DAKOTA
47
48
TN
TENNESSEE
48
49
TX
TEXAS
49
50
VI
U.S. VIRGIN ISLANDS
50
51
UT
UTAH
51
52
VT
VERMONT
52
53
VA
VIRGINIA
53
54
WA
WASHINGTON
54
55
WV
WEST VIRGINIA
55
56
WI
WISCONSIN
56
57
WY
WYOMING
Creating an Enhanced Fact Table with Surrogate Keys
To improve warehouse normalization and prepare for future multi-dataset integration, I create an enhanced fact table that connects to the state dimension using surrogate keys.
# Merge the existing district school counts fact table with the enhanced state dimension.df_fact_enhanced = fact_district_school_counts.merge( df_states_dim[ ["state_key", "state_code"] ], on="state_code", how="left")# Reorder columns for warehouse modeling.df_fact_enhanced = df_fact_enhanced[ ["district_id","state_key","state_code","operational_school_count" ]]# Display the enhanced fact table.df_fact_enhanced.head()
district_id
state_key
state_code
operational_school_count
0
100002
1
AL
0
1
100005
1
AL
7
2
100006
1
AL
17
3
100007
1
AL
18
4
100008
1
AL
13
Loading the Enhanced District School Counts Fact Table into PostgreSQL
After adding the state surrogate key to the fact table, I load the enhanced fact table into PostgreSQL for improved warehouse-style analytics.
# Load the enhanced district school counts fact table into PostgreSQL.df_fact_enhanced.to_sql( name="fact_district_school_counts_enhanced", con=engine, schema="public", if_exists="replace", index=False)# Confirm successful PostgreSQL table creation.print("Enhanced district school counts fact table loaded successfully into PostgreSQL.")
Enhanced district school counts fact table loaded successfully into PostgreSQL.
Verifying the Enhanced Fact Table in PostgreSQL
To confirm the enhanced fact table loaded correctly, I query PostgreSQL and display the warehouse-ready records with surrogate state keys.
# Query PostgreSQL to display the enhanced fact table.enhanced_fact_query ="""SELECT *FROM public.fact_district_school_counts_enhancedLIMIT 10;"""# Display the enhanced PostgreSQL fact table.pd.read_sql(enhanced_fact_query, engine)
district_id
state_key
state_code
operational_school_count
0
100002
1
AL
0
1
100005
1
AL
7
2
100006
1
AL
17
3
100007
1
AL
18
4
100008
1
AL
13
5
100009
1
AL
5
6
100010
1
AL
0
7
100011
1
AL
4
8
100012
1
AL
5
9
100013
1
AL
5
Analyzing Operational Schools Using the Enhanced Warehouse Model
To validate the enhanced warehouse model, I run an analytical SQL query using the enhanced fact table and the state dimension table.
This confirms that the surrogate-key relationship supports state-level aggregation correctly.
# SQL query using the enhanced fact table and state surrogate key relationship.enhanced_state_analysis_query = read_sql_file("state_school_counts_with_surrogate_keys.sql")# Display the top states using the enhanced warehouse model.pd.read_sql(enhanced_state_analysis_query, engine)
state_name
total_operational_schools
0
CALIFORNIA
10303.0
1
TEXAS
9192.0
2
NEW YORK
4810.0
3
ILLINOIS
4398.0
4
FLORIDA
4186.0
5
OHIO
3543.0
6
MICHIGAN
3468.0
7
PENNSYLVANIA
2932.0
8
MINNESOTA
2741.0
9
NORTH CAROLINA
2705.0
Visualizing Operational Schools Using the Enhanced Warehouse Model
To make the enhanced warehouse analysis easier to interpret, I create a visualization using the surrogate-key fact table and state dimension relationship.
# Store the enhanced SQL analysis result in a DataFrame.enhanced_state_analysis_df = pd.read_sql(enhanced_state_analysis_query, engine)# Create a horizontal bar chart.plt.figure(figsize=(10, 6))plt.barh( enhanced_state_analysis_df["state_name"], enhanced_state_analysis_df["total_operational_schools"])# Display the highest values at the top.plt.gca().invert_yaxis()# Add chart labels and title.plt.title("Top 10 States by Operational Schools - Enhanced Warehouse Model")plt.xlabel("Total Operational Schools")plt.ylabel("State")# Improve chart spacing.plt.subplots_adjust(left=0.28)# Display the visualization.plt.show()
Exploring the NAEP Dataset Structure
Before integrating NAEP data into the warehouse, I first inspect the structure of the raw NAEP files.
This helps identify available columns, reporting formats, and educational metrics across years.
# Load the NAEP dataset using the correct header row.naep_clean_df = pd.read_excel("../data/raw/naep/naep_math_grade8_state_2009_2024.Xls", header=8)# Display the cleaned NAEP dataset.naep_clean_df.head()
Year
Jurisdiction
All students
Average scale score
0
2024
National
All students
273.833456
1
2024
Alabama
All students
261.769896
2
2024
Alaska
All students
263.962653
3
2024
Arizona
All students
269.656164
4
2024
Arkansas
All students
266.194834
Cleaning and Standardizing the NAEP Dataset
To prepare the NAEP dataset for warehouse integration, I standardize column names and select the core analytical fields.
# Standardize NAEP column names.naep_clean_df.columns = ["reporting_year","jurisdiction","student_group","average_scale_score"]# Keep only the required analytical columns.naep_clean_df = naep_clean_df[ ["reporting_year","jurisdiction","student_group","average_scale_score" ]]# Display the cleaned NAEP dataset.naep_clean_df.head()
reporting_year
jurisdiction
student_group
average_scale_score
0
2024
National
All students
273.833456
1
2024
Alabama
All students
261.769896
2
2024
Alaska
All students
263.962653
3
2024
Arizona
All students
269.656164
4
2024
Arkansas
All students
266.194834
Validating the NAEP Dataset
Before integrating NAEP data into the warehouse, I perform several validation checks to confirm data quality and completeness.
To improve warehouse data quality, I remove incomplete NAEP records containing missing analytical values.
# Remove rows with missing critical analytical values.naep_clean_df = naep_clean_df.dropna( subset=["reporting_year","jurisdiction","student_group","average_scale_score" ])# Reset the index after cleaning.naep_clean_df = naep_clean_df.reset_index(drop=True)# Display the cleaned NAEP dataset.naep_clean_df.head()
reporting_year
jurisdiction
student_group
average_scale_score
0
2024
National
All students
273.833456
1
2024
Alabama
All students
261.769896
2
2024
Alaska
All students
263.962653
3
2024
Arizona
All students
269.656164
4
2024
Arkansas
All students
266.194834
Connecting NAEP Records to the Year Dimension
To prepare NAEP data for warehouse integration, I connect each NAEP record to the year dimension using the reporting year.
# Merge NAEP records with the year dimension to add year_key.naep_with_year_key = naep_clean_df.merge( df_year_dim, on="reporting_year", how="left")# Display NAEP records with year dimension keys.naep_with_year_key.head()
reporting_year
jurisdiction
student_group
average_scale_score
year_key
0
2024
National
All students
273.833456
8.0
1
2024
Alabama
All students
261.769896
8.0
2
2024
Alaska
All students
263.962653
8.0
3
2024
Arizona
All students
269.656164
8.0
4
2024
Arkansas
All students
266.194834
8.0
Connecting NAEP Records to the State Dimension
To integrate NAEP educational outcomes into the warehouse model, I connect NAEP jurisdictions to the state dimension table.
# Merge NAEP records with the state dimension table.naep_warehouse_df = naep_with_year_key.merge( df_states_dim[ ["state_key", "state_name"] ], left_on="jurisdiction", right_on="state_name", how="left")# Display NAEP records connected to warehouse state keys.naep_warehouse_df.head()
reporting_year
jurisdiction
student_group
average_scale_score
year_key
state_key
state_name
0
2024
National
All students
273.833456
8.0
NaN
NaN
1
2024
Alabama
All students
261.769896
8.0
NaN
NaN
2
2024
Alaska
All students
263.962653
8.0
NaN
NaN
3
2024
Arizona
All students
269.656164
8.0
NaN
NaN
4
2024
Arkansas
All students
266.194834
8.0
NaN
NaN
Standardizing NAEP Jurisdiction Names for Warehouse Joins
Before joining NAEP records to the state dimension, I standardize jurisdiction names so they match the uppercase format used in the warehouse dimension table.
# Create a standardized uppercase jurisdiction field for warehouse joins.naep_with_year_key["state_name"] = naep_with_year_key["jurisdiction"].str.upper()# Merge NAEP records with the state dimension table using standardized state names.naep_warehouse_df = naep_with_year_key.merge( df_states_dim[ ["state_key", "state_name"] ], on="state_name", how="left")# Display NAEP records connected to warehouse state keys.naep_warehouse_df.head()
reporting_year
jurisdiction
student_group
average_scale_score
year_key
state_name
state_key
0
2024
National
All students
273.833456
8.0
NATIONAL
NaN
1
2024
Alabama
All students
261.769896
8.0
ALABAMA
1.0
2
2024
Alaska
All students
263.962653
8.0
ALASKA
2.0
3
2024
Arizona
All students
269.656164
8.0
ARIZONA
4.0
4
2024
Arkansas
All students
266.194834
8.0
ARKANSAS
5.0
Creating the NAEP Math Performance Fact Table
After connecting NAEP records to the year and state dimensions, I create a warehouse-ready fact table for Grade 8 mathematics performance.
This table stores state-level NAEP math scores by reporting year and connects to the warehouse using surrogate keys.
# Create a NAEP math performance fact table using warehouse surrogate keys.fact_naep_math_performance = naep_warehouse_df[ ["state_key","year_key","jurisdiction","student_group","average_scale_score" ]].copy()# Remove records that did not match the state dimension.# This excludes national aggregate rows and other non-state records.fact_naep_math_performance = fact_naep_math_performance.dropna( subset=["state_key","year_key","average_scale_score" ])# Convert surrogate keys to integers after removing missing values.fact_naep_math_performance["state_key"] = fact_naep_math_performance["state_key"].astype(int)fact_naep_math_performance["year_key"] = fact_naep_math_performance["year_key"].astype(int)# Display the NAEP math performance fact table.fact_naep_math_performance.head()
state_key
year_key
jurisdiction
student_group
average_scale_score
1
1
8
Alabama
All students
261.769896
2
2
8
Alaska
All students
263.962653
3
4
8
Arizona
All students
269.656164
4
5
8
Arkansas
All students
266.194834
5
7
8
California
All students
268.779818
Loading the NAEP Math Performance Fact Table into PostgreSQL
After creating the NAEP math performance fact table, I load it into PostgreSQL so it can be queried with SQL and used for longitudinal educational outcome analysis.
# Load the NAEP math performance fact table into PostgreSQL.fact_naep_math_performance.to_sql( name="fact_naep_math_performance", con=engine, schema="public", if_exists="replace", index=False)# Confirm successful PostgreSQL table creation.print("NAEP math performance fact table loaded successfully into PostgreSQL.")
NAEP math performance fact table loaded successfully into PostgreSQL.
Verifying the NAEP Math Performance Fact Table in PostgreSQL
To confirm the NAEP math performance fact table loaded correctly, I query PostgreSQL and preview the stored educational outcome records.
# Query PostgreSQL to preview the NAEP math performance fact table.naep_math_preview_query ="""SELECT *FROM public.fact_naep_math_performanceLIMIT 10;"""# Display the NAEP math performance records.pd.read_sql(naep_math_preview_query, engine)
state_key
year_key
jurisdiction
student_group
average_scale_score
0
1
8
Alabama
All students
261.769896296059
1
2
8
Alaska
All students
263.962652757225
2
4
8
Arizona
All students
269.656164092934
3
5
8
Arkansas
All students
266.194833728906
4
7
8
California
All students
268.779817602202
5
8
8
Colorado
All students
278.107689530909
6
9
8
Connecticut
All students
276.674248483695
7
10
8
Delaware
All students
263.057137020663
8
11
8
District of Columbia
All students
261.538791377155
9
12
8
Florida
All students
267.242907869627
Analyzing Top States by NAEP Grade 8 Math Performance
To begin analyzing educational outcomes, I query PostgreSQL to identify the top 10 states by Grade 8 NAEP mathematics average scale score.
# SQL query to identify the top 10 states by NAEP Grade 8 math performance.top_naep_math_states_query = read_sql_file("naep_top_states_2024.sql")# Display the top 10 states by NAEP Grade 8 math score.pd.read_sql(top_naep_math_states_query, engine)
state_name
reporting_year
average_scale_score
0
MASSACHUSETTS
2024
283.465558985794
1
WISCONSIN
2024
282.649633523366
2
MINNESOTA
2024
282.087420612381
3
UTAH
2024
281.79013940125
4
NEW JERSEY
2024
281.684697993032
5
SOUTH DAKOTA
2024
281.101137171665
6
NEBRASKA
2024
279.8834957478
7
NORTH DAKOTA
2024
279.765445425712
8
NEW HAMPSHIRE
2024
279.7446422881
9
MONTANA
2024
279.082095551632
Visualizing Top NAEP Grade 8 Math Performance States
To better interpret educational outcome performance, I create a visualization showing the top 10 states by 2024 NAEP Grade 8 mathematics average scale score.
# Store the SQL query result in a DataFrame.top_naep_math_states_df = pd.read_sql( top_naep_math_states_query, engine)# Convert scores to numeric values and round for readability.top_naep_math_states_df["average_scale_score"] = pd.to_numeric( top_naep_math_states_df["average_scale_score"]).round(2)# Sort values so the highest-performing states appear at the top.top_naep_math_states_df = top_naep_math_states_df.sort_values( by="average_scale_score", ascending=True)# Set a focused baseline so differences between close NAEP scores are visible.score_baseline =260# Create the chart.plt.figure(figsize=(14, 7))plt.barh( top_naep_math_states_df["state_name"], top_naep_math_states_df["average_scale_score"] - score_baseline, left=score_baseline)# Add score labels at the end of each bar.for index, value inenumerate(top_naep_math_states_df["average_scale_score"]): plt.text( value +0.15, index,f"{value:.2f}", va="center" )# Force the state names to display on the y-axis.plt.yticks(range(len(top_naep_math_states_df)), top_naep_math_states_df["state_name"])# Focus the x-axis range for better comparison visibility.plt.xlim(278, 285)# Add chart title and labels.plt.title("Top 10 States by 2024 NAEP Grade 8 Math Scores")plt.xlabel("Average Scale Score")plt.ylabel("State")# Add enough left margin so state names do not get clipped.plt.subplots_adjust(left=0.28)# Display chart.plt.show()
Analyzing Lowest States by NAEP Grade 8 Math Performance
To better understand educational performance variation across states, I also analyze the lowest-performing states by 2024 NAEP Grade 8 mathematics average scale score.
# SQL query to identify the lowest 10 states by NAEP Grade 8 math performance.lowest_naep_math_states_query = read_sql_file("naep_lowest_states_2024.sql")# Display the lowest-performing states by NAEP Grade 8 math score.pd.read_sql(lowest_naep_math_states_query, engine)
state_name
reporting_year
average_scale_score
0
PUERTO RICO
2024
216.087040632243
1
NEW MEXICO
2024
256.210854081077
2
WEST VIRGINIA
2024
260.769078185284
3
DISTRICT OF COLUMBIA
2024
261.538791377155
4
ALABAMA
2024
261.769896296059
5
DELAWARE
2024
263.057137020663
6
ALASKA
2024
263.962652757225
7
OKLAHOMA
2024
264.454031154883
8
NEVADA
2024
265.177161678326
9
ARKANSAS
2024
266.194833728906
Visualizing Lowest NAEP Grade 8 Math Performance States
To better understand lower-performing educational outcome trends, I create a visualization showing the lowest-performing states by 2024 NAEP Grade 8 mathematics average scale score. Puerto Rico appears in the NAEP source data but is excluded from this state-level visualization because the chart focuses on U.S. states only.
# Store the SQL query result in a DataFrame.lowest_naep_math_states_df = pd.read_sql( lowest_naep_math_states_query, engine)# Convert scores to numeric values and round for readability.lowest_naep_math_states_df["average_scale_score"] = pd.to_numeric( lowest_naep_math_states_df["average_scale_score"]).round(2)# Remove Puerto Rico so the visualization scale remains readable.lowest_naep_math_states_df = lowest_naep_math_states_df[ lowest_naep_math_states_df["state_name"] !="PUERTO RICO"]# Sort values for visualization.lowest_naep_math_states_df = lowest_naep_math_states_df.sort_values( by="average_scale_score", ascending=True)# Create the chart.plt.figure(figsize=(14, 7))plt.barh( lowest_naep_math_states_df["state_name"], lowest_naep_math_states_df["average_scale_score"])# Add score labels.for index, value inenumerate(lowest_naep_math_states_df["average_scale_score"]): plt.text( value +0.1, index,f"{value:.2f}", va="center" )# Focus the x-axis range.plt.xlim(255, 267)# Add chart labels and title.plt.title("Lowest Performing States by 2024 NAEP Grade 8 Math Scores")plt.xlabel("Average Scale Score")plt.ylabel("State")# Improve spacing so labels fit properly.plt.subplots_adjust(left=0.28)# Display chart.plt.show()
Analyzing NAEP Grade 8 Math Trends Over Time
To begin longitudinal analysis, I query PostgreSQL to calculate the national trend in NAEP Grade 8 mathematics average scale scores across reporting years.
# SQL query to analyze the national NAEP Grade 8 math trend over time.naep_national_trend_query = read_sql_file("naep_national_trend.sql")# Display the national NAEP math performance trend.pd.read_sql(naep_national_trend_query, engine)
reporting_year
average_scale_score
0
2009
298.854346618482
1
2011
298.512431105164
2
2013
300.568235011147
3
2015
296.908571436848
4
2017
297.0422252108
5
2019
294.474574539442
6
2022
283.587656613142
7
2024
283.465558985794
Visualizing NAEP Grade 8 Math Trends Over Time
To better understand longitudinal educational performance patterns, I create a time-series visualization showing Massachusetts NAEP Grade 8 mathematics performance trends across reporting years.
# Store the SQL query result in a DataFrame.naep_trend_df = pd.read_sql( naep_national_trend_query, engine)# Convert values to numeric types.naep_trend_df["reporting_year"] = pd.to_numeric( naep_trend_df["reporting_year"])naep_trend_df["average_scale_score"] = pd.to_numeric( naep_trend_df["average_scale_score"]).round(2)# Create the visualization.plt.figure(figsize=(12, 6))plt.plot( naep_trend_df["reporting_year"], naep_trend_df["average_scale_score"], marker="o", linewidth=3)# Add labels to each data point.for x, y inzip( naep_trend_df["reporting_year"], naep_trend_df["average_scale_score"]): plt.text( x, y +0.4,f"{y:.2f}", ha="center" )# Add chart title and labels.plt.title("Massachusetts NAEP Grade 8 Math Trends Over Time")plt.xlabel("Reporting Year")plt.ylabel("Average Scale Score")# Improve readability.plt.grid(True)# Display chart.plt.show()
Educational Trend Interpretation
The Massachusetts NAEP Grade 8 mathematics trend analysis shows relatively stable high performance from 2009 through 2019, followed by a substantial decline beginning in 2022.
Several potential contributing factors may explain this pattern:
Post-pandemic educational disruption
Remote learning transition challenges
Unequal access to educational technology resources
Student engagement and learning loss
Changes in instructional delivery models
This longitudinal trend demonstrates how educational outcome analysis can be integrated into a data warehouse environment to support policy evaluation, educational research, and technology impact assessment initiatives.
Comparing Longitudinal NAEP Trends Across Multiple States
To better understand variation in educational outcomes across states, I compare long-term NAEP Grade 8 mathematics trends for several high-performing states.
# SQL query to compare longitudinal trends across multiple states.multi_state_trend_query = read_sql_file("naep_multi_state_trend.sql")# Display the multi-state trend analysis dataset.pd.read_sql(multi_state_trend_query, engine)
state_name
reporting_year
average_scale_score
0
MASSACHUSETTS
2009
298.854346618482
1
MASSACHUSETTS
2011
298.512431105164
2
MASSACHUSETTS
2013
300.568235011147
3
MASSACHUSETTS
2015
296.908571436848
4
MASSACHUSETTS
2017
297.0422252108
5
MASSACHUSETTS
2019
294.474574539442
6
MASSACHUSETTS
2022
283.587656613142
7
MASSACHUSETTS
2024
283.465558985794
8
MINNESOTA
2009
294.443320106986
9
MINNESOTA
2011
294.946422810345
10
MINNESOTA
2013
294.592972603507
11
MINNESOTA
2015
294.147777973073
12
MINNESOTA
2017
293.962375743927
13
MINNESOTA
2019
290.785141202164
14
MINNESOTA
2022
280.066020971965
15
MINNESOTA
2024
282.087420612381
16
NEW JERSEY
2009
292.657227109585
17
NEW JERSEY
2011
294.138771366004
18
NEW JERSEY
2013
296.053351131027
19
NEW JERSEY
2015
293.36593817877
20
NEW JERSEY
2017
291.698552399307
21
NEW JERSEY
2019
291.822771925628
22
NEW JERSEY
2022
280.886098306047
23
NEW JERSEY
2024
281.684697993032
24
UTAH
2009
284.068251870423
25
UTAH
2011
283.308332086503
26
UTAH
2013
284.331486246838
27
UTAH
2015
286.116911188659
28
UTAH
2017
286.816559110458
29
UTAH
2019
284.930389419443
30
UTAH
2022
282.211027508016
31
UTAH
2024
281.79013940125
Visualizing Multi-State NAEP Grade 8 Math Trends
To compare educational performance patterns over time, I create a line chart showing NAEP Grade 8 mathematics trends across multiple high-performing states.
# Store the multi-state SQL query result in a DataFrame.multi_state_trend_df = pd.read_sql( multi_state_trend_query, engine)# Convert columns to numeric values for plotting.multi_state_trend_df["reporting_year"] = pd.to_numeric( multi_state_trend_df["reporting_year"])multi_state_trend_df["average_scale_score"] = pd.to_numeric( multi_state_trend_df["average_scale_score"]).round(2)# Create the multi-state trend visualization.plt.figure(figsize=(12, 6))for state_name in multi_state_trend_df["state_name"].unique(): state_data = multi_state_trend_df[ multi_state_trend_df["state_name"] == state_name ] plt.plot( state_data["reporting_year"], state_data["average_scale_score"], marker="o", linewidth=2, label=state_name )# Add chart title and labels.plt.title("NAEP Grade 8 Math Trends Across Selected States")plt.xlabel("Reporting Year")plt.ylabel("Average Scale Score")# Add legend and grid.plt.legend()plt.grid(True)# Improve spacing.plt.tight_layout()# Display chart.plt.show()
Comparative Educational Trend Interpretation
The multi-state NAEP trend analysis reveals several important educational performance patterns across high-performing states.
Key observations include:
Massachusetts consistently maintains the highest overall mathematics performance levels.
Multiple states experienced substantial score declines beginning after 2019.
The post-pandemic educational disruption appears visible across nearly all analyzed states.
Some states demonstrate stronger recovery stabilization between 2022 and 2024.
Longitudinal educational outcome analysis provides valuable insight into statewide instructional resilience and educational system performance.
This comparative analysis demonstrates how dimensional warehouse modeling and longitudinal analytics can support large-scale educational research and policy evaluation initiatives.
Analyzing Post-Pandemic NAEP Math Score Changes
To quantify the post-pandemic change in educational outcomes, I compare each selected state’s 2019 NAEP Grade 8 mathematics score against its 2024 score.
# SQL query to compare 2019 and 2024 NAEP Grade 8 math scores.post_pandemic_change_query = read_sql_file("naep_post_pandemic_delta.sql")# Display the post-pandemic score change results.pd.read_sql(post_pandemic_change_query, engine)
state_name
score_2019
score_2024
score_change
0
MASSACHUSETTS
294.474575
283.465559
-11.009016
1
NEW JERSEY
291.822772
281.684698
-10.138074
2
MINNESOTA
290.785141
282.087421
-8.697721
3
UTAH
284.930389
281.790139
-3.140250
Data Extraction — CRDC LEA Characteristics Dataset
To begin integrating education technology and district-level characteristics into the warehouse, I load the CRDC LEA Characteristics dataset.
This dataset will help support future analysis of district characteristics, technology access indicators, and educational equity patterns.
# Define the relative path to the CRDC LEA Characteristics dataset.crdc_file_path ="../data/raw/crdc/LEA Characteristics.csv"# Load the CRDC CSV file into a pandas DataFrame.df_crdc = pd.read_csv(crdc_file_path)# Display the first five records to confirm the dataset loaded correctly.df_crdc.head()
C:\Users\User\AppData\Local\Temp\ipykernel_21316\130108529.py:5: DtypeWarning: Columns (0: LEAID) have mixed types. Specify dtype option on import or set low_memory=False.
df_crdc = pd.read_csv(crdc_file_path)
LEA_STATE
LEA_STATE_NAME
LEAID
LEA_NAME
LEA_ADDRESS
LEA_CITY
LEA_ZIP
CJJ
LEA_CRCOORD_SEX_IND
LEA_CRCOORD_RAC_IND
...
LEA_HBPOLICY_IND
LEA_HBPOLICYURL_IND
LEA_HBPOLICY_URL
LEA_ENR
LEA_PSENR_A3
LEA_PSENR_A4
LEA_PSENR_A5
LEA_ENR_NONLEAFAC
LEA_SCHOOLS
LEA_PS_IND
0
MA
MASSACHUSETTS
2506090
Hingham
220 Central Street
Hingham
2043
No
Yes
Yes
...
Yes
Yes
https://z2policy.ctspublish.com/masc/browse/hi...
3912
0
0
24
38
6
Yes
1
MI
MICHIGAN
2627210
Owosso Public Schools
645 Alger Ave
Owosso
48867
No
Yes
Yes
...
Yes
Yes
owosso.k12.mi.us
2994
34
41
0
0
7
Yes
2
TX
TEXAS
4800278
CROSSTIMBERS ACADEMY
236 HARMONY RD
WEATHERFORD
76087
No
Yes
Yes
...
Yes
Yes
https://ctacharter.com/
149
-9
-9
-9
0
1
No
3
OK
OKLAHOMA
4012390
GANS
204 N STACY
GANS
74936
No
Yes
Yes
...
Yes
Yes
www.gans.k12.ok.us
351
16
22
30
0
2
Yes
4
CA
CALIFORNIA
0619230
Junction Elementary
98821 Highway 96
Somes Bar
95568
No
Yes
Yes
...
Yes
No
-9
16
-9
-9
-9
0
1
No
5 rows × 40 columns
Initial Data Inspection — CRDC Dataset
In this step, I inspect the CRDC dataset to understand its size, structure, and available fields before selecting columns for warehouse integration.
# Display the number of rows and columns in the CRDC dataset.crdc_row_count, crdc_column_count = df_crdc.shapeprint(f"Number of CRDC rows: {crdc_row_count}")print(f"Number of CRDC columns: {crdc_column_count}")# Display all CRDC column names to identify fields useful for district characteristics and technology access analysis.df_crdc.columns.tolist()
Number of CRDC rows: 17704
Number of CRDC columns: 40
After inspecting the CRDC dataset, I select district-level fields that can support warehouse integration and future educational equity analysis.
# Select CRDC district characteristic columns for warehouse preparation.crdc_columns = ["LEAID","LEA_NAME","LEA_STATE","LEA_STATE_NAME","LEA_CITY","LEA_ENR","LEA_SCHOOLS","LEA_DESEGPLAN","LEA_HBPOLICY_IND","LEA_PS_IND"]# Create a district-focused CRDC DataFrame.df_crdc_districts = df_crdc[crdc_columns].copy()# Display the selected CRDC district characteristic fields.df_crdc_districts.head()
LEAID
LEA_NAME
LEA_STATE
LEA_STATE_NAME
LEA_CITY
LEA_ENR
LEA_SCHOOLS
LEA_DESEGPLAN
LEA_HBPOLICY_IND
LEA_PS_IND
0
2506090
Hingham
MA
MASSACHUSETTS
Hingham
3912
6
No
Yes
Yes
1
2627210
Owosso Public Schools
MI
MICHIGAN
Owosso
2994
7
No
Yes
Yes
2
4800278
CROSSTIMBERS ACADEMY
TX
TEXAS
WEATHERFORD
149
1
No
Yes
No
3
4012390
GANS
OK
OKLAHOMA
GANS
351
2
No
Yes
Yes
4
0619230
Junction Elementary
CA
CALIFORNIA
Somes Bar
16
1
No
Yes
No
Cleaning the CRDC District Characteristics Dataset
Before loading CRDC district characteristics into PostgreSQL, I standardize column names and remove duplicate district records.
# Create a copy of the selected CRDC district fields.df_crdc_district_dim = df_crdc_districts.copy()# Remove duplicate district records using the LEAID identifier.df_crdc_district_dim = df_crdc_district_dim.drop_duplicates(subset=["LEAID"])# Standardize column names for PostgreSQL compatibility.df_crdc_district_dim.columns = ["district_id","district_name","state_code","state_name","city","enrollment","school_count","desegregation_plan_indicator","harassment_bullying_policy_indicator","preschool_indicator"]# Reset the index after cleaning.df_crdc_district_dim = df_crdc_district_dim.reset_index(drop=True)# Display the cleaned CRDC district characteristics dataset.df_crdc_district_dim.head()
district_id
district_name
state_code
state_name
city
enrollment
school_count
desegregation_plan_indicator
harassment_bullying_policy_indicator
preschool_indicator
0
2506090
Hingham
MA
MASSACHUSETTS
Hingham
3912
6
No
Yes
Yes
1
2627210
Owosso Public Schools
MI
MICHIGAN
Owosso
2994
7
No
Yes
Yes
2
4800278
CROSSTIMBERS ACADEMY
TX
TEXAS
WEATHERFORD
149
1
No
Yes
No
3
4012390
GANS
OK
OKLAHOMA
GANS
351
2
No
Yes
Yes
4
0619230
Junction Elementary
CA
CALIFORNIA
Somes Bar
16
1
No
Yes
No
Loading CRDC District Characteristics into PostgreSQL
After cleaning the CRDC district characteristics dataset, I load it into PostgreSQL as a warehouse table for future district-level enrichment analysis.
# Load the cleaned CRDC district characteristics table into PostgreSQL.df_crdc_district_dim.to_sql( name="dim_crdc_district_characteristics", con=engine, schema="public", if_exists="replace", index=False)# Confirm successful PostgreSQL table creation.print("CRDC district characteristics table loaded successfully into PostgreSQL.")
CRDC district characteristics table loaded successfully into PostgreSQL.
Verifying the CRDC District Characteristics Table in PostgreSQL
To confirm the CRDC district characteristics table loaded correctly, I query PostgreSQL and preview the stored records.
# Query PostgreSQL to preview the CRDC district characteristics table.crdc_preview_query ="""SELECT *FROM public.dim_crdc_district_characteristicsLIMIT 10;"""# Display the CRDC district characteristics records.pd.read_sql(crdc_preview_query, engine)
district_id
district_name
state_code
state_name
city
enrollment
school_count
desegregation_plan_indicator
harassment_bullying_policy_indicator
preschool_indicator
0
2506090
Hingham
MA
MASSACHUSETTS
Hingham
3912
6
No
Yes
Yes
1
2627210
Owosso Public Schools
MI
MICHIGAN
Owosso
2994
7
No
Yes
Yes
2
4800278
CROSSTIMBERS ACADEMY
TX
TEXAS
WEATHERFORD
149
1
No
Yes
No
3
4012390
GANS
OK
OKLAHOMA
GANS
351
2
No
Yes
Yes
4
0619230
Junction Elementary
CA
CALIFORNIA
Somes Bar
16
1
No
Yes
No
5
2913290
GREEN RIDGE R-VIII
MO
MISSOURI
GREEN RIDGE
350
2
No
Yes
No
6
3000004
Mountain View Elem
MT
MONTANA
Cut Bank
11
1
No
Yes
No
7
0805940
Otis School District No. R-3
CO
COLORADO
OTIS
211
2
No
Yes
Yes
8
0602174
iLead Agua Dulce District
CA
CALIFORNIA
Agua Dulce
320
1
No
Yes
No
9
0624600
Merced City Elementary
CA
CALIFORNIA
Merced
10922
19
No
Yes
Yes
Validating the CRDC District Characteristics Dataset
Before using the CRDC district characteristics data for analysis, I perform validation checks to confirm completeness and uniqueness.
# Count missing district identifiers.crdc_missing_district_ids = df_crdc_district_dim["district_id"].isnull().sum()# Count missing state names.crdc_missing_state_names = df_crdc_district_dim["state_name"].isnull().sum()# Count missing enrollment values.crdc_missing_enrollment = df_crdc_district_dim["enrollment"].isnull().sum()# Count duplicate district identifiers.crdc_duplicate_district_ids = df_crdc_district_dim["district_id"].duplicated().sum()# Display validation results.crdc_validation_results = pd.DataFrame({"validation_check": ["Missing district IDs","Missing state names","Missing enrollment values","Duplicate district IDs" ],"result": [ crdc_missing_district_ids, crdc_missing_state_names, crdc_missing_enrollment, crdc_duplicate_district_ids ]})crdc_validation_results
validation_check
result
0
Missing district IDs
0
1
Missing state names
0
2
Missing enrollment values
0
3
Duplicate district IDs
0
Analyzing CRDC District Enrollment by State
To begin using the CRDC district characteristics table analytically, I summarize total district enrollment by state.
# SQL query to summarize total CRDC enrollment by state.crdc_enrollment_by_state_query = read_sql_file("crdc_enrollment_by_state.sql")# Display the top states by total CRDC district enrollment.pd.read_sql(crdc_enrollment_by_state_query, engine)
state_name
total_enrollment
0
CALIFORNIA
5958775.0
1
TEXAS
5419285.0
2
FLORIDA
2961487.0
3
NEW YORK
2745258.0
4
ILLINOIS
1882620.0
5
GEORGIA
1751403.0
6
OHIO
1717635.0
7
PENNSYLVANIA
1707999.0
8
NORTH CAROLINA
1537968.0
9
MICHIGAN
1441417.0
Visualizing CRDC District Enrollment by State
To better understand district enrollment distribution, I create a visualization showing the states with the highest total CRDC district enrollment.
# Store the CRDC enrollment query results in a DataFrame.crdc_enrollment_df = pd.read_sql( crdc_enrollment_by_state_query, engine)# Convert enrollment values to numeric.crdc_enrollment_df["total_enrollment"] = pd.to_numeric( crdc_enrollment_df["total_enrollment"])# Sort values for visualization.crdc_enrollment_df = crdc_enrollment_df.sort_values( by="total_enrollment", ascending=True)# Create the chart.plt.figure(figsize=(12, 6))plt.barh( crdc_enrollment_df["state_name"], crdc_enrollment_df["total_enrollment"])# Add chart title and labels.plt.title("Top States by Total CRDC District Enrollment")plt.xlabel("Total Enrollment")plt.ylabel("State")# Improve spacing.plt.tight_layout()# Display chart.plt.show()
Preparing CRDC and NAEP State-Level Comparison Data
To begin connecting district characteristics with educational outcomes, I aggregate CRDC enrollment data to the state level and prepare it for comparison with NAEP state performance results.
# SQL query to aggregate CRDC district enrollment to the state level.crdc_state_enrollment_query ="""SELECT state_name, SUM(CAST(enrollment AS NUMERIC)) AS total_enrollmentFROM public.dim_crdc_district_characteristicsGROUP BY state_nameORDER BY state_name;"""# Display state-level CRDC enrollment totals.pd.read_sql(crdc_state_enrollment_query, engine)
state_name
total_enrollment
0
ALABAMA
738221.0
1
ALASKA
129730.0
2
ARIZONA
1127982.0
3
ARKANSAS
490007.0
4
CALIFORNIA
5958775.0
5
COLORADO
886633.0
6
CONNECTICUT
510407.0
7
DELAWARE
140631.0
8
DISTRICT OF COLUMBIA
89300.0
9
FLORIDA
2961487.0
10
GEORGIA
1751403.0
11
HAWAII
173178.0
12
IDAHO
315523.0
13
ILLINOIS
1882620.0
14
INDIANA
1038088.0
15
IOWA
510553.0
16
KANSAS
481676.0
17
KENTUCKY
653132.0
18
LOUISIANA
664811.0
19
MAINE
164334.0
20
MARYLAND
883833.0
21
MASSACHUSETTS
919041.0
22
MICHIGAN
1441417.0
23
MINNESOTA
890878.0
24
MISSISSIPPI
485450.0
25
MISSOURI
899203.0
26
MONTANA
160320.0
27
NEBRASKA
325764.0
28
NEVADA
486753.0
29
NEW HAMPSHIRE
171005.0
30
NEW JERSEY
1377021.0
31
NEW MEXICO
309714.0
32
NEW YORK
2745258.0
33
NORTH CAROLINA
1537968.0
34
NORTH DAKOTA
116594.0
35
OHIO
1717635.0
36
OKLAHOMA
699604.0
37
OREGON
553921.0
38
PENNSYLVANIA
1707999.0
39
PUERTO RICO
259535.0
40
RHODE ISLAND
181359.0
41
SOUTH CAROLINA
742636.0
42
SOUTH DAKOTA
141475.0
43
TENNESSEE
989550.0
44
TEXAS
5419285.0
45
UTAH
705970.0
46
VERMONT
82602.0
47
VIRGINIA
1291606.0
48
WASHINGTON
1091100.0
49
WEST VIRGINIA
248994.0
50
WISCONSIN
829306.0
51
WYOMING
92493.0
Joining CRDC Enrollment with 2024 NAEP Math Performance
To begin comparing district characteristics with educational outcomes, I join state-level CRDC enrollment totals with 2024 NAEP Grade 8 mathematics performance scores.
# SQL query to join CRDC state enrollment totals with 2024 NAEP math performance.crdc_naep_comparison_query = read_sql_file("crdc_naep_comparison.sql")# Display the joined CRDC and NAEP comparison dataset.pd.read_sql(crdc_naep_comparison_query, engine)
state_name
total_enrollment
average_scale_score
0
MASSACHUSETTS
919041.0
283.465559
1
WISCONSIN
829306.0
282.649634
2
MINNESOTA
890878.0
282.087421
3
UTAH
705970.0
281.790139
4
NEW JERSEY
1377021.0
281.684698
5
SOUTH DAKOTA
141475.0
281.101137
6
NEBRASKA
325764.0
279.883496
7
NORTH DAKOTA
116594.0
279.765445
8
NEW HAMPSHIRE
171005.0
279.744642
9
MONTANA
160320.0
279.082096
10
OHIO
1717635.0
278.812721
11
WYOMING
92493.0
278.315324
12
INDIANA
1038088.0
278.185064
13
COLORADO
886633.0
278.107690
14
IDAHO
315523.0
278.104479
15
ILLINOIS
1882620.0
277.416366
16
CONNECTICUT
510407.0
276.674248
17
PENNSYLVANIA
1707999.0
276.245915
18
TENNESSEE
989550.0
275.982988
19
NORTH CAROLINA
1537968.0
275.821518
20
VERMONT
82602.0
275.617661
21
VIRGINIA
1291606.0
274.988690
22
IOWA
510553.0
274.823931
23
KANSAS
481676.0
274.136728
24
WASHINGTON
1091100.0
273.665174
25
MAINE
164334.0
272.621883
26
NEW YORK
2745258.0
270.851724
27
KENTUCKY
653132.0
270.813300
28
MISSOURI
899203.0
270.363197
29
HAWAII
173178.0
270.041103
30
MICHIGAN
1441417.0
269.978774
31
RHODE ISLAND
181359.0
269.826789
32
ARIZONA
1127982.0
269.656164
33
TEXAS
5419285.0
269.393041
34
MISSISSIPPI
485450.0
269.060485
35
CALIFORNIA
5958775.0
268.779818
36
GEORGIA
1751403.0
268.655805
37
MARYLAND
883833.0
268.213760
38
SOUTH CAROLINA
742636.0
268.006371
39
OREGON
553921.0
267.940476
40
FLORIDA
2961487.0
267.242908
41
LOUISIANA
664811.0
266.783264
42
ARKANSAS
490007.0
266.194834
43
NEVADA
486753.0
265.177162
44
OKLAHOMA
699604.0
264.454031
45
ALASKA
129730.0
263.962653
46
DELAWARE
140631.0
263.057137
47
ALABAMA
738221.0
261.769896
48
DISTRICT OF COLUMBIA
89300.0
261.538791
49
WEST VIRGINIA
248994.0
260.769078
50
NEW MEXICO
309714.0
256.210854
51
PUERTO RICO
259535.0
216.087041
Visualizing CRDC Enrollment and NAEP Math Performance
To begin exploring relationships between district characteristics and educational outcomes, I create a scatterplot comparing total CRDC district enrollment with 2024 NAEP Grade 8 mathematics performance.
# Store the joined CRDC and NAEP comparison query result in a DataFrame.crdc_naep_comparison_df = pd.read_sql( crdc_naep_comparison_query, engine)# Convert numeric fields.crdc_naep_comparison_df["total_enrollment"] = pd.to_numeric( crdc_naep_comparison_df["total_enrollment"])crdc_naep_comparison_df["average_scale_score"] = pd.to_numeric( crdc_naep_comparison_df["average_scale_score"]).round(2)# Select the top 10 states by enrollment.top_enrollment_states = crdc_naep_comparison_df.nlargest(10,"total_enrollment")# Sort values for visualization.top_enrollment_states = top_enrollment_states.sort_values( by="total_enrollment", ascending=True)# Create the chart.plt.figure(figsize=(12, 6))bars = plt.barh( top_enrollment_states["state_name"], top_enrollment_states["total_enrollment"])# Add NAEP score labels to bars.for bar, score inzip( bars, top_enrollment_states["average_scale_score"]): plt.text( bar.get_width(), bar.get_y() + bar.get_height() /2,f" NAEP: {score}", va="center" )# Add chart title and labels.plt.title("Top States by Enrollment with 2024 NAEP Math Scores")plt.xlabel("Total State Enrollment")plt.ylabel("State")# Add additional chart spacing for labels.plt.xlim(0, top_enrollment_states["total_enrollment"].max() *1.15)# Improve spacing.plt.tight_layout()# Display chart.plt.show()
CRDC and NAEP Comparison Interpretation
This comparison connects CRDC district enrollment totals with 2024 NAEP Grade 8 mathematics performance scores.
The purpose of this step is not to prove a causal relationship, but to begin exploring whether large enrollment systems show different performance patterns than smaller state systems.
Key observations from this comparison include:
Larger enrollment states do not automatically produce higher NAEP mathematics scores.
Some high-enrollment states perform near the middle of the NAEP score range.
Smaller or mid-sized states may outperform larger systems on average scale scores.
Enrollment size alone is not enough to explain academic performance outcomes.
Future CRDC technology-access indicators will provide stronger explanatory variables for technology impact analysis.
This step establishes the first bridge between district characteristics and educational outcomes inside the warehouse.
Analyzing Enrollment Size and Educational Performance
Now I want to compare state enrollment totals against 2024 NAEP Grade 8 math scores. Two changes from the obvious approach matter here.
First, I use pd.qcut instead of pd.cut for the size buckets. With states like California and Texas at the top of the distribution, equal-width bins would put roughly two states in “Large” and everything else in “Small” — the means wouldn’t be comparable. Quantile-based bins put a roughly equal count of states in each bucket, which makes the group averages meaningful.
Second, I compute a Spearman rank correlation alongside the bucket summary. The buckets give a quick visual; the correlation gives an honest single number for whether enrollment scale and NAEP scores actually move together.
from scipy.stats import spearmanr# Working copy of the joined CRDC enrollment + NAEP 2024 dataset.enrollment_performance_analysis_df = crdc_naep_comparison_df.copy()# Quantile-based size buckets: each tercile holds roughly the same number of states.enrollment_performance_analysis_df["enrollment_tercile"] = pd.qcut( enrollment_performance_analysis_df["total_enrollment"], q=3, labels=["Small (bottom third)", "Medium (middle third)", "Large (top third)"])# Mean score plus state count per bucket — count matters; it tells you how# unbalanced the buckets actually are.enrollment_performance_summary = ( enrollment_performance_analysis_df .groupby("enrollment_tercile", observed=True) .agg( states_in_bucket=("state_name", "count"), mean_naep_score=("average_scale_score", "mean"), ) .reset_index())# Spearman rank correlation between enrollment and NAEP score.# Rank-based, so robust to California/Texas being extreme outliers.rho, p_value = spearmanr( enrollment_performance_analysis_df["total_enrollment"], enrollment_performance_analysis_df["average_scale_score"],)print(enrollment_performance_summary.to_string(index=False))print()print(f"Spearman rank correlation: rho = {rho:.3f}, p-value = {p_value:.3f}")
enrollment_tercile states_in_bucket mean_naep_score
Small (bottom third) 18 268.881667
Medium (middle third) 17 271.464118
Large (top third) 17 274.167059
Spearman rank correlation: rho = 0.104, p-value = 0.462
Visualizing Enrollment Size and Educational Performance
A scatter plot is the right shape for this question. Each point is one state; the x-axis is total enrollment, the y-axis is the 2024 NAEP Grade 8 math score. If there’s a relationship, you see it in the cloud.
fig, ax = plt.subplots(figsize=(11, 6))ax.scatter( enrollment_performance_analysis_df["total_enrollment"], enrollment_performance_analysis_df["average_scale_score"], s=40,)# Label the most extreme points so the chart isn't anonymous.extremes = pd.concat([ enrollment_performance_analysis_df.nlargest(3, "total_enrollment"), enrollment_performance_analysis_df.nsmallest(3, "total_enrollment"), enrollment_performance_analysis_df.nlargest(2, "average_scale_score"), enrollment_performance_analysis_df.nsmallest(2, "average_scale_score"),]).drop_duplicates(subset=["state_name"])for _, row in extremes.iterrows(): ax.annotate( row["state_name"].title(), (row["total_enrollment"], row["average_scale_score"]), xytext=(5, 5), textcoords="offset points", fontsize=9, )ax.set_title(f"State Enrollment vs. 2024 NAEP Grade 8 Math Score (Spearman ρ = {rho:.2f})")ax.set_xlabel("Total State Enrollment (CRDC)")ax.set_ylabel("2024 NAEP Grade 8 Math — Average Scale Score")ax.grid(True, alpha=0.3)plt.tight_layout()plt.show()
Enrollment Size and Educational Performance Interpretation
The Spearman correlation is close to zero, which is the honest answer: across U.S. states in 2024, total enrollment and NAEP Grade 8 math performance are not meaningfully related. Massachusetts is mid-sized and tops the chart; Texas and California are huge and sit near the middle of the score distribution; the highest-scoring small state isn’t dramatically higher than the highest-scoring large state.
A few caveats worth being explicit about:
Aggregate state-level data can’t answer a district-level question. Two states with the same average can have very different distributions underneath.
Total enrollment is a proxy for “system size,” not for any specific causal factor. Funding, demographics, instructional approach, and dozens of other things vary across states.
The bucket means are informative but the buckets are small (~17 states each). Don’t read too much into small differences between bucket means.
The takeaway for this project is more methodological than substantive: enrollment scale alone isn’t useful as an explanatory variable for academic performance, so the more interesting question is whether technology participation — distance education enrollment specifically — shows a relationship. That’s what the next section explores.
Project Summary and Key Findings
This project demonstrates the development of an end-to-end educational technology impact analysis workflow using publicly available education and technology-related datasets.
The primary objective is to evaluate educational outcome trends and explore how district-level characteristics, enrollment patterns, and technology-enabled learning participation may relate to student performance.
To support this analysis, I:
Collected educational and technology-related datasets from multiple public sources
Built a PostgreSQL dimensional warehouse environment
Designed ETL workflows for data integration and transformation
Connected district, state, year, educational outcome, and distance education datasets
Performed SQL-based analytical reporting
Created visualizations to communicate educational trends and comparisons
Integrated CRDC district characteristics and distance education participation metrics into the warehouse environment
Key findings from the current analysis include:
Educational performance varies significantly across states.
Post-pandemic NAEP mathematics scores declined across many educational systems.
Larger enrollment systems do not necessarily produce stronger educational outcomes.
Distance education participation varies widely across states and districts.
Higher distance education participation alone does not automatically correspond to stronger NAEP mathematics outcomes.
Technology adoption appears most meaningful when evaluated alongside implementation quality, access, instructional context, enrollment scale, and educational equity.
This project demonstrates practical experience in data engineering, analytics engineering, PostgreSQL development, ETL pipeline design, SQL analytics, data visualization, and research-oriented reporting workflows.
Loading the CRDC Distance Education Dataset
To evaluate technology adoption and digital learning participation, I load the CRDC Distance Education dataset into the analytics workflow.
# Load the CRDC Distance Education dataset.distance_education_path ="../data/raw/crdc/Distance Education.csv"df_distance_education = pd.read_csv( distance_education_path, encoding="latin1", low_memory=False)# Display dataset dimensions.print("Number of rows:", df_distance_education.shape[0])print("Number of columns:", df_distance_education.shape[1])# Display column names.df_distance_education.columns.tolist()
Cleaning and Standardizing the Distance Education Dataset
Before integrating the distance education dataset into the warehouse workflow, I standardize column names and prepare the dataset for analytical processing.
# Create a cleaned copy of the distance education dataset.df_distance_education_clean = df_distance_education.copy()# Standardize column names.df_distance_education_clean.columns = ( df_distance_education_clean.columns .str.lower() .str.strip())# Rename key columns for readability.df_distance_education_clean = df_distance_education_clean.rename(columns={"lea_state_name": "state_name","leaid": "district_id","lea_name": "district_name","lea_disted_ind": "distance_education_indicator","tot_distedenr_m": "male_distance_education_enrollment","tot_distedenr_f": "female_distance_education_enrollment"})# Display cleaned column names.df_distance_education_clean.columns.tolist()
To begin evaluating technology adoption patterns, I summarize the number of districts reporting distance education participation by state.
# SQL-style aggregation using pandas.distance_education_by_state = ( df_distance_education_clean .groupby("state_name")["distance_education_indicator"] .apply(lambda x: (x =="Yes").sum()) .reset_index(name="districts_using_distance_education"))# Sort results.distance_education_by_state = ( distance_education_by_state .sort_values( by="districts_using_distance_education", ascending=False ))# Display top states using distance education.distance_education_by_state.head(10)
state_name
districts_using_distance_education
44
TEXAS
506
22
MICHIGAN
364
38
PENNSYLVANIA
315
15
IOWA
275
35
OHIO
275
32
NEW YORK
272
3
ARKANSAS
247
4
CALIFORNIA
239
25
MISSOURI
229
50
WISCONSIN
220
Visualizing Distance Education Adoption by State
To better understand technology adoption patterns, I create a visualization showing the states with the highest number of districts participating in distance education.
# Select the top 10 states using distance education.top_distance_education_states = ( distance_education_by_state .head(10) .sort_values( by="districts_using_distance_education", ascending=True ))# Create the chart.plt.figure(figsize=(12, 6))plt.barh( top_distance_education_states["state_name"], top_distance_education_states["districts_using_distance_education"])# Add chart title and labels.plt.title("Top States by District Distance Education Adoption")plt.xlabel("Districts Using Distance Education")plt.ylabel("State")# Improve spacing.plt.tight_layout()# Display chart.plt.show()
Creating the Distance Education Adoption Fact Table
Before aggregating, a data-quality note: the CRDC distance-education file uses -9 as a reserved code for “missing or suppressed value,” not as a real student count. Roughly 65% of the districts in this file (11,384 out of 17,604) report -9 for both the male and female totals. If those codes are treated as numbers and summed, the resulting state totals are nonsense — several states end up with negative student counts. I replace the sentinel codes with NaN so they’re excluded from the sum rather than corrupting it.
# Pull the columns we need from the cleaned distance-ed dataset.fact_distance_education = df_distance_education_clean[ ["district_id","state_name","district_name","distance_education_indicator","male_distance_education_enrollment","female_distance_education_enrollment" ]].copy()# Coerce to numeric. errors="coerce" turns non-numeric strings into NaN.for col in ["male_distance_education_enrollment", "female_distance_education_enrollment"]: fact_distance_education[col] = pd.to_numeric( fact_distance_education[col], errors="coerce" )# Replace CRDC sentinel codes for suppressed/missing data with NaN.# Negative values are not valid student counts in this file.sentinel_codes = [-3, -5, -6, -8, -9, -11]for col in ["male_distance_education_enrollment", "female_distance_education_enrollment"]: fact_distance_education.loc[ fact_distance_education[col].isin(sentinel_codes), col ] = np.nan# Total enrollment per district. With sentinels nulled out, the sum treats# missing values as missing instead of as -9.fact_distance_education["total_distance_education_enrollment"] = ( fact_distance_education["male_distance_education_enrollment"] .add(fact_distance_education["female_distance_education_enrollment"], fill_value=0))# How much of the file was affected? Useful to show explicitly.total_districts =len(fact_distance_education)suppressed_districts = ( fact_distance_education["male_distance_education_enrollment"].isna()& fact_distance_education["female_distance_education_enrollment"].isna()).sum()print(f"Districts in file: {total_districts:,}")print(f"Districts with both M and F suppressed (no usable count): {suppressed_districts:,} "f"({suppressed_districts / total_districts:.1%})")fact_distance_education.head()
Districts in file: 17,604
Districts with both M and F suppressed (no usable count): 11,384 (64.7%)
district_id
state_name
district_name
distance_education_indicator
male_distance_education_enrollment
female_distance_education_enrollment
total_distance_education_enrollment
0
0100002
ALABAMA
Alabama Youth Services
No
NaN
NaN
NaN
1
0100005
ALABAMA
Albertville City
Yes
4.0
14.0
18.0
2
0100006
ALABAMA
Marshall County
Yes
64.0
75.0
139.0
3
0100007
ALABAMA
Hoover City
Yes
218.0
340.0
558.0
4
0100008
ALABAMA
Madison City
Yes
89.0
154.0
243.0
Loading the Distance Education Fact Table into PostgreSQL
After creating the distance education fact table, I load it into PostgreSQL so technology adoption metrics can be queried with SQL and connected to the broader education warehouse.
# Load the distance education fact table into PostgreSQL.fact_distance_education.to_sql( name="fact_distance_education_adoption", con=engine, schema="public", if_exists="replace", index=False)# Confirm successful PostgreSQL table creation.print("Distance education adoption fact table loaded successfully into PostgreSQL.")
Distance education adoption fact table loaded successfully into PostgreSQL.
Verifying the Distance Education Fact Table in PostgreSQL
To confirm the distance education adoption fact table loaded correctly, I query PostgreSQL and preview the stored technology adoption records.
To evaluate the scale of technology-enabled learning participation, I summarize total distance education enrollment by state.
# SQL query to summarize distance education enrollment by state.distance_education_enrollment_query = read_sql_file("distance_ed_top_states.sql")# Display top states by distance education enrollment.pd.read_sql(distance_education_enrollment_query, engine)
state_name
total_distance_education_enrollment
0
TEXAS
103166.0
1
CALIFORNIA
98436.0
2
GEORGIA
72605.0
3
ARKANSAS
59193.0
4
MICHIGAN
51837.0
5
PENNSYLVANIA
48042.0
6
IOWA
43741.0
7
VIRGINIA
42440.0
8
NORTH CAROLINA
38877.0
9
OHIO
38492.0
Visualizing Distance Education Enrollment by State
To better understand technology-enabled learning participation, I create a visualization showing the states with the highest distance education enrollment totals.
# Store the distance education enrollment query results in a DataFrame.distance_education_enrollment_df = pd.read_sql( distance_education_enrollment_query, engine)# Convert numeric values.distance_education_enrollment_df["total_distance_education_enrollment"] = pd.to_numeric( distance_education_enrollment_df["total_distance_education_enrollment" ])# Sort values for visualization.distance_education_enrollment_df = ( distance_education_enrollment_df .sort_values( by="total_distance_education_enrollment", ascending=True ))# Create the chart.plt.figure(figsize=(12, 6))plt.barh( distance_education_enrollment_df["state_name"], distance_education_enrollment_df["total_distance_education_enrollment" ])# Add chart title and labels.plt.title("Top States by Distance Education Enrollment")plt.xlabel("Distance Education Enrollment")plt.ylabel("State")# Improve spacing.plt.tight_layout()# Display chart.plt.show()
Connecting Distance Education Adoption with NAEP Math Performance
To begin evaluating the relationship between technology-enabled learning participation and student outcomes, I join state-level distance education enrollment with 2024 NAEP Grade 8 mathematics performance scores.
# SQL query joining distance education enrollment with 2024 NAEP math scores.distance_education_naep_query = read_sql_file("distance_ed_naep_comparison.sql")# Display the joined distance education and NAEP performance dataset.pd.read_sql(distance_education_naep_query, engine)
state_name
total_distance_education_enrollment
average_scale_score
0
TEXAS
103166.0
269.393041
1
CALIFORNIA
98436.0
268.779818
2
GEORGIA
72605.0
268.655805
3
ARKANSAS
59193.0
266.194834
4
MICHIGAN
51837.0
269.978774
5
PENNSYLVANIA
48042.0
276.245915
6
IOWA
43741.0
274.823931
7
VIRGINIA
42440.0
274.988690
8
NORTH CAROLINA
38877.0
275.821518
9
OHIO
38492.0
278.812721
10
MASSACHUSETTS
34140.0
283.465559
11
ARIZONA
34048.0
269.656164
12
NEVADA
30938.0
265.177162
13
ALABAMA
25343.0
261.769896
14
SOUTH CAROLINA
21815.0
268.006371
15
COLORADO
21727.0
278.107690
16
UTAH
21405.0
281.790139
17
INDIANA
20332.0
278.185064
18
NEW MEXICO
18179.0
256.210854
19
OREGON
18157.0
267.940476
20
LOUISIANA
14579.0
266.783264
21
WASHINGTON
14276.0
273.665174
22
FLORIDA
14121.0
267.242908
23
WISCONSIN
12322.0
282.649634
24
IDAHO
11994.0
278.104479
25
MISSOURI
10067.0
270.363197
26
ILLINOIS
9845.0
277.416366
27
MINNESOTA
9657.0
282.087421
28
NEW YORK
8990.0
270.851724
29
KENTUCKY
8462.0
270.813300
30
MARYLAND
6948.0
268.213760
31
OKLAHOMA
6893.0
264.454031
32
NEW JERSEY
6222.0
281.684698
33
ALASKA
6100.0
263.962653
34
TENNESSEE
5783.0
275.982988
35
NORTH DAKOTA
4195.0
279.765445
36
KANSAS
4096.0
274.136728
37
NEW HAMPSHIRE
3556.0
279.744642
38
MISSISSIPPI
3246.0
269.060485
39
NEBRASKA
3136.0
279.883496
40
WEST VIRGINIA
2750.0
260.769078
41
SOUTH DAKOTA
2739.0
281.101137
42
MONTANA
2510.0
279.082096
43
DELAWARE
1693.0
263.057137
44
WYOMING
1566.0
278.315324
45
MAINE
1395.0
272.621883
46
CONNECTICUT
1297.0
276.674248
47
PUERTO RICO
1190.0
216.087041
48
VERMONT
1051.0
275.617661
49
HAWAII
813.0
270.041103
50
RHODE ISLAND
745.0
269.826789
51
DISTRICT OF COLUMBIA
218.0
261.538791
Visualizing Distance Education Enrollment and NAEP Math Performance
Same approach as the enrollment-size analysis: scatter plot with a Spearman correlation. Each point is one state.
distance_education_naep_df = pd.read_sql(distance_education_naep_query, engine)distance_education_naep_df["total_distance_education_enrollment"] = pd.to_numeric( distance_education_naep_df["total_distance_education_enrollment"])distance_education_naep_df["average_scale_score"] = pd.to_numeric( distance_education_naep_df["average_scale_score"]).round(2)# Spearman correlation between distance-ed enrollment and 2024 NAEP score.rho_de, p_de = spearmanr( distance_education_naep_df["total_distance_education_enrollment"], distance_education_naep_df["average_scale_score"],)fig, ax = plt.subplots(figsize=(11, 6))ax.scatter( distance_education_naep_df["total_distance_education_enrollment"], distance_education_naep_df["average_scale_score"], s=40,)# Annotate the most extreme states on each axis so the chart isn't anonymous.de_extremes = pd.concat([ distance_education_naep_df.nlargest(4, "total_distance_education_enrollment"), distance_education_naep_df.nlargest(2, "average_scale_score"), distance_education_naep_df.nsmallest(2, "average_scale_score"),]).drop_duplicates(subset=["state_name"])for _, row in de_extremes.iterrows(): ax.annotate( row["state_name"].title(), (row["total_distance_education_enrollment"], row["average_scale_score"]), xytext=(5, 5), textcoords="offset points", fontsize=9, )ax.set_title(f"Distance Education Enrollment vs. 2024 NAEP Math Score (Spearman ρ = {rho_de:.2f})")ax.set_xlabel("State Distance Education Enrollment (CRDC, sentinel codes excluded)")ax.set_ylabel("2024 NAEP Grade 8 Math — Average Scale Score")ax.grid(True, alpha=0.3)plt.tight_layout()plt.show()print(f"Spearman: rho = {rho_de:.3f}, p-value = {p_de:.3f}")
Spearman: rho = -0.026, p-value = 0.856
Distance Education and NAEP Performance Interpretation
The headline number: Spearman ρ is near zero, p-value well above 0.05. There’s no meaningful rank correlation between state distance-education enrollment and 2024 NAEP Grade 8 math scores. Texas leads in raw distance-ed enrollment and sits below the median on NAEP; Massachusetts has middling distance-ed enrollment and tops the score chart. The relationship just isn’t there at the state aggregate level.
A few things worth being explicit about so this isn’t overclaimed:
This is correlation, not causation. Even a strong correlation here wouldn’t tell you whether distance education helps, hurts, or has no effect — districts adopting distance ed at scale differ from non-adopters in many ways beyond their technology choices.
State-level aggregates hide everything interesting underneath. A state with 100,000 distance-ed students could be 100 districts with 1,000 each, or 5 districts with 20,000 each. Whether the program “worked” is a district-level question that disappears in the rollup.
2024 is one year. Distance education adoption surged 2020–2022 and has been receding since. A single-year snapshot can’t show that arc. Useful follow-up: trend fact_distance_education_adoption against NAEP year-over-year.
Distance-ed enrollment is not the same as technology access. A district can have universal device + broadband access and zero distance-ed enrollment. CRDC has fields closer to “access” (1:1 device programs, broadband infrastructure) that I haven’t pulled into the warehouse yet — that’s the natural next layer.
What this analysis does establish is the pipeline plumbing: every CRDC district can be joined to its state, its enrollment, its distance-ed participation, and its state’s NAEP scores in one SQL query. Once technology-access fields are added, the same join shape extends to answer the more specific question.
Final Technology Impact Analysis Conclusion
This project built an end-to-end data engineering and analytics workflow to ask a focused question: does technology-enabled learning participation in U.S. public schools associate with measurable differences in academic performance? The short answer, from the data integrated here, is no — not at the state aggregate level, and not in a way that survives basic statistical scrutiny.
The longer answer is that the question is harder than it looks. The most informative outcome of this project isn’t the correlation itself; it’s the warehouse and the pipeline that make the next, harder version of the question tractable. Once district-level technology access measures (1:1 device programs, broadband, instructional technology investment) are layered into the same fact-table structure, the same join shape — district → state → year → outcome — can answer a question that state-level distance-ed enrollment can’t.
What this iteration delivered:
Four federal datasets (CCD operational data, NAEP Grade 8 math, CRDC district characteristics, CRDC distance education) extracted, cleaned, and loaded into PostgreSQL
A dimensional warehouse (dim_states, dim_districts, dim_years, plus four fact tables) that supports state-level, district-level, and longitudinal analysis
A documented data-quality fix for the CRDC -9 sentinel-code issue — 65% of districts in the distance-ed file had suppressed values that produced negative state totals in earlier iterations; the corrected aggregation drops California’s total from understated-negative territory to ~98K students and Texas to ~103K
Honest statistical reporting — Spearman correlations with p-values rather than eyeballed bar charts, and scatter plots that don’t hide the underlying distribution
A reproducible report rendered with Quarto that ties code, results, and interpretation together in one place
What this iteration didn’t deliver, and which I’d want to add next:
District-level technology access measures from CRDC, not just distance-ed participation
Year-over-year trend modeling for the post-pandemic period using fact_distance_education_adoption over time
Tests on the ETL transforms — particularly around the sentinel-code handling, since that’s the bug class that bit me here
The skills exercised across the workflow: dimensional warehouse design in PostgreSQL, SQL analytics (CTEs, conditional aggregation, multi-source joins), data quality investigation and remediation, statistical reporting with appropriate caveats, and reproducible reporting with Quarto.
Live Tableau Dashboards
The warehouse and analysis in this report also power three interactive dashboards published on Tableau Public. They’re built on top of the same fact and dimension tables — the static charts here are a snapshot; the dashboards let you filter, hover, and explore the data yourself.