we have an EJ story! and a good one!

This commit is contained in:
2025-07-10 13:22:35 -07:00
parent da4358728d
commit afff5eebd4
27 changed files with 7398 additions and 190 deletions


@@ -1 +0,0 @@
David Adams,dadams,thinkingdead,04.07.2025 23:33,file:///home/dadams/.config/libreoffice/4;


@@ -1,73 +0,0 @@
{
"summary_statistics": {
"total_incidents": 16890,
"date_range": "1994-11-14 to 2024-06-15",
"counties_affected": 33,
"operators_involved": 296
},
"demographic_statistics": {
"total_spills": 16890,
"avg_median_income": 79281.58957963291,
"avg_poverty_rate": 10.344773143016967,
"avg_white_percentage": 83.5093530389343,
"avg_hispanic_percentage": 22.542174310346685,
"avg_unemployment": 2.652711938767639
},
"environmental_justice_analysis": {
"high_poverty_spills": 3497,
"high_poverty_avg_volume": 0.0,
"minority_community_spills": 1047,
"spills_by_income_quartile": {
"Q1(Lowest)": 5244,
"Q2": 3814,
"Q3": 4170,
"Q4(Highest)": 3662
},
"major_spills_by_poverty": {
"high_poverty_major": 1289,
"low_poverty_major": 3599
}
},
"root_cause_analysis": {
"cause_counts": {
"human_error": 684.0,
"equipment_failure": 2023.0,
"historical_unknown": 805.0,
"other": 175.0
},
"top_root_causes": {
"Historical impacts were discovered during flowline decommissioning activities.": 204,
"Historical impacts were discovered during tank battery decommissioning activities.": 187,
"Historical impacts were discovered during wellhead cut and cap activities.": 160,
"Historically impacted soils were discovered following cut and cap operations at the wellhead.": 61,
"Unknown": 60,
"Historical impacts were discovered following cut and cap operations at the wellhead.": 56,
"Historically impacted soils were discovered following facility decommissioning operations at the facility.": 34,
"Historical impacts were discovered during tank battery dismantlement.": 30,
"A root cause cannot be determined since this release is considered historical.": 27,
"Historical impacts were discovered following facility decommissioning operations at the facility.": 21
}
},
"demographic_patterns": {
"spills_by_income": {
"Low Income": 11888,
"Middle Income": 4255,
"High Income": 747
},
"spills_by_poverty": {
"Low Poverty": 9668,
"Moderate Poverty": 4181,
"High Poverty": 2882
},
"spills_by_race": {
"Majority White": 15839,
"Minority Community": 1051
},
"volume_by_demographics": {
"high_poverty_major_spills": 1289,
"minority_major_spills": 314
}
},
"llm_theme_analysis": " Title: Regulatory Summary for Equipment Maintenance, Operational Improvements, and Environmental Protection in Oil and Gas Operations\n\n1. Equipment Failure Patterns:\n - Gasket failures (Check valves, wellheads)\n - Ball valve failures (Wellheads, tanks)\n - Needle valve failures (Wellheads, tanks)\n - Frozen valves (Wellheads, tanks)\n - Transfer hose ruptures (Water haulers)\n\n2. Most Common Operational Issues:\n - Inadequate maintenance and inspection of equipment parts\n - Poor weather conditions affecting valve functionality\n - Human error during operation and maintenance activities\n - Lack of proper training for operators\n - Insufficient response time in detecting and addressing leaks or spills\n\n3. Environmental Risk Factors:\n - Contamination of soil and groundwater from spills or leaks\n - Impact on local ecosystems due to oil and water release\n - Potential harm to wildlife and other flora and fauna\n - Increased greenhouse gas emissions as a result of operational inefficiencies\n\n4. Human Factor Patterns:\n - Lack of awareness and adherence to safety protocols\n - Insufficient communication and coordination among team members\n - Inadequate supervision and oversight during critical tasks\n - Worker fatigue or distraction leading to errors\n - Limited access to proper tools, resources, and equipment for maintenance and repairs\n\n5. Recommendations for Prevention:\n - Implement regular equipment inspections and maintenance schedules\n - Train operators on proper operation, maintenance, and emergency response procedures\n - Ensure that equipment is winterized or protected against harsh weather conditions\n - Develop clear communication protocols among team members and with third parties\n - Provide adequate resources, tools, and safety equipment to workers for safe and efficient operations.",
"llm_environmental_justice": " Environmental Justice Assessment:\n\n1. Vulnerable Communities and Severe Incidents:\n From the provided data, it appears that there is a higher concentration of oil and gas facilities in the areas designated as \"minority communities\" or near historically impacted sites. This suggests that these communities may indeed face more severe incidents due to the proximity of these facilities. For example, the Small Eyed 14C-35HZ well and Carter Keith A UN 2 O SA production facility are located in areas designated as \"minority communities\" and have reported incidents. However, it is essential to note that this analysis is based on a small dataset and may not fully represent the broader picture. Further research would be necessary to confirm this trend and understand its underlying causes.\n\n2. Quality of Response and Remediation:\n The response time for reporting incidents seems generally prompt in most cases, with remedial actions such as soil sampling and cleanup following shortly after. However, it is not clear from the provided data whether the quality of these responses varies between majority and minority communities. It would be beneficial to investigate this further, perhaps by comparing incident response times and remediation outcomes across different community types.\n\n3. 
Policy Recommendations for Equitable Environmental Protection:\n To ensure equitable environmental protection for all communities, policy recommendations could include:\n\n a) Strengthening the enforcement of regulations governing oil and gas facilities in vulnerable communities to minimize potential incidents.\n\n b) Increasing community engagement and education on their rights, risks, and responsibilities related to oil and gas operations near their neighborhoods.\n\n c) Providing resources for independent environmental monitoring in these communities to facilitate early detection of incidents and improved response times.\n\n d) Prioritizing the development of green infrastructure and renewable energy projects in historically impacted areas as a means of transitioning away from fossil fuel reliance and reducing exposure to associated risks.\n\n e) Establishing funding mechanisms specifically designed to support environmental cleanup efforts in vulnerable communities affected by historical oil and gas operations.\n\n f) Implementing stricter penalties for companies found guilty of environmental violations, particularly those occurring in areas where vulnerable populations reside."
}
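As a quick consistency check on the quartile counts in `spills_by_income_quartile`, the chi-square statistic against a uniform split can be recomputed by hand. This is a stdlib-only sketch using only numbers from the JSON above; the 361.694 value it reproduces appears in the analysis log elsewhere in this commit:

```python
# Recompute the income-quartile chi-square by hand (pure stdlib).
# Observed counts are from the summary JSON; expected assumes a
# uniform split of 16890 spills across four quartiles (4222.5 each).
observed = {"Q1(Lowest)": 5244, "Q2": 3814, "Q3": 4170, "Q4(Highest)": 3662}
total = sum(observed.values())  # 16890
expected = total / 4            # 4222.5

chi2 = sum((o - expected) ** 2 / expected for o in observed.values())
print(f"chi-square = {chi2:.3f}")  # matches the 361.694 in the run log
```

The lowest-income quartile carries 5244 of 16890 spills (31.0% against an expected 25%), which is what drives most of the statistic.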

Binary file not shown (image, 1.0 MiB).

File diff suppressed because it is too large.


@@ -1,450 +0,0 @@
import pandas as pd
import geopandas as gpd
import numpy as np
from scipy import stats
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import DBSCAN
import esda
from libpysal.weights import Queen, KNN
from splot.esda import moran_scatterplot, lisa_cluster
import requests
import json
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.formula.api import ols
import contextily as ctx
import warnings

warnings.filterwarnings('ignore')


def query_ollama(prompt, model="mistral"):
    """Send query to local Ollama instance"""
    try:
        response = requests.post(
            'http://localhost:11434/api/generate',
            json={'model': model, 'prompt': prompt, 'stream': False}
        )
        return response.json()['response']
    except Exception as e:
        print(f"Error querying Ollama: {e}")
        return None
def statistical_disparity_tests(df):
    """Perform statistical tests for environmental justice disparities"""
    print("STATISTICAL SIGNIFICANCE TESTS")
    print("=" * 50)

    # 1. Income Quartile Analysis
    income_quartiles = pd.qcut(df['median_household_income'], 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
    spill_counts = df.groupby(income_quartiles).size()

    # Chi-square test for income distribution
    expected_per_quartile = len(df) / 4
    chi2_income, p_income = stats.chisquare(spill_counts, f_exp=[expected_per_quartile] * 4)
    print("Income Distribution Test:")
    print(f" Chi-square statistic: {chi2_income:.3f}")
    print(f" p-value: {p_income:.6f}")
    print(f" Significant disparity: {'YES' if p_income < 0.001 else 'NO'}")

    # 2. Poverty Rate Analysis
    high_poverty = df['percent_poverty'] > 15
    high_poverty_spills = high_poverty.sum()
    total_spills = len(df)
    # Assuming 20% of census tracts are high poverty (national average)
    expected_high_poverty = 0.20 * total_spills
    print("\nPoverty Analysis:")
    print(f" High-poverty spills: {high_poverty_spills}")
    print(f" Expected (if random): {expected_high_poverty:.0f}")
    print(f" Ratio: {high_poverty_spills / expected_high_poverty:.2f}x")

    # Binomial test
    poverty_test = stats.binomtest(high_poverty_spills, total_spills, 0.20, alternative='greater')
    poverty_p = poverty_test.pvalue
    print(f" Binomial test p-value: {poverty_p:.6f}")
    print(f" Significant over-representation: {'YES' if poverty_p < 0.001 else 'NO'}")

    # 3. Major Spills Analysis
    major_spills = df['More than five barrels spilled'].astype(str) == 'Y'
    # Test if major spills disproportionately affect high-poverty areas
    high_pov_major = df[high_poverty & major_spills].shape[0]
    high_pov_total = high_poverty.sum()
    low_pov_major = df[~high_poverty & major_spills].shape[0]
    low_pov_total = (~high_poverty).sum()

    # Two-proportion z-test
    counts = np.array([high_pov_major, low_pov_major])
    nobs = np.array([high_pov_total, low_pov_total])
    z_stat, p_major = proportions_ztest(counts, nobs)
    print("\nMajor Spills in High-Poverty Areas:")
    print(f" High poverty major spill rate: {high_pov_major / high_pov_total:.3f}")
    print(f" Low poverty major spill rate: {low_pov_major / low_pov_total:.3f}")
    print(f" Z-statistic: {z_stat:.3f}")
    print(f" p-value: {p_major:.6f}")
    print(f" Significant difference: {'YES' if p_major < 0.05 else 'NO'}")

    # 4. Racial Demographics
    minority_communities = df['percent_white'] < 70
    minority_spills = minority_communities.sum()
    # Assuming 30% of areas are minority communities (rough US average)
    expected_minority = 0.30 * total_spills
    print("\nRacial Demographics Analysis:")
    print(f" Minority community spills: {minority_spills}")
    print(f" Expected (if random): {expected_minority:.0f}")
    print(f" Ratio: {minority_spills / expected_minority:.2f}x")
    minority_test = stats.binomtest(minority_spills, total_spills, 0.30, alternative='greater')
    minority_p = minority_test.pvalue
    print(f" Binomial test p-value: {minority_p:.6f}")
    print(f" Significant over-representation: {'YES' if minority_p < 0.05 else 'NO'}")

    results = {
        'income_chi2': {'statistic': chi2_income, 'p_value': p_income},
        'poverty_binomial': {'p_value': poverty_p, 'observed_ratio': high_poverty_spills / expected_high_poverty},
        'major_spills_ztest': {'z_statistic': z_stat, 'p_value': p_major},
        'minority_binomial': {'p_value': minority_p, 'observed_ratio': minority_spills / expected_minority}
    }
    return results
def spatial_analysis(df):
    """Perform spatial analysis of spill patterns"""
    print("\nSPATIAL ANALYSIS")
    print("=" * 50)

    # Create GeoDataFrame
    gdf = gpd.GeoDataFrame(
        df,
        geometry=gpd.points_from_xy(df['Longitude'], df['Latitude']),
        crs='EPSG:4326'
    )
    # Project to a metric CRS (Web Mercator) for approximate distance
    # calculations; a Colorado State Plane CRS would be more accurate
    gdf_proj = gdf.to_crs('EPSG:3857')

    # 1. Spatial Clustering Analysis (DBSCAN)
    coords = np.column_stack([gdf_proj.geometry.x, gdf_proj.geometry.y])
    # DBSCAN clustering directly on projected coordinates (meters);
    # eps is approximately 1 km
    eps = 1000
    min_samples = 10
    dbscan = DBSCAN(eps=eps, min_samples=min_samples)
    clusters = dbscan.fit_predict(coords)
    gdf['cluster'] = clusters
    n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
    n_noise = list(clusters).count(-1)
    print("Spatial Clustering Results:")
    print(f" Number of clusters: {n_clusters}")
    print(f" Number of noise points: {n_noise}")
    print(f" Clustered points: {len(gdf) - n_noise}")

    # 2. Moran's I for spatial autocorrelation
    if len(gdf) > 100:  # Only if we have enough points
        # Remove any rows with missing values for spatial analysis
        gdf_spatial = gdf.dropna(subset=['percent_poverty', 'median_household_income'])
        if len(gdf_spatial) > 100:
            # Create spatial weights (K-nearest neighbors)
            coords_array = np.column_stack([gdf_spatial.geometry.x, gdf_spatial.geometry.y])
            w = KNN.from_array(coords_array, k=min(8, len(gdf_spatial) - 1))
            w.transform = 'r'  # Row standardization
            # Test spatial autocorrelation of poverty rates
            try:
                moran_poverty = esda.Moran(gdf_spatial['percent_poverty'], w)
                print("\nSpatial Autocorrelation (Moran's I):")
                print(f" Poverty rate Moran's I: {moran_poverty.I:.4f}")
                print(f" p-value: {moran_poverty.p_sim:.4f}")
                print(f" Significant clustering: {'YES' if moran_poverty.p_sim < 0.05 else 'NO'}")

                # Test for income
                moran_income = esda.Moran(gdf_spatial['median_household_income'], w)
                print(f" Income Moran's I: {moran_income.I:.4f}")
                print(f" p-value: {moran_income.p_sim:.4f}")

                # LISA analysis for local clusters
                lisa_poverty = esda.Moran_Local(gdf_spatial['percent_poverty'], w)
                # Count significant LISA clusters
                significant_clusters = np.sum(lisa_poverty.p_sim < 0.05)
                print(f" Significant local poverty clusters: {significant_clusters}")
            except Exception as e:
                print(f" Spatial autocorrelation analysis failed: {e}")
        else:
            print(f" Insufficient valid spatial data: {len(gdf_spatial)} points")

    # 3. Hotspot Analysis: create grid and count spills per cell
    xmin, ymin, xmax, ymax = gdf_proj.total_bounds
    # Create 5km x 5km grid
    grid_size = 5000  # 5 km in meters
    x_coords = np.arange(xmin, xmax + grid_size, grid_size)
    y_coords = np.arange(ymin, ymax + grid_size, grid_size)
    spill_density = calculate_spill_density(gdf_proj, x_coords, y_coords, grid_size)
    print("\nHotspot Analysis:")
    print(f" Grid cells created: {len(spill_density)}")
    if len(spill_density) > 0:
        print(f" Max spills per 5km cell: {spill_density['spill_count'].max()}")
        print(f" Mean spills per cell: {spill_density['spill_count'].mean():.2f}")
    else:
        print(" No grid cells with spills found")
    return gdf, spill_density, n_clusters
def calculate_spill_density(gdf_proj, x_coords, y_coords, grid_size):
    """Calculate spill density on a grid"""
    density_data = []
    for x in x_coords[:-1]:
        for y in y_coords[:-1]:
            # Define grid cell bounds
            cell_bounds = (x, y, x + grid_size, y + grid_size)
            # Count spills in this cell
            mask = (
                (gdf_proj.geometry.x >= cell_bounds[0]) &
                (gdf_proj.geometry.x < cell_bounds[2]) &
                (gdf_proj.geometry.y >= cell_bounds[1]) &
                (gdf_proj.geometry.y < cell_bounds[3])
            )
            spills_in_cell = gdf_proj[mask]
            if len(spills_in_cell) > 0:
                density_data.append({
                    'grid_x': x + grid_size / 2,
                    'grid_y': y + grid_size / 2,
                    'spill_count': len(spills_in_cell),
                    'avg_poverty': spills_in_cell['percent_poverty'].mean(),
                    'avg_income': spills_in_cell['median_household_income'].mean(),
                    'major_spills': (spills_in_cell['More than five barrels spilled'].astype(str) == 'Y').sum()
                })
    return pd.DataFrame(density_data)
def spatial_regression_analysis(gdf):
    """Perform spatial regression to control for location effects"""
    print("\nSPATIAL REGRESSION ANALYSIS")
    print("=" * 50)

    # Create variables for regression
    gdf_reg = gdf.copy()
    gdf_reg['major_spill'] = (gdf_reg['More than five barrels spilled'].astype(str) == 'Y').astype(int)
    gdf_reg['high_poverty'] = (gdf_reg['percent_poverty'] > 15).astype(int)
    gdf_reg['minority_community'] = (gdf_reg['percent_white'] < 70).astype(int)

    # Add spatial controls (distance to urban centers, etc.);
    # for now, use normalized lat/lon as proxies for spatial effects
    gdf_reg['lat_norm'] = (gdf_reg['Latitude'] - gdf_reg['Latitude'].mean()) / gdf_reg['Latitude'].std()
    gdf_reg['lon_norm'] = (gdf_reg['Longitude'] - gdf_reg['Longitude'].mean()) / gdf_reg['Longitude'].std()

    # OLS regression: Major spill probability ~ demographics + spatial controls
    model_formula = 'major_spill ~ percent_poverty + percent_white + median_household_income + lat_norm + lon_norm'
    try:
        model = ols(model_formula, data=gdf_reg).fit()
        print("Regression Results (Major Spill Probability):")
        print(f" R-squared: {model.rsquared:.4f}")
        print(f" F-statistic p-value: {model.f_pvalue:.6f}")

        # Key coefficients
        coef_poverty = model.params.get('percent_poverty', 0)
        pval_poverty = model.pvalues.get('percent_poverty', 1)
        coef_white = model.params.get('percent_white', 0)
        pval_white = model.pvalues.get('percent_white', 1)
        coef_income = model.params.get('median_household_income', 0)
        pval_income = model.pvalues.get('median_household_income', 1)
        print("\nKey Findings:")
        print(f" Poverty rate coefficient: {coef_poverty:.6f} (p={pval_poverty:.4f})")
        print(f" White percentage coefficient: {coef_white:.6f} (p={pval_white:.4f})")
        print(f" Income coefficient: {coef_income:.8f} (p={pval_income:.4f})")
        return model
    except Exception as e:
        print(f"Regression analysis failed: {e}")
        return None
def generate_spatial_statistical_report(stats_results, spatial_results, model_results):
    """Generate comprehensive report using LLM"""
    summary_text = f"""
STATISTICAL AND SPATIAL ANALYSIS SUMMARY:
STATISTICAL SIGNIFICANCE TESTS:
- Income distribution chi-square p-value: {stats_results['income_chi2']['p_value']:.6f}
- Poverty over-representation ratio: {stats_results['poverty_binomial']['observed_ratio']:.2f}x
- Poverty binomial test p-value: {stats_results['poverty_binomial']['p_value']:.6f}
- Major spills z-test p-value: {stats_results['major_spills_ztest']['p_value']:.6f}
- Minority community ratio: {stats_results['minority_binomial']['observed_ratio']:.2f}x
SPATIAL ANALYSIS:
- Number of spatial clusters identified: {spatial_results['n_clusters']}
- Spatial autocorrelation detected in poverty patterns
- Hotspots identified with up to {spatial_results.get('max_density', 'N/A')} spills per 5km grid
REGRESSION FINDINGS:
- Spatial controls included to account for facility locations
- Multiple demographic variables tested simultaneously
- Results control for geographic clustering effects
"""
    prompt = f"""
Based on this comprehensive statistical and spatial analysis of oil and gas spills, provide an academic-level interpretation of the environmental justice implications.
Analysis Results:
{summary_text}
Focus on:
1. Statistical significance of demographic disparities
2. Spatial clustering patterns and their implications
3. Whether disparities persist after controlling for spatial effects
4. Methodological strengths and limitations
5. Policy implications for environmental justice
6. Recommendations for further research
Format as a rigorous academic discussion suitable for a public policy journal, emphasizing both statistical rigor and practical policy relevance.
"""
    return query_ollama(prompt)
def create_visualizations(gdf, spill_density):
    """Create key visualizations"""
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))

    # 1. Spill locations by poverty rate
    ax1 = axes[0, 0]
    scatter = ax1.scatter(gdf['Longitude'], gdf['Latitude'],
                          c=gdf['percent_poverty'], cmap='Reds',
                          alpha=0.6, s=10)
    ax1.set_title('Spill Locations by Poverty Rate')
    ax1.set_xlabel('Longitude')
    ax1.set_ylabel('Latitude')
    plt.colorbar(scatter, ax=ax1, label='Poverty Rate (%)')

    # 2. Income distribution
    ax2 = axes[0, 1]
    income_quartiles = pd.qcut(gdf['median_household_income'], 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
    income_counts = gdf.groupby(income_quartiles).size()
    ax2.bar(income_counts.index, income_counts.values)
    ax2.set_title('Spills by Income Quartile')
    ax2.set_xlabel('Income Quartile')
    ax2.set_ylabel('Number of Spills')

    # 3. Major spills by demographics
    ax3 = axes[1, 0]
    demo_data = pd.DataFrame({
        'High Poverty': [
            len(gdf[(gdf['percent_poverty'] > 15) & (gdf['More than five barrels spilled'].astype(str) == 'Y')]),
            len(gdf[(gdf['percent_poverty'] > 15) & (gdf['More than five barrels spilled'].astype(str) != 'Y')])
        ],
        'Low Poverty': [
            len(gdf[(gdf['percent_poverty'] <= 15) & (gdf['More than five barrels spilled'].astype(str) == 'Y')]),
            len(gdf[(gdf['percent_poverty'] <= 15) & (gdf['More than five barrels spilled'].astype(str) != 'Y')])
        ]
    }, index=['Major Spills', 'Minor Spills'])
    demo_data.plot(kind='bar', ax=ax3, stacked=True)
    ax3.set_title('Spill Severity by Poverty Level')
    ax3.set_xlabel('Spill Type')
    ax3.set_ylabel('Count')
    ax3.legend(title='Community Type')

    # 4. Spatial density
    ax4 = axes[1, 1]
    if len(spill_density) > 0:
        scatter2 = ax4.scatter(spill_density['grid_x'], spill_density['grid_y'],
                               c=spill_density['spill_count'], cmap='YlOrRd',
                               s=spill_density['spill_count'] * 10, alpha=0.7)
        ax4.set_title('Spill Density Hotspots (5km Grid)')
        ax4.set_xlabel('X Coordinate (Projected)')
        ax4.set_ylabel('Y Coordinate (Projected)')
        plt.colorbar(scatter2, ax=ax4, label='Spills per Cell')
    plt.tight_layout()
    plt.savefig('environmental_justice_analysis.png', dpi=300, bbox_inches='tight')
    plt.show()
# Main execution
def run_comprehensive_analysis(csv_file):
    """Run complete statistical and spatial analysis"""
    print("COMPREHENSIVE STATISTICAL & SPATIAL ENVIRONMENTAL JUSTICE ANALYSIS")
    print("=" * 80)

    # Load data
    df = pd.read_csv(csv_file)
    print(f"Loaded {len(df)} spill incidents")

    # Statistical analysis
    stats_results = statistical_disparity_tests(df)

    # Spatial analysis
    gdf, spill_density, n_clusters = spatial_analysis(df)

    # Spatial regression
    model = spatial_regression_analysis(gdf)

    # Create visualizations
    create_visualizations(gdf, spill_density)

    # Generate comprehensive report
    spatial_results = {'n_clusters': n_clusters}
    if len(spill_density) > 0:
        spatial_results['max_density'] = spill_density['spill_count'].max()
    model_summary = str(model.summary()) if model else "Regression analysis not available"
    report = generate_spatial_statistical_report(stats_results, spatial_results, model_summary)

    # Save results (query_ollama returns None on failure, so guard the report)
    results = {
        'statistical_tests': stats_results,
        'spatial_analysis': spatial_results,
        'regression_summary': model_summary,
        'academic_interpretation': report
    }
    with open('statistical_spatial_analysis.json', 'w') as f:
        json.dump(results, f, indent=2, default=str)
    with open('academic_report.txt', 'w') as f:
        f.write(report or "LLM report unavailable")

    print("\nAnalysis complete. Results saved to:")
    print(" - statistical_spatial_analysis.json")
    print(" - academic_report.txt")
    print(" - environmental_justice_analysis.png")
    return results


if __name__ == "__main__":
    results = run_comprehensive_analysis('spills_with_demographics.csv')
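The script's exact binomial test for high-poverty over-representation can be approximated without SciPy. A stdlib-only sketch using the counts from the run log (3497 high-poverty spills of 16890, null rate 20%); the normal approximation lands close to the exact p = 0.011556 the script reports:

```python
import math

# Normal approximation to the poverty binomial test (scipy-free sketch).
n, k, p0 = 16890, 3497, 0.20           # total spills, high-poverty spills, null rate

mu = n * p0                             # 3378 expected under the null
sigma = math.sqrt(n * p0 * (1 - p0))
z = (k - 0.5 - mu) / sigma              # continuity-corrected one-sided z
p_approx = 0.5 * math.erfc(z / math.sqrt(2))

ratio = k / mu                          # the "1.04x" ratio in the log
print(f"ratio = {ratio:.2f}x, z = {z:.2f}, p = {p_approx:.4f}")
```

At the script's 0.001 threshold this is correctly reported as not significant, even though p falls below the conventional 0.05.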


@@ -1,66 +0,0 @@
COMPREHENSIVE STATISTICAL & SPATIAL ENVIRONMENTAL JUSTICE ANALYSIS
================================================================================
Loaded 16890 spill incidents
STATISTICAL SIGNIFICANCE TESTS
==================================================
Income Distribution Test:
Chi-square statistic: 361.694
p-value: 0.000000
Significant disparity: YES
Poverty Analysis:
High-poverty spills: 3497
Expected (if random): 3378
Ratio: 1.04x
Binomial test p-value: 0.011556
Significant over-representation: NO
Major Spills in High-Poverty Areas:
High poverty major spill rate: 0.369
Low poverty major spill rate: 0.269
Z-statistic: 11.598
p-value: 0.000000
Significant difference: YES
Racial Demographics Analysis:
Minority community spills: 1047
Expected (if random): 5067
Ratio: 0.21x
Binomial test p-value: 1.000000
Significant over-representation: NO
SPATIAL ANALYSIS
==================================================
Spatial Clustering Results:
Number of clusters: 259
Number of noise points: 4749
Clustered points: 12141
Spatial Autocorrelation (Moran's I):
Poverty rate Moran's I: 0.9714
p-value: 0.0010
Significant clustering: YES
Income Moran's I: 0.9585
p-value: 0.0010
Significant local poverty clusters: 9209
Hotspot Analysis:
Grid cells created: 1189
Max spills per 5km cell: 119
Mean spills per cell: 14.21
SPATIAL REGRESSION ANALYSIS
==================================================
Regression Results (Major Spill Probability):
R-squared: 0.0547
F-statistic p-value: 0.000000
Key Findings:
Poverty rate coefficient: 0.009572 (p=0.0000)
White percentage coefficient: 0.004621 (p=0.0000)
Income coefficient: -0.00000098 (p=0.0000)
Analysis complete. Results saved to:
- statistical_spatial_analysis.json
- academic_report.txt
- environmental_justice_analysis.png
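The Z-statistic of 11.598 in the log above follows from the standard pooled two-proportion formula. A stdlib-only sketch, using only counts that appear in this commit (1289 of 3497 high-poverty spills were major, 3599 of the remaining 13393):

```python
import math

# Reproduce the two-proportion z-test from the run log
# using the pooled-variance formula (stdlib only).
major = [1289, 3599]                   # major spills: high-poverty, low-poverty areas
totals = [3497, 13393]                 # all spills in each group (16890 total)

p1, p2 = major[0] / totals[0], major[1] / totals[1]
p_pool = sum(major) / sum(totals)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / totals[0] + 1 / totals[1]))
z = (p1 - p2) / se
print(f"rates: {p1:.3f} vs {p2:.3f}, z = {z:.3f}")  # 0.369 vs 0.269, z = 11.598
```

This is the same pooled statistic `statsmodels.stats.proportion.proportions_ztest` computes in the script, so the hand calculation matches the logged value.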


@@ -1,307 +0,0 @@
import pandas as pd
import requests
import json
from collections import Counter, defaultdict
import numpy as np


def query_ollama(prompt, model="mistral"):
    """Send query to local Ollama instance"""
    try:
        response = requests.post(
            'http://localhost:11434/api/generate',
            json={'model': model, 'prompt': prompt, 'stream': False}
        )
        return response.json()['response']
    except Exception as e:
        print(f"Error querying Ollama: {e}")
        return None
def analyze_spill_demographics(df):
    """Analyze demographic patterns in spill data"""
    # Basic demographic statistics
    demo_stats = {
        'total_spills': len(df),
        'avg_median_income': df['median_household_income'].mean(),
        'avg_poverty_rate': df['percent_poverty'].mean(),
        'avg_white_percentage': df['percent_white'].mean(),
        'avg_hispanic_percentage': df['percent_hispanic'].mean(),
        'avg_unemployment': df['unemployment_rate'].mean()
    }

    # Environmental justice analysis
    # Define high-poverty communities (>15% poverty rate)
    high_poverty = df[df['percent_poverty'] > 15]
    low_poverty = df[df['percent_poverty'] <= 15]
    # Define minority communities (>30% non-white)
    minority_communities = df[df['percent_white'] < 70]

    # Convert spill volumes to numeric; 'Unknown' values become NaN
    high_poverty_volumes = pd.to_numeric(high_poverty['Produced Water Spill Volume'], errors='coerce')

    ej_analysis = {
        'high_poverty_spills': len(high_poverty),
        # Note: despite the key name, this sums (not averages) the numeric
        # volumes; an all-NaN column sums to 0.0, as the saved JSON shows
        'high_poverty_avg_volume': high_poverty_volumes.sum(),
        'minority_community_spills': len(minority_communities),
        'spills_by_income_quartile': df.groupby(
            pd.qcut(df['median_household_income'], 4, labels=['Q1(Lowest)', 'Q2', 'Q3', 'Q4(Highest)'])
        ).size().to_dict(),
        'major_spills_by_poverty': {
            'high_poverty_major': len(high_poverty[high_poverty['More than five barrels spilled'] == 'Y']),
            'low_poverty_major': len(low_poverty[low_poverty['More than five barrels spilled'] == 'Y'])
        }
    }
    return demo_stats, ej_analysis
def analyze_root_causes(df):
    """Analyze already-categorized root causes"""
    # Count existing cause categories, handling NaN values
    cause_counts = {
        'human_error': df['Human Error'].fillna(0).sum(),
        'equipment_failure': df['Equipment Failure'].fillna(0).sum(),
        'historical_unknown': df['Historical Unkown'].fillna(0).sum(),  # Note: typo in original data
        'other': df['Other'].fillna(0).sum()
    }
    # Get specific root cause descriptions
    root_causes = df['Root Cause'].dropna().value_counts().head(10)
    return cause_counts, root_causes


def analyze_spill_themes_llm(df, sample_size=50):
    """Use LLM to analyze themes in spill descriptions"""
    # Sample descriptions for LLM analysis (to avoid overwhelming it)
    descriptions_series = df['Spill Description'].dropna()
    if len(descriptions_series) == 0:
        return "No spill descriptions available for analysis."
    sample_descriptions = descriptions_series.sample(min(sample_size, len(descriptions_series))).tolist()
    # Combine descriptions for batch analysis
    combined_text = "\n---\n".join(sample_descriptions)
    prompt = f"""
Analyze these oil and gas spill incident descriptions to identify themes and patterns.
Focus on:
1. Common equipment failures (tanks, valves, pipelines, etc.)
2. Operational issues (overflow, leaks, maintenance problems)
3. Environmental factors (weather, terrain, wildlife)
4. Human factors (operator error, maintenance issues)
5. Discovery methods (routine inspection, alarms, third-party reports)
6. Spill severity indicators
Incident descriptions:
{combined_text}
Provide a structured analysis with:
- Top 5 equipment failure patterns
- Most common operational issues
- Environmental risk factors
- Human factor patterns
- Recommendations for prevention based on these patterns
Format as a concise regulatory summary suitable for policy recommendations.
"""
    return query_ollama(prompt)
def demographic_spill_analysis(df):
    """Analyze spill patterns by demographic characteristics"""
    # Create demographic categories
    df_analysis = df.copy()
    df_analysis['income_category'] = pd.cut(df_analysis['median_household_income'],
                                            bins=3, labels=['Low Income', 'Middle Income', 'High Income'])
    df_analysis['poverty_category'] = pd.cut(df_analysis['percent_poverty'],
                                             bins=[0, 10, 20, 100],
                                             labels=['Low Poverty', 'Moderate Poverty', 'High Poverty'])
    df_analysis['race_category'] = df_analysis['percent_white'].apply(
        lambda x: 'Majority White' if x >= 70 else 'Minority Community'
    )

    # Analyze spill patterns by demographics
    demo_patterns = {
        'spills_by_income': df_analysis.groupby('income_category').size().to_dict(),
        'spills_by_poverty': df_analysis.groupby('poverty_category').size().to_dict(),
        'spills_by_race': df_analysis.groupby('race_category').size().to_dict(),
        'volume_by_demographics': {
            'high_poverty_major_spills': len(df_analysis[(df_analysis['percent_poverty'] > 15) &
                                                         (df_analysis['More than five barrels spilled'].astype(str) == 'Y')]),
            'minority_major_spills': len(df_analysis[(df_analysis['percent_white'] < 70) &
                                                     (df_analysis['More than five barrels spilled'].astype(str) == 'Y')])
        }
    }
    return demo_patterns
def analyze_environmental_justice(df, sample_descriptions=20):
    """Use LLM to analyze environmental justice implications"""
    # Get descriptions from high-poverty and minority communities
    high_poverty_desc = df[df['percent_poverty'] > 15]['Spill Description'].dropna()
    minority_desc = df[df['percent_white'] < 70]['Spill Description'].dropna()
    if len(high_poverty_desc) == 0 or len(minority_desc) == 0:
        return "Insufficient data for environmental justice analysis."
    high_poverty_spills = high_poverty_desc.sample(min(sample_descriptions // 2, len(high_poverty_desc))).tolist()
    minority_spills = minority_desc.sample(min(sample_descriptions // 2, len(minority_desc))).tolist()
    # Label each group once, rather than repeating the label as a join
    # separator (which tagged every gap instead of marking the two groups)
    combined_ej_text = (
        "---HIGH POVERTY AREA---\n" + "\n---\n".join(high_poverty_spills) +
        "\n---MINORITY COMMUNITY---\n" + "\n---\n".join(minority_spills)
    )
    prompt = f"""
Analyze these spill incidents from high-poverty and minority communities for environmental justice concerns.
Consider:
1. Severity of incidents in vulnerable communities
2. Response effectiveness and cleanup completion
3. Long-term environmental impacts
4. Patterns that might indicate disproportionate impacts
5. Regulatory compliance and enforcement patterns
Spill descriptions:
{combined_ej_text}
Provide an environmental justice assessment focusing on:
- Whether vulnerable communities face more severe incidents
- Quality of response and remediation
- Policy recommendations for equitable environmental protection
"""
    return query_ollama(prompt)
def comprehensive_spill_analysis(csv_file):
    """Run complete analysis of spill data"""
    print("Loading spill data...")
    df = pd.read_csv(csv_file)
    print(f"Analyzing {len(df)} spill incidents...")

    # Basic demographic analysis
    demo_stats, ej_analysis = analyze_spill_demographics(df)
    # Root cause analysis (using existing categorizations)
    cause_counts, root_causes = analyze_root_causes(df)
    # Demographic patterns
    demo_patterns = demographic_spill_analysis(df)

    # LLM-based theme analysis
    print("Running LLM analysis on spill descriptions...")
    theme_analysis = analyze_spill_themes_llm(df, sample_size=100)
    # Environmental justice analysis
    print("Analyzing environmental justice implications...")
    ej_llm_analysis = analyze_environmental_justice(df, sample_descriptions=30)

    # Compile comprehensive results
    results = {
        'summary_statistics': {
            'total_incidents': len(df),
            'date_range': f"{df['Date of Discovery'].min()} to {df['Date of Discovery'].max()}",
            'counties_affected': df['county'].nunique(),
            'operators_involved': df['Operator'].nunique()
        },
        'demographic_statistics': demo_stats,
        'environmental_justice_analysis': ej_analysis,
        'root_cause_analysis': {
            'cause_counts': cause_counts,
            'top_root_causes': root_causes.to_dict()
        },
        'demographic_patterns': demo_patterns,
        'llm_theme_analysis': theme_analysis,
        'llm_environmental_justice': ej_llm_analysis
    }
    return results
def generate_policy_report(results):
    """Generate policy-focused summary using LLM"""
    # Create summary for LLM to process
    summary_text = f"""
SPILL DATA ANALYSIS SUMMARY:
Total Incidents: {results['summary_statistics']['total_incidents']}
Date Range: {results['summary_statistics']['date_range']}
DEMOGRAPHIC PATTERNS:
- Average poverty rate in affected areas: {results['demographic_statistics']['avg_poverty_rate']:.1f}%
- Average income: ${results['demographic_statistics']['avg_median_income']:,.0f}
- Spills in high-poverty areas: {results['environmental_justice_analysis']['high_poverty_spills']}
- Spills in minority communities: {results['environmental_justice_analysis']['minority_community_spills']}
ROOT CAUSES:
- Equipment failures: {results['root_cause_analysis']['cause_counts']['equipment_failure']}
- Human error: {results['root_cause_analysis']['cause_counts']['human_error']}
- Historical/unknown: {results['root_cause_analysis']['cause_counts']['historical_unknown']}
THEME ANALYSIS:
{results['llm_theme_analysis']}
ENVIRONMENTAL JUSTICE ANALYSIS:
{results['llm_environmental_justice']}
"""
    policy_prompt = f"""
Based on this comprehensive spill data analysis, create a policy-focused executive summary.
Data Summary:
{summary_text}
Provide:
1. Key findings on environmental justice impacts
2. Priority areas for regulatory attention
3. Specific policy recommendations for prevention
4. Recommendations for equitable enforcement
5. Suggested regulatory changes based on patterns identified
Format as an executive summary suitable for regulatory decision-makers and policy researchers.
"""
    return query_ollama(policy_prompt)
# Execute comprehensive analysis
if __name__ == "__main__":
    # Run the analysis
    results = comprehensive_spill_analysis('spills_with_demographics.csv')

    # Generate policy report
    print("\nGenerating policy-focused summary...")
    policy_report = generate_policy_report(results)

    # Save all results (query_ollama returns None on failure, so guard the report)
    with open('comprehensive_spill_analysis.json', 'w') as f:
        json.dump(results, f, indent=2, default=str)
    with open('policy_executive_summary.txt', 'w') as f:
        f.write(policy_report or "LLM policy summary unavailable")

    # Print key findings
    print("\n" + "=" * 60)
    print("COMPREHENSIVE SPILL ANALYSIS COMPLETE")
    print("=" * 60)
    print(f"\nTotal incidents analyzed: {results['summary_statistics']['total_incidents']:,}")
    print(f"Counties affected: {results['summary_statistics']['counties_affected']}")
    print(f"Average poverty rate in affected areas: {results['demographic_statistics']['avg_poverty_rate']:.1f}%")
    print(f"Spills in high-poverty communities: {results['environmental_justice_analysis']['high_poverty_spills']:,}")
    print(f"Spills in minority communities: {results['environmental_justice_analysis']['minority_community_spills']:,}")
    print("\nRoot cause breakdown:")
    for cause, count in results['root_cause_analysis']['cause_counts'].items():
        print(f" {cause.replace('_', ' ').title()}: {count:,}")
    print("\nResults saved to:")
    print(" - comprehensive_spill_analysis.json (detailed data)")
    print(" - policy_executive_summary.txt (executive summary)")
    print("\nPolicy Summary Preview:")
    print("=" * 40)
    if policy_report:
        print(policy_report[:500] + "...")