we have an EJ story! and a good one!

This commit is contained in:
2025-07-10 13:22:35 -07:00
parent da4358728d
commit afff5eebd4
27 changed files with 7398 additions and 190 deletions


@@ -1 +0,0 @@
David Adams,dadams,thinkingdead,04.07.2025 23:33,file:///home/dadams/.config/libreoffice/4;


@@ -1,73 +0,0 @@
{
"summary_statistics": {
"total_incidents": 16890,
"date_range": "1994-11-14 to 2024-06-15",
"counties_affected": 33,
"operators_involved": 296
},
"demographic_statistics": {
"total_spills": 16890,
"avg_median_income": 79281.58957963291,
"avg_poverty_rate": 10.344773143016967,
"avg_white_percentage": 83.5093530389343,
"avg_hispanic_percentage": 22.542174310346685,
"avg_unemployment": 2.652711938767639
},
"environmental_justice_analysis": {
"high_poverty_spills": 3497,
"high_poverty_avg_volume": 0.0,
"minority_community_spills": 1047,
"spills_by_income_quartile": {
"Q1(Lowest)": 5244,
"Q2": 3814,
"Q3": 4170,
"Q4(Highest)": 3662
},
"major_spills_by_poverty": {
"high_poverty_major": 1289,
"low_poverty_major": 3599
}
},
"root_cause_analysis": {
"cause_counts": {
"human_error": 684.0,
"equipment_failure": 2023.0,
"historical_unknown": 805.0,
"other": 175.0
},
"top_root_causes": {
"Historical impacts were discovered during flowline decommissioning activities.": 204,
"Historical impacts were discovered during tank battery decommissioning activities.": 187,
"Historical impacts were discovered during wellhead cut and cap activities.": 160,
"Historically impacted soils were discovered following cut and cap operations at the wellhead.": 61,
"Unknown": 60,
"Historical impacts were discovered following cut and cap operations at the wellhead.": 56,
"Historically impacted soils were discovered following facility decommissioning operations at the facility.": 34,
"Historical impacts were discovered during tank battery dismantlement.": 30,
"A root cause cannot be determined since this release is considered historical.": 27,
"Historical impacts were discovered following facility decommissioning operations at the facility.": 21
}
},
"demographic_patterns": {
"spills_by_income": {
"Low Income": 11888,
"Middle Income": 4255,
"High Income": 747
},
"spills_by_poverty": {
"Low Poverty": 9668,
"Moderate Poverty": 4181,
"High Poverty": 2882
},
"spills_by_race": {
"Majority White": 15839,
"Minority Community": 1051
},
"volume_by_demographics": {
"high_poverty_major_spills": 1289,
"minority_major_spills": 314
}
},
"llm_theme_analysis": " Title: Regulatory Summary for Equipment Maintenance, Operational Improvements, and Environmental Protection in Oil and Gas Operations\n\n1. Equipment Failure Patterns:\n - Gasket failures (Check valves, wellheads)\n - Ball valve failures (Wellheads, tanks)\n - Needle valve failures (Wellheads, tanks)\n - Frozen valves (Wellheads, tanks)\n - Transfer hose ruptures (Water haulers)\n\n2. Most Common Operational Issues:\n - Inadequate maintenance and inspection of equipment parts\n - Poor weather conditions affecting valve functionality\n - Human error during operation and maintenance activities\n - Lack of proper training for operators\n - Insufficient response time in detecting and addressing leaks or spills\n\n3. Environmental Risk Factors:\n - Contamination of soil and groundwater from spills or leaks\n - Impact on local ecosystems due to oil and water release\n - Potential harm to wildlife and other flora and fauna\n - Increased greenhouse gas emissions as a result of operational inefficiencies\n\n4. Human Factor Patterns:\n - Lack of awareness and adherence to safety protocols\n - Insufficient communication and coordination among team members\n - Inadequate supervision and oversight during critical tasks\n - Worker fatigue or distraction leading to errors\n - Limited access to proper tools, resources, and equipment for maintenance and repairs\n\n5. Recommendations for Prevention:\n - Implement regular equipment inspections and maintenance schedules\n - Train operators on proper operation, maintenance, and emergency response procedures\n - Ensure that equipment is winterized or protected against harsh weather conditions\n - Develop clear communication protocols among team members and with third parties\n - Provide adequate resources, tools, and safety equipment to workers for safe and efficient operations.",
"llm_environmental_justice": " Environmental Justice Assessment:\n\n1. Vulnerable Communities and Severe Incidents:\n From the provided data, it appears that there is a higher concentration of oil and gas facilities in the areas designated as \"minority communities\" or near historically impacted sites. This suggests that these communities may indeed face more severe incidents due to the proximity of these facilities. For example, the Small Eyed 14C-35HZ well and Carter Keith A UN 2 O SA production facility are located in areas designated as \"minority communities\" and have reported incidents. However, it is essential to note that this analysis is based on a small dataset and may not fully represent the broader picture. Further research would be necessary to confirm this trend and understand its underlying causes.\n\n2. Quality of Response and Remediation:\n The response time for reporting incidents seems generally prompt in most cases, with remedial actions such as soil sampling and cleanup following shortly after. However, it is not clear from the provided data whether the quality of these responses varies between majority and minority communities. It would be beneficial to investigate this further, perhaps by comparing incident response times and remediation outcomes across different community types.\n\n3. 
Policy Recommendations for Equitable Environmental Protection:\n To ensure equitable environmental protection for all communities, policy recommendations could include:\n\n a) Strengthening the enforcement of regulations governing oil and gas facilities in vulnerable communities to minimize potential incidents.\n\n b) Increasing community engagement and education on their rights, risks, and responsibilities related to oil and gas operations near their neighborhoods.\n\n c) Providing resources for independent environmental monitoring in these communities to facilitate early detection of incidents and improved response times.\n\n d) Prioritizing the development of green infrastructure and renewable energy projects in historically impacted areas as a means of transitioning away from fossil fuel reliance and reducing exposure to associated risks.\n\n e) Establishing funding mechanisms specifically designed to support environmental cleanup efforts in vulnerable communities affected by historical oil and gas operations.\n\n f) Implementing stricter penalties for companies found guilty of environmental violations, particularly those occurring in areas where vulnerable populations reside."
}
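As a quick consistency check on the quartile counts in `spills_by_income_quartile`, the chi-square statistic against a uniform split can be recomputed by hand. This is a stdlib-only sketch using only numbers from the JSON above; the 361.694 value it reproduces appears in the analysis log elsewhere in this commit:

```python
# Recompute the income-quartile chi-square by hand (pure stdlib).
# Observed counts are from the summary JSON; expected assumes a
# uniform split of 16890 spills across four quartiles (4222.5 each).
observed = {"Q1(Lowest)": 5244, "Q2": 3814, "Q3": 4170, "Q4(Highest)": 3662}
total = sum(observed.values())  # 16890
expected = total / 4            # 4222.5

chi2 = sum((o - expected) ** 2 / expected for o in observed.values())
print(f"chi-square = {chi2:.3f}")  # matches the 361.694 in the run log
```

The lowest-income quartile carries 5244 of 16890 spills (31.0% against an expected 25%), which is what drives most of the statistic.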

Binary file not shown (image, 1.0 MiB).

File diff suppressed because it is too large.


@@ -1,450 +0,0 @@
import pandas as pd
import geopandas as gpd
import numpy as np
from scipy import stats
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import DBSCAN
import esda
from libpysal.weights import Queen, KNN
from splot.esda import moran_scatterplot, lisa_cluster
import requests
import json
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.formula.api import ols
import contextily as ctx
import warnings

warnings.filterwarnings('ignore')


def query_ollama(prompt, model="mistral"):
    """Send query to local Ollama instance"""
    try:
        response = requests.post(
            'http://localhost:11434/api/generate',
            json={'model': model, 'prompt': prompt, 'stream': False}
        )
        return response.json()['response']
    except Exception as e:
        print(f"Error querying Ollama: {e}")
        return None
def statistical_disparity_tests(df):
    """Perform statistical tests for environmental justice disparities"""
    print("STATISTICAL SIGNIFICANCE TESTS")
    print("=" * 50)

    # 1. Income Quartile Analysis
    income_quartiles = pd.qcut(df['median_household_income'], 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
    spill_counts = df.groupby(income_quartiles).size()

    # Chi-square test for income distribution
    expected_per_quartile = len(df) / 4
    chi2_income, p_income = stats.chisquare(spill_counts, f_exp=[expected_per_quartile] * 4)
    print("Income Distribution Test:")
    print(f" Chi-square statistic: {chi2_income:.3f}")
    print(f" p-value: {p_income:.6f}")
    print(f" Significant disparity: {'YES' if p_income < 0.001 else 'NO'}")

    # 2. Poverty Rate Analysis
    high_poverty = df['percent_poverty'] > 15
    high_poverty_spills = high_poverty.sum()
    total_spills = len(df)
    # Assuming 20% of census tracts are high poverty (national average)
    expected_high_poverty = 0.20 * total_spills
    print("\nPoverty Analysis:")
    print(f" High-poverty spills: {high_poverty_spills}")
    print(f" Expected (if random): {expected_high_poverty:.0f}")
    print(f" Ratio: {high_poverty_spills / expected_high_poverty:.2f}x")

    # Binomial test
    poverty_test = stats.binomtest(high_poverty_spills, total_spills, 0.20, alternative='greater')
    poverty_p = poverty_test.pvalue
    print(f" Binomial test p-value: {poverty_p:.6f}")
    print(f" Significant over-representation: {'YES' if poverty_p < 0.001 else 'NO'}")

    # 3. Major Spills Analysis
    major_spills = df['More than five barrels spilled'].astype(str) == 'Y'
    # Test if major spills disproportionately affect high-poverty areas
    high_pov_major = df[high_poverty & major_spills].shape[0]
    high_pov_total = high_poverty.sum()
    low_pov_major = df[~high_poverty & major_spills].shape[0]
    low_pov_total = (~high_poverty).sum()

    # Two-proportion z-test
    counts = np.array([high_pov_major, low_pov_major])
    nobs = np.array([high_pov_total, low_pov_total])
    z_stat, p_major = proportions_ztest(counts, nobs)
    print("\nMajor Spills in High-Poverty Areas:")
    print(f" High poverty major spill rate: {high_pov_major / high_pov_total:.3f}")
    print(f" Low poverty major spill rate: {low_pov_major / low_pov_total:.3f}")
    print(f" Z-statistic: {z_stat:.3f}")
    print(f" p-value: {p_major:.6f}")
    print(f" Significant difference: {'YES' if p_major < 0.05 else 'NO'}")

    # 4. Racial Demographics
    minority_communities = df['percent_white'] < 70
    minority_spills = minority_communities.sum()
    # Assuming 30% of areas are minority communities (rough US average)
    expected_minority = 0.30 * total_spills
    print("\nRacial Demographics Analysis:")
    print(f" Minority community spills: {minority_spills}")
    print(f" Expected (if random): {expected_minority:.0f}")
    print(f" Ratio: {minority_spills / expected_minority:.2f}x")
    minority_test = stats.binomtest(minority_spills, total_spills, 0.30, alternative='greater')
    minority_p = minority_test.pvalue
    print(f" Binomial test p-value: {minority_p:.6f}")
    print(f" Significant over-representation: {'YES' if minority_p < 0.05 else 'NO'}")

    results = {
        'income_chi2': {'statistic': chi2_income, 'p_value': p_income},
        'poverty_binomial': {'p_value': poverty_p, 'observed_ratio': high_poverty_spills / expected_high_poverty},
        'major_spills_ztest': {'z_statistic': z_stat, 'p_value': p_major},
        'minority_binomial': {'p_value': minority_p, 'observed_ratio': minority_spills / expected_minority}
    }
    return results
def spatial_analysis(df):
    """Perform spatial analysis of spill patterns"""
    print("\nSPATIAL ANALYSIS")
    print("=" * 50)

    # Create GeoDataFrame
    gdf = gpd.GeoDataFrame(
        df,
        geometry=gpd.points_from_xy(df['Longitude'], df['Latitude']),
        crs='EPSG:4326'
    )
    # Project to a metric CRS (Web Mercator) for approximate distance
    # calculations; a Colorado State Plane CRS would be more accurate
    gdf_proj = gdf.to_crs('EPSG:3857')

    # 1. Spatial Clustering Analysis (DBSCAN)
    coords = np.column_stack([gdf_proj.geometry.x, gdf_proj.geometry.y])
    # DBSCAN clustering directly on projected coordinates (meters);
    # eps is approximately 1 km
    eps = 1000
    min_samples = 10
    dbscan = DBSCAN(eps=eps, min_samples=min_samples)
    clusters = dbscan.fit_predict(coords)
    gdf['cluster'] = clusters
    n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
    n_noise = list(clusters).count(-1)
    print("Spatial Clustering Results:")
    print(f" Number of clusters: {n_clusters}")
    print(f" Number of noise points: {n_noise}")
    print(f" Clustered points: {len(gdf) - n_noise}")

    # 2. Moran's I for spatial autocorrelation
    if len(gdf) > 100:  # Only if we have enough points
        # Remove any rows with missing values for spatial analysis
        gdf_spatial = gdf.dropna(subset=['percent_poverty', 'median_household_income'])
        if len(gdf_spatial) > 100:
            # Create spatial weights (K-nearest neighbors)
            coords_array = np.column_stack([gdf_spatial.geometry.x, gdf_spatial.geometry.y])
            w = KNN.from_array(coords_array, k=min(8, len(gdf_spatial) - 1))
            w.transform = 'r'  # Row standardization
            # Test spatial autocorrelation of poverty rates
            try:
                moran_poverty = esda.Moran(gdf_spatial['percent_poverty'], w)
                print("\nSpatial Autocorrelation (Moran's I):")
                print(f" Poverty rate Moran's I: {moran_poverty.I:.4f}")
                print(f" p-value: {moran_poverty.p_sim:.4f}")
                print(f" Significant clustering: {'YES' if moran_poverty.p_sim < 0.05 else 'NO'}")

                # Test for income
                moran_income = esda.Moran(gdf_spatial['median_household_income'], w)
                print(f" Income Moran's I: {moran_income.I:.4f}")
                print(f" p-value: {moran_income.p_sim:.4f}")

                # LISA analysis for local clusters
                lisa_poverty = esda.Moran_Local(gdf_spatial['percent_poverty'], w)
                # Count significant LISA clusters
                significant_clusters = np.sum(lisa_poverty.p_sim < 0.05)
                print(f" Significant local poverty clusters: {significant_clusters}")
            except Exception as e:
                print(f" Spatial autocorrelation analysis failed: {e}")
        else:
            print(f" Insufficient valid spatial data: {len(gdf_spatial)} points")

    # 3. Hotspot Analysis: create grid and count spills per cell
    xmin, ymin, xmax, ymax = gdf_proj.total_bounds
    # Create 5km x 5km grid
    grid_size = 5000  # 5 km in meters
    x_coords = np.arange(xmin, xmax + grid_size, grid_size)
    y_coords = np.arange(ymin, ymax + grid_size, grid_size)
    spill_density = calculate_spill_density(gdf_proj, x_coords, y_coords, grid_size)
    print("\nHotspot Analysis:")
    print(f" Grid cells created: {len(spill_density)}")
    if len(spill_density) > 0:
        print(f" Max spills per 5km cell: {spill_density['spill_count'].max()}")
        print(f" Mean spills per cell: {spill_density['spill_count'].mean():.2f}")
    else:
        print(" No grid cells with spills found")
    return gdf, spill_density, n_clusters
def calculate_spill_density(gdf_proj, x_coords, y_coords, grid_size):
    """Calculate spill density on a grid"""
    density_data = []
    for x in x_coords[:-1]:
        for y in y_coords[:-1]:
            # Define grid cell bounds
            cell_bounds = (x, y, x + grid_size, y + grid_size)
            # Count spills in this cell
            mask = (
                (gdf_proj.geometry.x >= cell_bounds[0]) &
                (gdf_proj.geometry.x < cell_bounds[2]) &
                (gdf_proj.geometry.y >= cell_bounds[1]) &
                (gdf_proj.geometry.y < cell_bounds[3])
            )
            spills_in_cell = gdf_proj[mask]
            if len(spills_in_cell) > 0:
                density_data.append({
                    'grid_x': x + grid_size / 2,
                    'grid_y': y + grid_size / 2,
                    'spill_count': len(spills_in_cell),
                    'avg_poverty': spills_in_cell['percent_poverty'].mean(),
                    'avg_income': spills_in_cell['median_household_income'].mean(),
                    'major_spills': (spills_in_cell['More than five barrels spilled'].astype(str) == 'Y').sum()
                })
    return pd.DataFrame(density_data)
def spatial_regression_analysis(gdf):
    """Perform spatial regression to control for location effects"""
    print("\nSPATIAL REGRESSION ANALYSIS")
    print("=" * 50)

    # Create variables for regression
    gdf_reg = gdf.copy()
    gdf_reg['major_spill'] = (gdf_reg['More than five barrels spilled'].astype(str) == 'Y').astype(int)
    gdf_reg['high_poverty'] = (gdf_reg['percent_poverty'] > 15).astype(int)
    gdf_reg['minority_community'] = (gdf_reg['percent_white'] < 70).astype(int)

    # Add spatial controls (distance to urban centers, etc.);
    # for now, use normalized lat/lon as proxies for spatial effects
    gdf_reg['lat_norm'] = (gdf_reg['Latitude'] - gdf_reg['Latitude'].mean()) / gdf_reg['Latitude'].std()
    gdf_reg['lon_norm'] = (gdf_reg['Longitude'] - gdf_reg['Longitude'].mean()) / gdf_reg['Longitude'].std()

    # OLS regression: Major spill probability ~ demographics + spatial controls
    model_formula = 'major_spill ~ percent_poverty + percent_white + median_household_income + lat_norm + lon_norm'
    try:
        model = ols(model_formula, data=gdf_reg).fit()
        print("Regression Results (Major Spill Probability):")
        print(f" R-squared: {model.rsquared:.4f}")
        print(f" F-statistic p-value: {model.f_pvalue:.6f}")

        # Key coefficients
        coef_poverty = model.params.get('percent_poverty', 0)
        pval_poverty = model.pvalues.get('percent_poverty', 1)
        coef_white = model.params.get('percent_white', 0)
        pval_white = model.pvalues.get('percent_white', 1)
        coef_income = model.params.get('median_household_income', 0)
        pval_income = model.pvalues.get('median_household_income', 1)
        print("\nKey Findings:")
        print(f" Poverty rate coefficient: {coef_poverty:.6f} (p={pval_poverty:.4f})")
        print(f" White percentage coefficient: {coef_white:.6f} (p={pval_white:.4f})")
        print(f" Income coefficient: {coef_income:.8f} (p={pval_income:.4f})")
        return model
    except Exception as e:
        print(f"Regression analysis failed: {e}")
        return None
def generate_spatial_statistical_report(stats_results, spatial_results, model_results):
    """Generate comprehensive report using LLM"""
    summary_text = f"""
STATISTICAL AND SPATIAL ANALYSIS SUMMARY:
STATISTICAL SIGNIFICANCE TESTS:
- Income distribution chi-square p-value: {stats_results['income_chi2']['p_value']:.6f}
- Poverty over-representation ratio: {stats_results['poverty_binomial']['observed_ratio']:.2f}x
- Poverty binomial test p-value: {stats_results['poverty_binomial']['p_value']:.6f}
- Major spills z-test p-value: {stats_results['major_spills_ztest']['p_value']:.6f}
- Minority community ratio: {stats_results['minority_binomial']['observed_ratio']:.2f}x
SPATIAL ANALYSIS:
- Number of spatial clusters identified: {spatial_results['n_clusters']}
- Spatial autocorrelation detected in poverty patterns
- Hotspots identified with up to {spatial_results.get('max_density', 'N/A')} spills per 5km grid
REGRESSION FINDINGS:
- Spatial controls included to account for facility locations
- Multiple demographic variables tested simultaneously
- Results control for geographic clustering effects
"""
    prompt = f"""
Based on this comprehensive statistical and spatial analysis of oil and gas spills, provide an academic-level interpretation of the environmental justice implications.
Analysis Results:
{summary_text}
Focus on:
1. Statistical significance of demographic disparities
2. Spatial clustering patterns and their implications
3. Whether disparities persist after controlling for spatial effects
4. Methodological strengths and limitations
5. Policy implications for environmental justice
6. Recommendations for further research
Format as a rigorous academic discussion suitable for a public policy journal, emphasizing both statistical rigor and practical policy relevance.
"""
    return query_ollama(prompt)
def create_visualizations(gdf, spill_density):
    """Create key visualizations"""
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))

    # 1. Spill locations by poverty rate
    ax1 = axes[0, 0]
    scatter = ax1.scatter(gdf['Longitude'], gdf['Latitude'],
                          c=gdf['percent_poverty'], cmap='Reds',
                          alpha=0.6, s=10)
    ax1.set_title('Spill Locations by Poverty Rate')
    ax1.set_xlabel('Longitude')
    ax1.set_ylabel('Latitude')
    plt.colorbar(scatter, ax=ax1, label='Poverty Rate (%)')

    # 2. Income distribution
    ax2 = axes[0, 1]
    income_quartiles = pd.qcut(gdf['median_household_income'], 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
    income_counts = gdf.groupby(income_quartiles).size()
    ax2.bar(income_counts.index, income_counts.values)
    ax2.set_title('Spills by Income Quartile')
    ax2.set_xlabel('Income Quartile')
    ax2.set_ylabel('Number of Spills')

    # 3. Major spills by demographics
    ax3 = axes[1, 0]
    demo_data = pd.DataFrame({
        'High Poverty': [
            len(gdf[(gdf['percent_poverty'] > 15) & (gdf['More than five barrels spilled'].astype(str) == 'Y')]),
            len(gdf[(gdf['percent_poverty'] > 15) & (gdf['More than five barrels spilled'].astype(str) != 'Y')])
        ],
        'Low Poverty': [
            len(gdf[(gdf['percent_poverty'] <= 15) & (gdf['More than five barrels spilled'].astype(str) == 'Y')]),
            len(gdf[(gdf['percent_poverty'] <= 15) & (gdf['More than five barrels spilled'].astype(str) != 'Y')])
        ]
    }, index=['Major Spills', 'Minor Spills'])
    demo_data.plot(kind='bar', ax=ax3, stacked=True)
    ax3.set_title('Spill Severity by Poverty Level')
    ax3.set_xlabel('Spill Type')
    ax3.set_ylabel('Count')
    ax3.legend(title='Community Type')

    # 4. Spatial density
    ax4 = axes[1, 1]
    if len(spill_density) > 0:
        scatter2 = ax4.scatter(spill_density['grid_x'], spill_density['grid_y'],
                               c=spill_density['spill_count'], cmap='YlOrRd',
                               s=spill_density['spill_count'] * 10, alpha=0.7)
        ax4.set_title('Spill Density Hotspots (5km Grid)')
        ax4.set_xlabel('X Coordinate (Projected)')
        ax4.set_ylabel('Y Coordinate (Projected)')
        plt.colorbar(scatter2, ax=ax4, label='Spills per Cell')
    plt.tight_layout()
    plt.savefig('environmental_justice_analysis.png', dpi=300, bbox_inches='tight')
    plt.show()
# Main execution
def run_comprehensive_analysis(csv_file):
    """Run complete statistical and spatial analysis"""
    print("COMPREHENSIVE STATISTICAL & SPATIAL ENVIRONMENTAL JUSTICE ANALYSIS")
    print("=" * 80)

    # Load data
    df = pd.read_csv(csv_file)
    print(f"Loaded {len(df)} spill incidents")

    # Statistical analysis
    stats_results = statistical_disparity_tests(df)

    # Spatial analysis
    gdf, spill_density, n_clusters = spatial_analysis(df)

    # Spatial regression
    model = spatial_regression_analysis(gdf)

    # Create visualizations
    create_visualizations(gdf, spill_density)

    # Generate comprehensive report
    spatial_results = {'n_clusters': n_clusters}
    if len(spill_density) > 0:
        spatial_results['max_density'] = spill_density['spill_count'].max()
    model_summary = str(model.summary()) if model else "Regression analysis not available"
    report = generate_spatial_statistical_report(stats_results, spatial_results, model_summary)

    # Save results (query_ollama returns None on failure, so guard the report)
    results = {
        'statistical_tests': stats_results,
        'spatial_analysis': spatial_results,
        'regression_summary': model_summary,
        'academic_interpretation': report
    }
    with open('statistical_spatial_analysis.json', 'w') as f:
        json.dump(results, f, indent=2, default=str)
    with open('academic_report.txt', 'w') as f:
        f.write(report or "LLM report unavailable")

    print("\nAnalysis complete. Results saved to:")
    print(" - statistical_spatial_analysis.json")
    print(" - academic_report.txt")
    print(" - environmental_justice_analysis.png")
    return results


if __name__ == "__main__":
    results = run_comprehensive_analysis('spills_with_demographics.csv')
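The script's exact binomial test for high-poverty over-representation can be approximated without SciPy. A stdlib-only sketch using the counts from the run log (3497 high-poverty spills of 16890, null rate 20%); the normal approximation lands close to the exact p = 0.011556 the script reports:

```python
import math

# Normal approximation to the poverty binomial test (scipy-free sketch).
n, k, p0 = 16890, 3497, 0.20           # total spills, high-poverty spills, null rate

mu = n * p0                             # 3378 expected under the null
sigma = math.sqrt(n * p0 * (1 - p0))
z = (k - 0.5 - mu) / sigma              # continuity-corrected one-sided z
p_approx = 0.5 * math.erfc(z / math.sqrt(2))

ratio = k / mu                          # the "1.04x" ratio in the log
print(f"ratio = {ratio:.2f}x, z = {z:.2f}, p = {p_approx:.4f}")
```

At the script's 0.001 threshold this is correctly reported as not significant, even though p falls below the conventional 0.05.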


@@ -1,66 +0,0 @@
COMPREHENSIVE STATISTICAL & SPATIAL ENVIRONMENTAL JUSTICE ANALYSIS
================================================================================
Loaded 16890 spill incidents
STATISTICAL SIGNIFICANCE TESTS
==================================================
Income Distribution Test:
Chi-square statistic: 361.694
p-value: 0.000000
Significant disparity: YES
Poverty Analysis:
High-poverty spills: 3497
Expected (if random): 3378
Ratio: 1.04x
Binomial test p-value: 0.011556
Significant over-representation: NO
Major Spills in High-Poverty Areas:
High poverty major spill rate: 0.369
Low poverty major spill rate: 0.269
Z-statistic: 11.598
p-value: 0.000000
Significant difference: YES
Racial Demographics Analysis:
Minority community spills: 1047
Expected (if random): 5067
Ratio: 0.21x
Binomial test p-value: 1.000000
Significant over-representation: NO
SPATIAL ANALYSIS
==================================================
Spatial Clustering Results:
Number of clusters: 259
Number of noise points: 4749
Clustered points: 12141
Spatial Autocorrelation (Moran's I):
Poverty rate Moran's I: 0.9714
p-value: 0.0010
Significant clustering: YES
Income Moran's I: 0.9585
p-value: 0.0010
Significant local poverty clusters: 9209
Hotspot Analysis:
Grid cells created: 1189
Max spills per 5km cell: 119
Mean spills per cell: 14.21
SPATIAL REGRESSION ANALYSIS
==================================================
Regression Results (Major Spill Probability):
R-squared: 0.0547
F-statistic p-value: 0.000000
Key Findings:
Poverty rate coefficient: 0.009572 (p=0.0000)
White percentage coefficient: 0.004621 (p=0.0000)
Income coefficient: -0.00000098 (p=0.0000)
Analysis complete. Results saved to:
- statistical_spatial_analysis.json
- academic_report.txt
- environmental_justice_analysis.png
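The Z-statistic of 11.598 in the log above follows from the standard pooled two-proportion formula. A stdlib-only sketch, using only counts that appear in this commit (1289 of 3497 high-poverty spills were major, 3599 of the remaining 13393):

```python
import math

# Reproduce the two-proportion z-test from the run log
# using the pooled-variance formula (stdlib only).
major = [1289, 3599]                   # major spills: high-poverty, low-poverty areas
totals = [3497, 13393]                 # all spills in each group (16890 total)

p1, p2 = major[0] / totals[0], major[1] / totals[1]
p_pool = sum(major) / sum(totals)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / totals[0] + 1 / totals[1]))
z = (p1 - p2) / se
print(f"rates: {p1:.3f} vs {p2:.3f}, z = {z:.3f}")  # 0.369 vs 0.269, z = 11.598
```

This is the same pooled statistic `statsmodels.stats.proportion.proportions_ztest` computes in the script, so the hand calculation matches the logged value.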


@@ -1,307 +0,0 @@
import pandas as pd
import requests
import json
from collections import Counter, defaultdict
import numpy as np


def query_ollama(prompt, model="mistral"):
    """Send query to local Ollama instance"""
    try:
        response = requests.post(
            'http://localhost:11434/api/generate',
            json={'model': model, 'prompt': prompt, 'stream': False}
        )
        return response.json()['response']
    except Exception as e:
        print(f"Error querying Ollama: {e}")
        return None
def analyze_spill_demographics(df):
    """Analyze demographic patterns in spill data"""
    # Basic demographic statistics
    demo_stats = {
        'total_spills': len(df),
        'avg_median_income': df['median_household_income'].mean(),
        'avg_poverty_rate': df['percent_poverty'].mean(),
        'avg_white_percentage': df['percent_white'].mean(),
        'avg_hispanic_percentage': df['percent_hispanic'].mean(),
        'avg_unemployment': df['unemployment_rate'].mean()
    }

    # Environmental justice analysis
    # Define high-poverty communities (>15% poverty rate)
    high_poverty = df[df['percent_poverty'] > 15]
    low_poverty = df[df['percent_poverty'] <= 15]
    # Define minority communities (>30% non-white)
    minority_communities = df[df['percent_white'] < 70]

    # Convert spill volumes to numeric; 'Unknown' values become NaN
    high_poverty_volumes = pd.to_numeric(high_poverty['Produced Water Spill Volume'], errors='coerce')

    ej_analysis = {
        'high_poverty_spills': len(high_poverty),
        # Note: despite the key name, this sums (not averages) the numeric
        # volumes; an all-NaN column sums to 0.0, as the saved JSON shows
        'high_poverty_avg_volume': high_poverty_volumes.sum(),
        'minority_community_spills': len(minority_communities),
        'spills_by_income_quartile': df.groupby(
            pd.qcut(df['median_household_income'], 4, labels=['Q1(Lowest)', 'Q2', 'Q3', 'Q4(Highest)'])
        ).size().to_dict(),
        'major_spills_by_poverty': {
            'high_poverty_major': len(high_poverty[high_poverty['More than five barrels spilled'] == 'Y']),
            'low_poverty_major': len(low_poverty[low_poverty['More than five barrels spilled'] == 'Y'])
        }
    }
    return demo_stats, ej_analysis
def analyze_root_causes(df):
    """Analyze already-categorized root causes"""
    # Count existing cause categories, handling NaN values
    cause_counts = {
        'human_error': df['Human Error'].fillna(0).sum(),
        'equipment_failure': df['Equipment Failure'].fillna(0).sum(),
        'historical_unknown': df['Historical Unkown'].fillna(0).sum(),  # Note: typo in original data
        'other': df['Other'].fillna(0).sum()
    }
    # Get specific root cause descriptions
    root_causes = df['Root Cause'].dropna().value_counts().head(10)
    return cause_counts, root_causes


def analyze_spill_themes_llm(df, sample_size=50):
    """Use LLM to analyze themes in spill descriptions"""
    # Sample descriptions for LLM analysis (to avoid overwhelming it)
    descriptions_series = df['Spill Description'].dropna()
    if len(descriptions_series) == 0:
        return "No spill descriptions available for analysis."
    sample_descriptions = descriptions_series.sample(min(sample_size, len(descriptions_series))).tolist()
    # Combine descriptions for batch analysis
    combined_text = "\n---\n".join(sample_descriptions)
    prompt = f"""
Analyze these oil and gas spill incident descriptions to identify themes and patterns.
Focus on:
1. Common equipment failures (tanks, valves, pipelines, etc.)
2. Operational issues (overflow, leaks, maintenance problems)
3. Environmental factors (weather, terrain, wildlife)
4. Human factors (operator error, maintenance issues)
5. Discovery methods (routine inspection, alarms, third-party reports)
6. Spill severity indicators
Incident descriptions:
{combined_text}
Provide a structured analysis with:
- Top 5 equipment failure patterns
- Most common operational issues
- Environmental risk factors
- Human factor patterns
- Recommendations for prevention based on these patterns
Format as a concise regulatory summary suitable for policy recommendations.
"""
    return query_ollama(prompt)
def demographic_spill_analysis(df):
    """Analyze spill patterns by demographic characteristics"""
    # Create demographic categories
    df_analysis = df.copy()
    df_analysis['income_category'] = pd.cut(df_analysis['median_household_income'],
                                            bins=3, labels=['Low Income', 'Middle Income', 'High Income'])
    df_analysis['poverty_category'] = pd.cut(df_analysis['percent_poverty'],
                                             bins=[0, 10, 20, 100],
                                             labels=['Low Poverty', 'Moderate Poverty', 'High Poverty'])
    df_analysis['race_category'] = df_analysis['percent_white'].apply(
        lambda x: 'Majority White' if x >= 70 else 'Minority Community'
    )

    # Analyze spill patterns by demographics
    demo_patterns = {
        'spills_by_income': df_analysis.groupby('income_category').size().to_dict(),
        'spills_by_poverty': df_analysis.groupby('poverty_category').size().to_dict(),
        'spills_by_race': df_analysis.groupby('race_category').size().to_dict(),
        'volume_by_demographics': {
            'high_poverty_major_spills': len(df_analysis[(df_analysis['percent_poverty'] > 15) &
                                                         (df_analysis['More than five barrels spilled'].astype(str) == 'Y')]),
            'minority_major_spills': len(df_analysis[(df_analysis['percent_white'] < 70) &
                                                     (df_analysis['More than five barrels spilled'].astype(str) == 'Y')])
        }
    }
    return demo_patterns
def analyze_environmental_justice(df, sample_descriptions=20):
    """Use LLM to analyze environmental justice implications"""
    # Get descriptions from high-poverty and minority communities
    high_poverty_desc = df[df['percent_poverty'] > 15]['Spill Description'].dropna()
    minority_desc = df[df['percent_white'] < 70]['Spill Description'].dropna()
    if len(high_poverty_desc) == 0 or len(minority_desc) == 0:
        return "Insufficient data for environmental justice analysis."
    high_poverty_spills = high_poverty_desc.sample(min(sample_descriptions // 2, len(high_poverty_desc))).tolist()
    minority_spills = minority_desc.sample(min(sample_descriptions // 2, len(minority_desc))).tolist()
    # Label each group once, rather than repeating the label as a join
    # separator (which tagged every gap instead of marking the two groups)
    combined_ej_text = (
        "---HIGH POVERTY AREA---\n" + "\n---\n".join(high_poverty_spills) +
        "\n---MINORITY COMMUNITY---\n" + "\n---\n".join(minority_spills)
    )
    prompt = f"""
Analyze these spill incidents from high-poverty and minority communities for environmental justice concerns.
Consider:
1. Severity of incidents in vulnerable communities
2. Response effectiveness and cleanup completion
3. Long-term environmental impacts
4. Patterns that might indicate disproportionate impacts
5. Regulatory compliance and enforcement patterns
Spill descriptions:
{combined_ej_text}
Provide an environmental justice assessment focusing on:
- Whether vulnerable communities face more severe incidents
- Quality of response and remediation
- Policy recommendations for equitable environmental protection
"""
    return query_ollama(prompt)
def comprehensive_spill_analysis(csv_file):
    """Run complete analysis of spill data"""
    print("Loading spill data...")
    df = pd.read_csv(csv_file)
    print(f"Analyzing {len(df)} spill incidents...")

    # Basic demographic analysis
    demo_stats, ej_analysis = analyze_spill_demographics(df)
    # Root cause analysis (using existing categorizations)
    cause_counts, root_causes = analyze_root_causes(df)
    # Demographic patterns
    demo_patterns = demographic_spill_analysis(df)

    # LLM-based theme analysis
    print("Running LLM analysis on spill descriptions...")
    theme_analysis = analyze_spill_themes_llm(df, sample_size=100)
    # Environmental justice analysis
    print("Analyzing environmental justice implications...")
    ej_llm_analysis = analyze_environmental_justice(df, sample_descriptions=30)

    # Compile comprehensive results
    results = {
        'summary_statistics': {
            'total_incidents': len(df),
            'date_range': f"{df['Date of Discovery'].min()} to {df['Date of Discovery'].max()}",
            'counties_affected': df['county'].nunique(),
            'operators_involved': df['Operator'].nunique()
        },
        'demographic_statistics': demo_stats,
        'environmental_justice_analysis': ej_analysis,
        'root_cause_analysis': {
            'cause_counts': cause_counts,
            'top_root_causes': root_causes.to_dict()
        },
        'demographic_patterns': demo_patterns,
        'llm_theme_analysis': theme_analysis,
        'llm_environmental_justice': ej_llm_analysis
    }
    return results
def generate_policy_report(results):
    """Generate policy-focused summary using LLM"""
    # Create summary for LLM to process
    summary_text = f"""
SPILL DATA ANALYSIS SUMMARY:
Total Incidents: {results['summary_statistics']['total_incidents']}
Date Range: {results['summary_statistics']['date_range']}
DEMOGRAPHIC PATTERNS:
- Average poverty rate in affected areas: {results['demographic_statistics']['avg_poverty_rate']:.1f}%
- Average income: ${results['demographic_statistics']['avg_median_income']:,.0f}
- Spills in high-poverty areas: {results['environmental_justice_analysis']['high_poverty_spills']}
- Spills in minority communities: {results['environmental_justice_analysis']['minority_community_spills']}
ROOT CAUSES:
- Equipment failures: {results['root_cause_analysis']['cause_counts']['equipment_failure']}
- Human error: {results['root_cause_analysis']['cause_counts']['human_error']}
- Historical/unknown: {results['root_cause_analysis']['cause_counts']['historical_unknown']}
THEME ANALYSIS:
{results['llm_theme_analysis']}
ENVIRONMENTAL JUSTICE ANALYSIS:
{results['llm_environmental_justice']}
"""
    policy_prompt = f"""
Based on this comprehensive spill data analysis, create a policy-focused executive summary.
Data Summary:
{summary_text}
Provide:
1. Key findings on environmental justice impacts
2. Priority areas for regulatory attention
3. Specific policy recommendations for prevention
4. Recommendations for equitable enforcement
5. Suggested regulatory changes based on patterns identified
Format as an executive summary suitable for regulatory decision-makers and policy researchers.
"""
    return query_ollama(policy_prompt)
# Execute comprehensive analysis
if __name__ == "__main__":
    # Run the analysis
    results = comprehensive_spill_analysis('spills_with_demographics.csv')

    # Generate policy report
    print("\nGenerating policy-focused summary...")
    policy_report = generate_policy_report(results)

    # Save all results (query_ollama returns None on failure, so guard the report)
    with open('comprehensive_spill_analysis.json', 'w') as f:
        json.dump(results, f, indent=2, default=str)
    with open('policy_executive_summary.txt', 'w') as f:
        f.write(policy_report or "LLM policy summary unavailable")

    # Print key findings
    print("\n" + "=" * 60)
    print("COMPREHENSIVE SPILL ANALYSIS COMPLETE")
    print("=" * 60)
    print(f"\nTotal incidents analyzed: {results['summary_statistics']['total_incidents']:,}")
    print(f"Counties affected: {results['summary_statistics']['counties_affected']}")
    print(f"Average poverty rate in affected areas: {results['demographic_statistics']['avg_poverty_rate']:.1f}%")
    print(f"Spills in high-poverty communities: {results['environmental_justice_analysis']['high_poverty_spills']:,}")
    print(f"Spills in minority communities: {results['environmental_justice_analysis']['minority_community_spills']:,}")
    print("\nRoot cause breakdown:")
    for cause, count in results['root_cause_analysis']['cause_counts'].items():
        print(f" {cause.replace('_', ' ').title()}: {count:,}")
    print("\nResults saved to:")
    print(" - comprehensive_spill_analysis.json (detailed data)")
    print(" - policy_executive_summary.txt (executive summary)")
    print("\nPolicy Summary Preview:")
    print("=" * 40)
    if policy_report:
        print(policy_report[:500] + "...")