updated notes

This commit is contained in:
2025-08-15 06:59:23 -07:00
parent 49823e19ec
commit 388f0ae1c2

@@ -14116,133 +14116,6 @@
"print('Wrote', out_col)"
]
},
{
"cell_type": "markdown",
"id": "1ac9b815",
"metadata": {},
"source": [
"Short contract (what to state exactly in Methods)\n",
"Inputs: event-level rows with report_delay (days), spill_type, Period (Before 2020 vs 2020 and After), rurality (RUCA-derived), trimmed per IQR (with winsorize sensitivity).\n",
"Primary model: Poisson GLM, formula: report_delay ~ C(spill_type) * C(Period) * C(rurality).\n",
"Inference: analytic HC3 SEs where available; otherwise, parametric bootstrap (Poisson) and nonparametric case bootstrap as robustness; report bootstrap medians and 95% CIs.\n",
"Sensitivity: Negative-Binomial (empirical alpha MOM, GLM-NB; discrete NB MLE when stable) with parametric NB bootstrap.\n",
"ITS: monthly aggregated counts, OLS-level ITS with NeweyWest HAC (lag=3) and case bootstrap for CIs; report level and trend changes.\n",
"Outputs: CSVs and PNGs in analysis/new analysis Aug 2025 (list below).\n",
"Edge cases to mention in methods\n",
"\n",
"Missing/zero delays, negative derived delays (if any) — how handled.\n",
"Small groups (few events by spill_type × rurality) — warn about wide bootstrap intervals.\n",
"Discrete NB MLE failures — describe fallback to GLM-NB and per-draw fallback in bootstrap.\n",
"Methods — detailed outline (what to write and roughly how)\n",
"Data and sample\n",
"\n",
"Source(s), time window, inclusion/exclusion, how report_delay is computed (dates used), unit (days), minimal cleaning steps.\n",
"RUCA → rurality categories and top-3 counties filter if used.\n",
"Outcome and grouping variables\n",
"\n",
"Define report_delay, spill_type, Period (how 2020 breakpoint defined), rurality.\n",
"Show counts by group (small table).\n",
"Preprocessing and outlier handling\n",
"\n",
"IQR-trimming rule, lower truncation to zero, winsorize(99%) as sensitivity.\n",
"Provide exact row counts removed/kept and refer to spills_trimmed.parquet or CSV exports.\n",
"Primary analytic model\n",
"\n",
"State model formula (as above).\n",
"Explain link function (log), interpretation of coefficients (on conditional mean), and how predicted group averages are calculated (delta method vs bootstrap).\n",
"Inference strategy\n",
"\n",
"HC3 analytic SEs: when and why used (cite HC3).\n",
"Parametric Poisson bootstrap: describe simulation-refit procedure, number of draws (B), how predicted medians and CI constructed.\n",
"Nonparametric (case) bootstrap: describe resampling rows, B draws.\n",
"Decision rule: when bootstrap used instead of analytic SE (e.g., HC3 not available or questionable diagnostics).\n",
"Negative-Binomial sensitivity\n",
"\n",
"How empirical alpha estimated (MOM), GLM-NB fit with alpha, discrete NB MLE attempt; parametric NB bootstrap via gamma→Poisson simulation; per-draw fallback to GLM if discrete MLE fails.\n",
"Mention nb_parametric_boot_predicted_means_appended.csv and nb_contrasts_*.csv.\n",
"ITS methods\n",
"\n",
"Aggregation to monthly counts, reindex to complete months.\n",
"OLS ITS formula (level + trend + change in level and slope at 2020), NeweyWest HAC (lag choice) and case bootstrap for CIs.\n",
"Refer to its_summary.csv and its_combined.png.\n",
"Software and reproducibility\n",
"\n",
"Python libs (pandas, statsmodels, scipy, matplotlib), notebook path analayis11_2020_nooutliers.ipynb.\n",
"Location of final CSVs/PNGs: analysis/new analysis Aug 2025/ (explicit list below).\n",
"State random-seed behavior for bootstrap and B used.\n",
"Results — structure & writing strategy\n",
"Write Results in this order (each with concise numerical statements and a short interpretation sentence):\n",
"\n",
"Sample description\n",
"\n",
"N events, median report_delay overall and by rurality/spill_type, noting trimmed/winsorized fraction.\n",
"Primary Poisson results\n",
"\n",
"Present a compact table: predicted median delays (days) and 95% CIs for each spill_type × Period × rurality. Use final_contrast_comparison_combined.csv as source.\n",
"Key text: highlight the most policy-relevant contrast (e.g., Urban, 2020 and After: median change, CI, p-value, translation to hours and percent). Provide numeric translation (days → hours) and percent-change where helpful.\n",
"NB sensitivity summary\n",
"\n",
"State whether NB median estimates materially change direction/magnitude compared to Poisson; emphasize if CIs overlap or differ.\n",
"If NB indicates larger negative effects but with much wider uncertainty, state that clearly and recommend conservatism in interpretation.\n",
"ITS findings\n",
"\n",
"Summarize level and slope changes around 2020 for total counts and by spill type × rurality as needed. Point to its_combined.png and its_summary.csv.\n",
"Robustness\n",
"\n",
"Report winsorize sensitivity and IQR trimming effects (if similar, state robust).\n",
"Note any groups with unstable estimates (small counts, wide CIs).\n",
"Short policy interpretation paragraph\n",
"\n",
"Translate numeric findings to practical terms: e.g., “Urban spills after 2020 show a median 2.9day decrease in reporting delay (CI ...), approximately 69 hours faster, which may affect exposure-response timing and public notice windows.”\n",
"Temper causal claims: describe change vs causal attribution.\n",
"Results — recommended tables & figures (and suggested captions)\n",
"Table 1: Sample characteristics (N, median delay, IQR) by Period × rurality.\n",
"Table 2 (main): Predicted median change in delay (days), 95% CI, p-value, hours and percent change — use final_contrast_comparison_combined.csv.\n",
"Caption: “Predicted changes in reporting delay (days) by period and rurality from Poisson and NB sensitivity models. Medians and 95% bootstrap CIs reported.”\n",
"Figure 1: final_contrast_comparison_multipanel.png (or symlog annotated): Poisson vs NB estimates by rurality and Period.\n",
"Caption: use the symlog caption provided earlier.\n",
"Figure 2: ITS figure: its_combined.png.\n",
"Caption: “Interrupted time series of monthly counts with level and slope estimates (NeweyWest HAC; bootstrap CIs).”\n",
"Appendix figures: diagnostic plots (dispersion test, Pearson χ²/df), histograms of report_delay, bootstrap distribution histograms.\n",
"Appendix — items and exact content to include\n",
"Full model outputs\n",
"\n",
"Full GLM Poisson coefficient tables with HC3 and sandwich results.\n",
"GLM-NB and discrete NB MLE outputs; note any draws where MLE failed.\n",
"Bootstrap implementation details\n",
"\n",
"Exact B used per run, seed, how many successful draws (report effective draw count).\n",
"Code snippet (or notebook cell number) implementing parametric NB bootstrap and fallback logic.\n",
"Supplementary tables (CSV links)\n",
"\n",
"nb_parametric_boot_predicted_means_appended.csv\n",
"nb_contrasts_period.csv, nb_contrasts_spilltype.csv\n",
"contrast_comparison_*_enhanced.csv (if present)\n",
"final_contrast_comparison_combined.csv (already present)\n",
"Diagnostics and sensitivity plots\n",
"\n",
"Dispersion tests, residual plots, histogram of report_delay, winsorize vs trimmed comparisons.\n",
"ITS technical appendix\n",
"\n",
"ITS model formula, NW lag choice justification, full coefficient tables, bootstrap draws CSV for ITS coefficients.\n",
"Reproducible instructions\n",
"\n",
"One-paragraph “How to reproduce” with path to analayis11_2020_nooutliers.ipynb and the exact cell numbers to run (or top-to-bottom run order). Provide exact filenames to expect in analysis/new analysis Aug 2025.\n",
"Code archive & data dictionary\n",
"\n",
"Minimal README in appendix listing saved files and what each contains so reviewers can inspect.\n",
"Writing order I recommend (practical)\n",
"Draft Methods first (complete and precise; reviewers read this closely).\n",
"Produce main Results bullets and Table 2 using final_contrast_comparison_combined.csv.\n",
"Add Figure 1 (symlog annotated) and ITS figure.\n",
"Write short Discussion/Interpretation paragraph.\n",
"Finish Appendix with full outputs and code pointers.\n",
"Short suggested phrasing snippets\n",
"Methods lead: “We modelled reporting delay (days) using Poisson generalized linear models with a log link and a full interaction between spill type, Period (Before 2020 vs 2020 and After), and rurality. Where analytic HC3 robust standard errors were unavailable or diagnostics indicated misspecification, we report parametric bootstrap confidence intervals; NegativeBinomial sensitivity analyses used an empirical alpha and parametric NB bootstrap.”\n",
"Results lead: “In the primary Poisson specification, median estimated changes in reporting delay were near zero across rurality groups; NB sensitivity estimates tended toward larger negative changes but with substantially wider uncertainty (see Figure X).”\n",
"Limitations line: “Inference is conditional on correct model specification and trimming choices; NB sensitivity and bootstrap intervals address but do not eliminate residual model uncertainty.”"
]
},
{
"cell_type": "markdown",
"id": "6c0d31df",
@@ -14432,129 +14305,7 @@
"\n",
"## 7. Code Archive & Data Dictionary\n",
"\n",
"- Minimal README listing saved files and contents for reviewer inspection.\n",
"\n",
"---\n",
"\n",
"# Writing Order Recommendation\n",
"\n",
"1. Draft Methods (complete and precise).\n",
"2. Main Results bullets and Table 2 (using `final_contrast_comparison_combined.csv`).\n",
"3. Add Figure 1 (symlog annotated) and ITS figure.\n",
"4. Write short Discussion/Interpretation paragraph.\n",
"5. Finish Appendix with full outputs and code pointers.\n",
"\n",
"---\n",
"\n",
"# Suggested Phrasing Snippets\n",
"\n",
"- **Methods lead:** \n",
" “We modelled reporting delay (days) using Poisson generalized linear models with a log link and a full interaction between spill type, Period (Before 2020 vs 2020 and After), and rurality. Where analytic HC3 robust standard errors were unavailable or diagnostics indicated misspecification, we report parametric bootstrap confidence intervals; NegativeBinomial sensitivity analyses used an empirical alpha and parametric NB bootstrap.”\n",
"\n",
"- **Results lead:** \n",
" “In the primary Poisson specification, median estimated changes in reporting delay were near zero across rurality groups; NB sensitivity estimates tended toward larger negative changes but with substantially wider uncertainty (see Figure X).”\n",
"\n",
"- **Limitations line:** \n",
" “Inference is conditional on correct model specification and trimming choices; NB sensitivity and bootstrap intervals address but do not eliminate residual model uncertainty.”"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "91541d1f",
"metadata": {
"vscode": {
"languageId": "latex"
}
},
"outputs": [],
"source": [
"% -----------------------\n",
"% Methods\n",
"% -----------------------\n",
"\\section{Methods}\n",
"We analyzed event-level reporting delay (days), computed as the difference between initial report date and date of discovery, using the notebook \\texttt{analayis11\\_2020\\_nooutliers.ipynb}. Observations with implausible delays were inspected and extreme values were handled with an IQR-based trimming rule (lower bound truncated to zero); a 99th-percentile winsorized sensitivity check was also performed. The primary model was a Poisson generalized linear model (log link) with a full interaction:\n",
"\\[\n",
"\\text{report\\_delay} \\sim \\mathrm{C}(\\textit{spill\\_type}) \\times \\mathrm{C}(\\textit{Period}) \\times \\mathrm{C}(\\textit{rurality}).\n",
"\\]\n",
"When available and appropriate we report HC3 robust standard errors; where HC3 was unavailable or model diagnostics suggested misspecification we report parametric bootstrap confidence intervals (simulation under the fitted model, refit, $B\\approx 2000$ successful draws) and a nonparametric (case) bootstrap as a robustness check. NegativeBinomial sensitivity analyses used an empirical NB2 dispersion ($\\alpha$) estimated by method-of-moments, GLMNB fits, and discrete NB MLE where numerically stable; parametric NB bootstrap draws were generated by the gamma→Poisson mixture and refit with a GLMNB fallback if discrete MLE failed. Interrupted time series (ITS) analyses of monthly counts used OLS with NeweyWest HAC standard errors (lag=3) and a case bootstrap for confidence intervals. All code and output CSV/PNG files referenced below are saved in \\texttt{analysis/new analysis Aug 2025/}.\n",
"\n",
"% -----------------------\n",
"% Results\n",
"% -----------------------\n",
"\\section{Results}\n",
"In the primary Poisson specification, estimated changes in reporting delay were generally near zero and precisely estimated. For example, for Urban incidents in the 2020 and After period the Poisson model gives a median estimated change of approximately 0.29 days (95\\% CI 0.24 to 0.34 days; $\\approx$ 7.0 hours); see Table~\\ref{tab:main-contrasts} and Figure~\\ref{fig:poisson-vs-nb}. NegativeBinomial sensitivity estimates, which accommodate extra-Poisson variance, produce a larger (in absolute value) median for the same contrast: median \\(-2.87\\) days (95\\% CI \\(-3.73\\) to \\(-2.01\\) days; \\(\\approx\\) \\(-68.9\\) hours). The divergence in magnitude reflects sensitivity to the variance function: NB allows greater dispersion and therefore often yields larger point estimates with wider uncertainty. We present the Poisson results as primary and report NB results as sensitivity in the Appendix; contrasts and full bootstrap summaries are provided in the repository outputs.\n",
"\n",
"% Table reference (suggested)\n",
"\\begin{table}[t]\n",
"\\centering\n",
"\\caption{Selected predicted changes in reporting delay (days). Medians and 95\\% bootstrap CIs are shown for Poisson (primary) and NegativeBinomial (sensitivity). Full table: \\texttt{final\\_contrast\\_comparison\\_combined.csv}.}\n",
"\\label{tab:main-contrasts}\n",
"\\begin{tabular}{llrrr}\n",
"\\toprule\n",
"Period & Rurality & Model & Median (days) & 95\\% CI \\\\\n",
"\\midrule\n",
"2020 and After & Urban & Poisson & 0.29 & (0.24, 0.34) \\\\\n",
"2020 and After & Urban & NegBin & -2.87 & (-3.73, -2.01) \\\\\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"\\end{table}\n",
"\n",
"% Figure reference (suggested)\n",
"\\begin{figure}[t]\n",
"\\centering\n",
"\\includegraphics[width=0.95\\textwidth]{analysis/new analysis Aug 2025/final_contrast_comparison_symlog_annotated.png}\n",
"\\caption{Poisson vs NegativeBinomial estimated change in reporting delay (days), by rurality and period. Points show median estimates and vertical lines show 95\\% CIs; blue circles = Poisson GLM, orange squares = NegativeBinomial sensitivity. The signed-log (symlog) scale displays magnitudes of negative delays while preserving near-zero Poisson estimates.}\n",
"\\label{fig:poisson-vs-nb}\n",
"\\end{figure}\n",
"\n",
"% -----------------------\n",
"% Appendix / Reproducibility\n",
"% -----------------------\n",
"\\appendix\n",
"\\section{Appendix: reproducibility and full outputs}\n",
"All code used to generate the analyses is in \\texttt{analayis11\\_2020\\_nooutliers.ipynb}. Key output files (in \\texttt{analysis/new analysis Aug 2025/}) include:\n",
"\\begin{itemize}\n",
" \\item \\texttt{final\\_contrast\\_comparison\\_combined.csv} : combined Poisson and NB contrast table used for main text Table~\\ref{tab:main-contrasts}.\n",
" \\item \\texttt{final\\_contrast\\_comparison\\_symlog\\_annotated.png} and \\texttt{final\\_contrast\\_comparison\\_symlog\\_column.png} : annotated multipanel figures (Poisson vs NB).\n",
" \\item \\texttt{nb\\_parametric\\_boot\\_predicted\\_means\\_appended.csv}, \\texttt{nb\\_contrasts\\_period.csv}, \\texttt{nb\\_contrasts\\_spilltype.csv} : raw NB bootstrap predicted means and contrasts.\n",
" \\item \\texttt{its\\_combined.png}, \\texttt{its\\_summary.csv} : ITS figures and coefficient summaries.\n",
" \\item diagnostic and preprocessing exports: \\texttt{spills\\_trimmed.parquet}, \\texttt{spills\\_trimmed\\_removed.parquet}, etc.\n",
"\\end{itemize}\n",
"\n",
"Reproducibility note: to reproduce main tables and figures, run the notebook from top to bottom in a Python environment with pandas, statsmodels, scipy, and matplotlib installed; the notebook cells produce the CSVs and PNGs listed above. In the Appendix include (i) full GLM coefficient tables (HC3 and bootstrap), (ii) NB MLE notes (report any per-draw failures and the GLM fallback), and (iii) bootstrap diagnostics (number of draws, number of successful draws). "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5333ff82",
"metadata": {
"vscode": {
"languageId": "latex"
}
},
"outputs": [],
"source": [
"@inproceedings{seabold2010statsmodels,\n",
" title = {Statsmodels: Econometric and Statistical Modeling with {P}ython},\n",
" author = {Seabold, Skipper and Perktold, Josef},\n",
" booktitle = {Proceedings of the 9th Python in Science Conference},\n",
" pages = {61--66},\n",
" year = {2010},\n",
" url = {https://www.statsmodels.org/}\n",
"}\n",
"\n",
"@article{virtanen2020scipy,\n",
" title = {{SciPy} 1.0: Fundamental Algorithms for Scientific Computing in Python},\n",
" author = {Virtanen, Pauli and Gommers, Ralf and Oliphant, Travis E. and {et al.}},\n",
" journal = {Nature Methods},\n",
" volume = {17},\n",
" number = {3},\n",
" pages = {261--272},\n",
" year = {2020},\n",
" doi = {10.1038/s41592-019-0686-2}\n",
"}"
"- Minimal README listing saved files and contents for reviewer inspection.\n"
]
}
],