- 1 Virginia Tech, Civil Engineering, United States of America (elidcook51@gmail.com)
- 2 Virginia Tech, Department of Civil and Environmental Engineering
- 3 Virginia Tech, Department of Population Health Sciences
Boil water advisories (BWAs) are essential public health alerts issued when drinking water safety is compromised, yet the United States lacks a centralized database to track these events. Such a dataset would enable epidemiological studies, infrastructure resilience assessments, and policy analysis to better understand advisory causes, impacts, and regional disparities. This research introduces a scalable framework for building this database and a generalizable methodology for converting unstructured online information into machine-readable datasets. Our approach integrates automated web scraping with large language models (LLMs) to extract and standardize advisory attributes such as location, duration, and cause. Preliminary validation compares U.S. data against ground-truth datasets from Canada and Kentucky to assess coverage and accuracy, with early findings indicating substantial capture of advisories despite variability in reporting formats. Future work will refine search strategies to improve precision and extend this methodology to other domains lacking centralized data, such as water quality violations and emergency notifications. This study demonstrates the potential of combining web scraping and LLM-based text processing to address critical data gaps in environmental and public health monitoring.
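As a rough illustration of the standardization and validation steps described above, the following minimal Python sketch assumes that an LLM returns one JSON object per advisory with `location`, `issued`, `lifted`, and `cause` fields. The field names, the `Advisory` record, the normalization rules, and the matching key (location plus issue date) are all hypothetical choices made for this example; the abstract does not specify the actual schema, prompts, or matching criteria.

```python
import json
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class Advisory:
    """Hypothetical standardized record for one boil water advisory."""
    location: str            # normalized free-text place name (assumed convention)
    issued: date             # date the advisory was issued
    lifted: Optional[date]   # None if the advisory end date is unknown
    cause: str               # normalized cause label, "unknown" if missing

    @property
    def duration_days(self) -> Optional[int]:
        """Advisory length in days, or None when no end date was extracted."""
        return None if self.lifted is None else (self.lifted - self.issued).days


def parse_llm_output(raw: str) -> Advisory:
    """Convert an (assumed) JSON response from an LLM into an Advisory."""
    data = json.loads(raw)
    lifted = data.get("lifted")
    return Advisory(
        # Collapse repeated whitespace and lowercase so records compare equal.
        location=" ".join(data["location"].split()).lower(),
        issued=date.fromisoformat(data["issued"]),
        lifted=date.fromisoformat(lifted) if lifted else None,
        cause=(data.get("cause") or "unknown").strip().lower(),
    )


def coverage(extracted: Iterable[Advisory], ground_truth: Iterable[Advisory]) -> float:
    """Fraction of ground-truth advisories matched on (location, issue date)."""
    found = {(a.location, a.issued) for a in extracted}
    truth = {(a.location, a.issued) for a in ground_truth}
    return len(found & truth) / len(truth) if truth else 0.0
```

In this sketch, `coverage` corresponds to recall against a reference dataset. Exact matching on place name and issue date is a simplification: real advisories would likely need fuzzy location matching and some tolerance on dates.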
How to cite: Cook, E., Marston, L., and Cohen, A.: Toward a National Database of Boil Water Advisories in the United States Using Web Scraping and Large Language Models, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-7185, https://doi.org/10.5194/egusphere-egu26-7185, 2026.