Development and Application of a Model to Extract Disaster and Safety-related Information from News Big Data reported in the Media using Text Mining
- 1National Disaster Management Research Institute, Disaster investigation division, Ulsan, Korea, Republic of (ecofriend97@gmail.com)
- 2National Disaster Management Research Institute, Center for Disaster Risk Identification and Assessment, Ulsan, Korea, Republic of
In Korea, when a disaster occurs, numerous news articles about it are published very quickly through various media. These articles contain information that disaster researchers need, such as the causes of the disaster, problems that arose as it unfolded, and improvement measures suggested by experts. However, finding the articles that contain this information among the large volume of news reports is difficult and time-consuming. Accordingly, in this study we developed the R-Scanner model, where R stands for Risk, which uses text mining to extract the disaster and safety information that users want from large-scale news data, a form of unstructured big data. The model is built on natural language processing systems for Korean and English and performs Sentence Segmentation, Tokenization, and Morphological Analysis on the input text. Within the Morphological Analysis process, it performs Entity Recognition, Semantic Role Labeling, and Semantic Chunking. When the user enters keywords related to the desired information, the model extracts the articles containing that information from the news data, and the extracted articles can be downloaded in Excel format. To verify the performance of the model, we applied it to the landslides caused by torrential rains in Korea in 2023, which resulted in 14 deaths. The desired information was defined as the problems in the landslide occurrence process and the corresponding improvement measures, and keywords were set to extract each type of information. About 200 keywords related to problems were set, such as 'procrastination', 'defenseless', 'ignored', 'sloppy', and 'careless', and several dozen keywords related to improvement measures were set, such as 'suggested', 'should be prepared', and 'necessary'.
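The keyword-based extraction step described above can be illustrated with a minimal Python sketch. This is not the R-Scanner implementation: it assumes plain lowercase substring matching in place of the morphological analysis the model performs, it writes CSV rather than Excel, and the function names, the `TOPIC_KEYWORDS` list, and the sample articles are hypothetical. Only the example keywords themselves come from the abstract.

```python
import csv
import io

# Example keywords quoted in the abstract (the full lists are not given there).
PROBLEM_KEYWORDS = ["procrastination", "defenseless", "ignored", "sloppy", "careless"]
IMPROVEMENT_KEYWORDS = ["suggested", "should be prepared", "necessary"]
# Hypothetical topic filter; the abstract does not say how the disaster is selected.
TOPIC_KEYWORDS = ["landslide"]


def matched(text, keywords):
    """Return the keywords that occur in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return [kw for kw in keywords if kw in lowered]


def extract_articles(articles, topic_keywords, category_keywords):
    """Keep on-topic articles and tag them with each category whose keywords they contain."""
    rows = []
    for article in articles:
        text = article["text"]
        if not matched(text, topic_keywords):
            continue  # article is not about the disaster of interest
        for category, keywords in category_keywords.items():
            hits = matched(text, keywords)
            if hits:
                rows.append({
                    "title": article["title"],
                    "category": category,
                    "keywords": "; ".join(hits),
                })
    return rows


def to_csv(rows):
    """Serialize the extracted rows as CSV text (a stand-in for the Excel download)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "category", "keywords"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample articles, for illustration only.
SAMPLE = [
    {"title": "A", "text": "Warnings before the landslide were ignored by officials."},
    {"title": "B", "text": "Experts suggested that landslide evacuation plans are necessary."},
    {"title": "C", "text": "The city festival opened on Saturday."},
]

rows = extract_articles(
    SAMPLE,
    TOPIC_KEYWORDS,
    {"problem": PROBLEM_KEYWORDS, "improvement": IMPROVEMENT_KEYWORDS},
)
print(to_csv(rows))
```

Substring matching is the simplest possible choice and will miss inflected forms, which is one reason a morphological analyzer is useful for Korean text in particular.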
Applying the model extracted a total of 364 articles related to problems and improvement measures from 15,911,665 news articles published by 30 media outlets. Grouping the extracted problems and improvement measures by similar content yielded 24 problems and 22 improvement measures. A review by related experts confirmed that the derived problems and improvement measures were meaningful, and they were used as basic data for establishing government measures to prevent landslides. In the future, the model is expected to be used not only to establish government disaster countermeasures, but also to monitor disaster and safety issues in real time and to detect disaster risks at an early stage.
How to cite: Choi, S., Kim, D. W., Shin, E. H., and Kim, Y. J.: Development and Application of a Model to Extract Disaster and Safety-related Information from News Big Data reported in the Media using Text Mining, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-6817, https://doi.org/10.5194/egusphere-egu24-6817, 2024.