Hidden Stories in Hydrologic Literature: An Interactive Topic-Based Ontology

Mashrekur Rahman; Grey Nearing; Jonathan Frame

doi:https://doi.org/10.5194/egusphere-egu2020-882

Session HS1.2.3

EGU2020-882, updated on 05 Aug 2021

EGU General Assembly 2020

© Author(s) 2021. This work is distributed under the Creative Commons Attribution 4.0 License.

Department of Geological Sciences, University of Alabama, Tuscaloosa, AL, USA (geology@geo.ua.edu)

Hydrologic research generates massive volumes of peer-reviewed literature across a wide range of evolving topics and sub-topics. It is becoming increasingly difficult for scientists and practitioners to synthesize and leverage the full body of scientific literature. Recent advances in computational linguistics and machine learning, including a variety of toolboxes for Natural Language Processing (NLP), facilitate the analysis of vast electronic corpora for a multitude of objectives. Research papers published as electronic text files in different journals offer windows into trending topics and developments, and NLP allows us to extract information and insight about these trends.

This project applies Latent Dirichlet Allocation (LDA) topic modeling to bibliometric analyses of all peer-reviewed articles in selected high-impact (Impact Factor > 0.9) hydrology journals (Water Resources Research, Hydrology and Earth System Sciences, Journal of Hydrology, Hydrological Processes, Advances in Water Resources, Hydrological Sciences Journal, Journal of Hydrometeorology). Topic modeling uses statistical algorithms to extract semantic information from a collection of texts and is an emerging quantitative method for assessing large volumes of textual data. After acquiring all the papers published in these journals, we applied multiple pre-processing routines, including removal of punctuation, nonsensical text, and stopwords, as well as tokenization, stemming, and lemmatization. The resulting corpus was fed to the LDA model for ‘learning’ latent intellectual topics. We achieved this using Gensim, an open-source Python library widely used for unsupervised semantic modeling with LDA. The optimal number of topics (k) and the model hyperparameters were chosen by comparing coherence and perplexity values across multiple LDA models with varying k. The resulting topics are interpretable based on our prior knowledge of hydrology and related sub-disciplines. We performed comparative topic-trend, term-level, and document-level cluster analyses across different time periods, journals, and authors. These analyses revealed that topics such as climate change research have gained popularity in hydrology over the last decade.
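The workflow described above can be illustrated with a minimal sketch. It is not the authors' code: the stopword list, the function names `preprocess` and `fit_lda_for_k_values`, the number of passes, and the choice of the `u_mass` coherence measure are all assumptions made for illustration, and stemming and lemmatization are omitted because they would need an additional library such as NLTK or spaCy.

```python
import re

# Tiny illustrative stopword list (an assumption; a real run would use a fuller list).
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to", "with"}


def preprocess(text):
    """Lowercase, drop punctuation and digits, tokenize, and remove stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]


def fit_lda_for_k_values(token_lists, k_values, seed=0):
    """Fit one LDA model per candidate k and collect coherence and perplexity.

    Returns a list of (k, coherence, per-word likelihood bound) tuples.
    """
    # Imported here so the pre-processing step runs without Gensim installed.
    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel, LdaModel

    dictionary = Dictionary(token_lists)
    corpus = [dictionary.doc2bow(tokens) for tokens in token_lists]

    results = []
    for k in k_values:
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                       passes=10, random_state=seed)
        # u_mass coherence is an assumed choice; the abstract does not name a measure.
        coherence = CoherenceModel(model=lda, corpus=corpus,
                                   dictionary=dictionary,
                                   coherence="u_mass").get_coherence()
        # log_perplexity returns a per-word likelihood bound;
        # the perplexity itself is 2 ** (-bound).
        bound = lda.log_perplexity(corpus)
        results.append((k, coherence, bound))
    return results
```

In use, each article's text would be passed through `preprocess`, and the resulting token lists handed to `fit_lda_for_k_values` with a range of candidate k values; the k with the best trade-off between coherence and perplexity would then be inspected by hand for interpretability.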

We aim to combine these results with the probability distributions over topics, journals, and authors to create an interactive ontology map that helps research scientists and environmental consultants explore relevant literature based on topics and topic relationships. The primary objective of this work is to allow science practitioners to explore new branches and connections in the hydrology literature and to facilitate comprehensive and inclusive literature reviews. Second-order beneficiaries are decision and policy makers: the project will provide insights into current research trends and help identify transitions and competing viewpoints in hydrologic research. The outcomes will also serve as tools for effective science communication and help bridge gaps between scientists and the stakeholders of their research.
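One ingredient of such a map could be a topic profile per journal or author, obtained by averaging the per-document topic probabilities that the LDA model assigns. The sketch below shows that aggregation only; the function name `mean_topic_distribution_by_group` and the toy numbers are hypothetical and are not results from the study.

```python
from collections import defaultdict


def mean_topic_distribution_by_group(doc_topics, doc_groups):
    """Average per-document topic probabilities within each group (e.g. journal)."""
    sums = {}
    counts = defaultdict(int)
    for probs, group in zip(doc_topics, doc_groups):
        if group not in sums:
            sums[group] = [0.0] * len(probs)
        sums[group] = [s + p for s, p in zip(sums[group], probs)]
        counts[group] += 1
    return {g: [s / counts[g] for s in total] for g, total in sums.items()}


# Toy values for illustration only, not results from the study.
doc_topics = [[0.5, 0.5], [1.0, 0.0], [0.25, 0.75]]
doc_groups = ["Journal A", "Journal A", "Journal B"]
by_journal = mean_topic_distribution_by_group(doc_topics, doc_groups)
```

The same function applies unchanged to authors or time periods by swapping the grouping labels, which is one way the journal, author, and topic views of the map could share a single representation.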

How to cite: Rahman, M., Nearing, G., and Frame, J.: Hidden Stories in Hydrologic Literature: An Interactive Topic-Based Ontology, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-882, https://doi.org/10.5194/egusphere-egu2020-882, 2019.
