- ¹Computational Humanities, Leipzig University, Leipzig, Germany
- ²Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI) Dresden/Leipzig, Leipzig University, Leipzig, Germany
- ³Department of Urban and Environmental Sociology, Helmholtz Centre for Environmental Research, Leipzig, Germany
The increasing use of text as data in environmental research offers valuable opportunities, but the inherent biases of textual sources such as news, social media, or disaster reports necessitate moving beyond purely descriptive analyses. While NLP techniques such as topic modeling and categorical annotation can identify emergent patterns, they often fail to elucidate the causal mechanisms driving observed phenomena, especially within the complex interplay of anthropogenic activities, societal structures, and environmental outcomes. The reductionist tendencies of NLP, particularly when applied to complex social phenomena, neglect the nuances of language and context, leading to trivial or superficial findings when results are merely validated post hoc against existing literature. This represents a missed opportunity: the extensive literature on climate research, for instance, could inform the a priori development of theoretical frameworks that guide the research process. Such a framework not only validates the variable constructs but also prevents findings from being derived and validated solely on the basis of detected patterns; instead, it explicitly searches for patterns that are relevant to the research question. In effect, it tests what is expected against what is unexpected, minimizing blind spots and unwarranted positivist claims. This does not preclude exploratory approaches that yield new hypotheses; rather, it ties them meaningfully to the actual research question.
To address this, a theoretically grounded approach is crucial, moving from describing "what" to explaining "why." This entails embedding the research question within a robust theoretical framework, operationalizing key concepts into measurable variables, and developing a coding scheme that links these variables to their manifestations in the text. The coding scheme is not an arbitrary set of labels but a theoretically grounded codebook that ensures the validity of subsequent analyses. NLP then serves as an annotation tool, generating data that reflects the operationalized variables, with rigorous validation ensuring the accuracy of the annotations. Rather than merely describing the distributional properties of these annotations, statistical modeling techniques can be used to test a priori hypotheses derived from the theoretical framework. By comparing models on both statistical fit and theoretical plausibility, researchers can identify the most probable explanation for the observed relationships and thereby uncover the causal mechanisms at play.
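As an illustration of the validation step, the following is a minimal Python sketch, assuming a small human-coded gold standard exists for each codebook category; the category labels and the choice of Cohen's kappa as agreement metric are hypothetical illustrations, not prescriptions from this abstract:

```python
# Validation sketch: agreement between LLM annotations and a human-coded
# gold standard. Labels and categories below are hypothetical.
from sklearn.metrics import cohen_kappa_score

gold = ["economic", "social", "economic", "environmental", "social"]  # human coders
llm  = ["economic", "social", "economic", "social", "social"]         # LLM annotations

kappa = cohen_kappa_score(gold, llm)
print(f"Cohen's kappa: {kappa:.2f}")
# Only annotations with acceptable agreement would feed the statistical models.
```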
In this contribution, we exemplify this approach by using LLM-based information extraction to annotate disaster impacts in scientific papers, based on predefined, well-understood classes and their textual representations. We employ structural equation models (SEM), exploratory factor analysis (EFA), and regression to test models derived from the literature and to compare their probability given the data, demonstrating how this method produces robust, explainable results that go beyond the surface-level findings of exploratory approaches and move towards a deeper understanding of complex environmental phenomena. This integrated approach allows researchers not just to identify patterns in large textual datasets, but to understand the reasons behind them and to generate valid, reliable insights in environmental research.
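To make the model-comparison step concrete, here is a sketch assuming the validated annotations have been aggregated into a tabular dataset; the variable names, the synthetic data, and the two candidate regression specifications are hypothetical, and approximate posterior model probabilities are obtained from BIC weights under equal priors (SEM and EFA comparisons would proceed analogously on their respective fit statistics):

```python
# Sketch: comparing theory-derived models by their probability given the data
# via BIC weights (Schwarz approximation). All variable names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "impact_severity": rng.normal(size=n),
    "exposure": rng.normal(size=n),
    "vulnerability": rng.normal(size=n),
})
df["impact_severity"] += 0.5 * df["exposure"]  # synthetic signal for illustration

# Two a priori specifications derived from competing theoretical framings
m1 = smf.ols("impact_severity ~ exposure", data=df).fit()
m2 = smf.ols("impact_severity ~ exposure + vulnerability", data=df).fit()

bics = np.array([m1.bic, m2.bic])
weights = np.exp(-0.5 * (bics - bics.min()))
weights /= weights.sum()  # approximate P(model | data) under equal priors
for name, w in zip(["exposure only", "exposure + vulnerability"], weights):
    print(f"{name}: {w:.2f}")
```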
How to cite: Niekler, A., Carvalho, T. M. N., and de Brito, M. M.: Theoretical-Deductive Content Analysis of Text as Data in Environmental Research, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-4468, https://doi.org/10.5194/egusphere-egu25-4468, 2025.