EGU25-20123, updated on 15 Mar 2025
https://doi.org/10.5194/egusphere-egu25-20123
EGU General Assembly 2025
© Author(s) 2025. This work is distributed under
the Creative Commons Attribution 4.0 License.
Automatic annotation following the I-ADOPT framework
José Manuel Gómez Pérez and Andrés García
  • expert.ai, Language Technology Research Laboratory (jmgomez@expert.ai)

Fulfilling the FAIR principles is a central requirement in modern research. Data findability and reusability depend heavily on the quality and interoperability of their metadata. In the earth and environmental sciences, FAIR metadata should, among other attributes, ensure consistent and uniquely referenceable naming of geoscientific variables that supports machine-interpretable semantic annotations. In practice, however, most terminologies used to describe datasets and observed variables vary widely in their granularity, quality, governance and interconnectivity, which in turn limits their interoperability. The RDA-endorsed I-ADOPT Framework addresses this issue by breaking descriptions of observed variables down into five well-defined atomic components (ObjectOfInterest, Property, Matrix, Constraint and Context), anticipating their annotation with generic terms from FAIR semantic artefacts. As of today, the I-ADOPT decomposition is still a largely manual process that requires both semantic and domain skills. Here, we propose applying Large Language Models (LLMs) to transform scientific terms into I-ADOPT-aligned descriptions. Such a model will enable the transformation into machine-interpretable representations directly from natural language descriptions of observational research provided by domain experts. We will leverage the existing set of high-quality, human-made formalizations of I-ADOPT variables to adapt the LLM to this task. We will consider zero-shot scenarios, where the LLM is used in its pretrained form, and in-context learning, where the LLM sees a few examples of the task. We will also consider training a specialist LLM, further fine-tuned for this task, although the success of this approach depends on the amount of training data available.
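To make the approach concrete, the targeted decomposition and the in-context-learning setup can be sketched as follows. This is an illustrative sketch only: the data structure, function names, prompt wording, and the example decomposition (based on the commonly cited "concentration of nitrate in lake water" pattern) are our own assumptions, not artefacts of the framework itself.

```python
from dataclasses import dataclass, field

# Hypothetical container mirroring the five atomic I-ADOPT components.
@dataclass
class IAdoptVariable:
    object_of_interest: str                 # entity being observed, e.g. "nitrate"
    property: str                           # observed characteristic, e.g. "concentration"
    matrix: str = ""                        # medium the object is embedded in, if any
    constraints: list = field(default_factory=list)  # conditions narrowing the observation
    contexts: list = field(default_factory=list)     # ambient conditions of the observation

def build_few_shot_prompt(examples, term):
    """Assemble an in-context-learning prompt from human-made decompositions.

    `examples` is a list of (term, IAdoptVariable) pairs drawn from existing
    high-quality formalizations; `term` is the new variable to decompose.
    """
    lines = ["Decompose each variable description into I-ADOPT components.", ""]
    for t, v in examples:
        lines.append(f"Term: {t}")
        lines.append(
            f"ObjectOfInterest: {v.object_of_interest}; "
            f"Property: {v.property}; Matrix: {v.matrix or 'none'}"
        )
        lines.append("")
    lines.append(f"Term: {term}")  # the LLM is expected to complete the decomposition
    return "\n".join(lines)

# One worked example, then a new term for the LLM to decompose.
examples = [(
    "concentration of nitrate in lake water",
    IAdoptVariable(object_of_interest="nitrate",
                   property="concentration",
                   matrix="lake water"),
)]
prompt = build_few_shot_prompt(examples, "temperature of sea water")
```

In the zero-shot scenario the `examples` list would simply be empty, leaving only the instruction and the new term; the fine-tuning scenario would instead use such (term, decomposition) pairs as training data.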
To develop this model and a first demonstrator, we will build on our previous experience in developing the I-ADOPT Framework, in transfer learning and fine-tuning neural networks, FAIR data stewardship, research data infrastructures, and research software engineering. Our project will be further linked to several other ongoing activities and initiatives at both national and European level, allowing us to evaluate the performance of our LLM directly with potential end-users and communities. Such a service will be integrated into platforms like RoHub to help scientists make research datasets FAIR.

How to cite: Gómez Pérez, J. M. and García, A.: Automatic annotation following the I-ADOPT framework, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-20123, https://doi.org/10.5194/egusphere-egu25-20123, 2025.