Enabling Seamless Provenance Collection in Large-Scale Machine Learning Tasks

Sandro Fiore; Gabriele Padovani; Takuya Kurihana; Massimiliano Fronza; Valentine Anantharaj

doi:https://doi.org/10.5194/egusphere-egu25-19892

Session ESSI1.9

EGU25-19892, updated on 15 Mar 2025


EGU General Assembly 2025

© Author(s) 2025. This work is distributed under
the Creative Commons Attribution 4.0 License.


Sandro Fiore¹, Gabriele Padovani¹, Takuya Kurihana², Massimiliano Fronza¹, and Valentine Anantharaj²


¹University of Trento, Department of Information Engineering and Computer Science, Italy
²Computational Sciences and Engineering Directorate, Oak Ridge National Laboratory, USA

The growing interest in deep learning and large language models (LLMs) in recent years reflects their remarkable adaptability and ability to generalize, which has drawn researchers from a wide array of disciplines. Despite their promise, these advances have in many instances exposed a lack of transparency and rigor in development processes. The rapid pace of research undoubtedly offers numerous benefits, but it has also led to a growing number of works carried out superficially and without rigor. Undocumented code and irreproducible results inevitably cause confusion among researchers and undermine trust in the proposed work. The complexity of data manipulation, characterized by ad hoc transformations, exacerbates these issues by hindering the traceability of processes. Hyperparameter tuning introduces additional difficulties, since it requires repeated experimentation that consumes excessive computational resources, especially for large models.

To address these challenges, we introduce yProv4ML, a Python library that provides an accessible way to track dataset and model statistics, hyperparameters, and energy metrics. It supports the comparison of sets of experiments and introduces a suite of directives for tracking the flow of information through provenance metadata.
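As a rough illustration of what tracking hyperparameters and metrics and then comparing experiments involves, the following is a minimal sketch in plain Python. It does not use the yProv4ML API: `RunRecord`, `differing_params`, and `build_demo_runs` are hypothetical names, and the logged loss and energy values are placeholders chosen only for the example.

```python
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    """Hypothetical in-memory record of one training run (not the yProv4ML API)."""
    name: str
    params: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)  # metric name -> [(step, value), ...]

    def log_param(self, key, value):
        self.params[key] = value

    def log_metric(self, key, value, step):
        self.metrics.setdefault(key, []).append((step, value))

    def last(self, key):
        """Return the most recently logged value of a metric."""
        return self.metrics[key][-1][1]


def differing_params(runs):
    """Return only the hyperparameters whose values differ across the runs."""
    keys = sorted({k for r in runs for k in r.params})
    return {
        k: {r.name: r.params.get(k) for r in runs}
        for k in keys
        if len({repr(r.params.get(k)) for r in runs}) > 1
    }


def build_demo_runs():
    """Create two toy runs that differ only in their learning rate."""
    demo = [("run_a", 1e-3, [1.0, 0.5]), ("run_b", 1e-4, [1.0, 0.8])]
    runs = []
    for name, lr, losses in demo:
        run = RunRecord(name)
        run.log_param("learning_rate", lr)
        run.log_param("batch_size", 32)
        for step, loss in enumerate(losses):
            run.log_metric("loss", loss, step)
            # Placeholder energy value; a real tracker would query a power meter.
            run.log_metric("energy_joules", 10.0 * (step + 1), step)
        runs.append(run)
    return runs
```

Calling `differing_params(build_demo_runs())` isolates the learning rate as the only setting that changed between the two runs, which is the kind of question a comparison across sets of experiments is meant to answer.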

yProv4ML is a component of the yProv framework, a research project on multi-level provenance management that provides scientists with a rich software ecosystem consisting of a web service to manage, track, and analyze provenance documents. By leveraging the PROV-JSON standard for recording provenance artifacts, yProv4ML ensures comprehensive documentation and reproducibility, while its integration process is similar to that of well-known libraries such as MLflow.
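To make the role of PROV-JSON concrete, the sketch below builds a small provenance document for a single training run using only the Python standard library. The top-level keys (`prefix`, `activity`, `entity`, `used`, `wasGeneratedBy`) and the `prov:` attributes follow the W3C PROV-JSON serialization. Everything else is an assumption made for illustration: the `ex` namespace, the identifier scheme, the attribute names, and the function `training_run_to_prov_json` are hypothetical and do not reflect the documents yProv4ML actually emits.

```python
import json
from datetime import datetime, timezone


def training_run_to_prov_json(run_id, params, final_metrics, started, ended):
    """Describe one training run as a PROV-JSON style dictionary.

    The run is an activity that used a dataset entity and generated a
    model entity. Identifiers and the 'ex' namespace are illustrative.
    """
    ns = "ex"
    activity = f"{ns}:{run_id}"
    dataset = f"{ns}:{run_id}_dataset"
    model = f"{ns}:{run_id}_model"
    return {
        "prefix": {ns: "http://example.org/"},
        "activity": {
            activity: {
                "prov:startTime": started.isoformat(),
                "prov:endTime": ended.isoformat(),
                # Hyperparameters are attached as attributes of the activity.
                **{f"{ns}:param_{k}": v for k, v in params.items()},
            }
        },
        "entity": {
            dataset: {"prov:label": "training dataset"},
            model: {
                "prov:label": "trained model",
                # Final metrics are attached as attributes of the model.
                **{f"{ns}:metric_{k}": v for k, v in final_metrics.items()},
            },
        },
        "used": {
            "_:u1": {"prov:activity": activity, "prov:entity": dataset}
        },
        "wasGeneratedBy": {
            "_:g1": {"prov:entity": model, "prov:activity": activity}
        },
    }


if __name__ == "__main__":
    doc = training_run_to_prov_json(
        "run_a",
        {"learning_rate": 0.001},
        {"loss": 0.5},
        datetime(2025, 1, 1, 0, 0, tzinfo=timezone.utc),
        datetime(2025, 1, 1, 1, 0, tzinfo=timezone.utc),
    )
    print(json.dumps(doc, indent=2))
```

Because the result is ordinary JSON, it can be stored next to the model checkpoint or sent to a provenance service for later querying.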

Over the last year, yProv4ML was integrated into a variety of use cases across different domains (namely climate science, high energy physics, and Earth observation) in the context of the interTwin (https://www.intertwin.eu/) and ClimateEurope2 (https://climateurope2.eu/) EU projects, as well as the ICSC Italian National Project (https://www.supercomputing-icsc.it/en/icsc-home/). In these use cases, the collection of provenance data not only facilitated the reproducibility of experiments but also helped diagnose performance bottlenecks and ensure the reliability and integrity of results, all of which are critical to advancing large-scale ML in a trustworthy manner.

How to cite: Fiore, S., Padovani, G., Kurihana, T., Fronza, M., and Anantharaj, V.: Enabling Seamless Provenance Collection in Large-Scale Machine Learning Tasks, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-19892, https://doi.org/10.5194/egusphere-egu25-19892, 2025.