- 1: Oak Ridge National Laboratory, National Center for Computational Sciences, Oak Ridge, United States of America (anantharajvg@ornl.gov)
- 2: Information Engineering and Computer Science Department, University of Trento, Italy
AI foundation models have already demonstrated their potential across a wide range of science application domains. They derive their power from large volumes of data, exploited by computational methods that consume unprecedented amounts of compute power. We are inundated with data, yet have managed to exploit only a small fraction of it. The Earth System Grid Federation (ESGF) hosts a data collection of nearly 16 PB from the Coupled Model Intercomparison Project (CMIP6), expected to grow 5 to 10 times larger in the CMIP7 era. The NASA Earth Observing System Data and Information System (EOSDIS) archive is expected to exceed 600 PB by 2030.
AI-enabled solutions will require integrating multimodal data while remaining cognizant of the energy footprint introduced by both the data and the computational methods. Currently, the energy consumption of transformer-based foundation models scales with the amount of training data and the corresponding model size. This impediment needs to be mitigated by developing data-efficient methods that also deliver energy efficiency across all scales. The research community offers little guidance on developing a computational plan for the optimal use of resources when building foundation models from multimodal scientific data. Benchmarks based on LLM scaling laws remain insufficient for the vision transformers (ViTs) commonly adopted for geoscientific applications, as illustrated by the sketch below. We need a suite of community benchmarks based on ViT backbones and other methods at different scales to identify energy-efficient methods for different classes of science problems.
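As a rough illustration of how training cost scales with data and model size, the following sketch applies the ~6·N·D FLOPs heuristic from the LLM scaling literature and converts the result to energy. All hardware figures are illustrative assumptions, not measurements, and such LLM-derived rules may not transfer directly to ViT workloads; quantifying that gap is exactly what community benchmarks would address.

```python
# Illustrative estimate of transformer training cost via the common
# ~6*N*D FLOPs heuristic (N = parameters, D = training tokens/samples).
# Hardware figures are assumptions; the heuristic comes from LLM scaling
# studies and may not hold for ViT-based geoscience workloads.

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training FLOPs for a dense transformer."""
    return 6.0 * n_params * n_tokens

def energy_kwh(flops: float,
               device_flops_per_s: float = 1.0e14,  # assumed sustained rate
               device_power_w: float = 500.0,       # assumed power draw
               utilization: float = 0.4) -> float:  # assumed efficiency
    """Convert a FLOP budget to energy in kWh."""
    seconds = flops / (device_flops_per_s * utilization)
    return seconds * device_power_w / 3.6e6         # joules -> kWh

# Energy scales linearly with training data: halving the data roughly
# halves the cost, which is the lever that data efficiency pulls.
for fraction in (1.0, 0.5, 0.1):
    f = training_flops(n_params=3.0e8, n_tokens=fraction * 1.0e10)
    print(f"data fraction {fraction:4.0%} -> ~{energy_kwh(f):6.1f} kWh")
```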
Relatively few studies have focused on data efficiency for training science foundation models. We have adopted a smart sampling approach that extracts the most informative samples, an effective means of significantly reducing the volume of training data. We trained two ViT models, one with all available MODIS data over the ocean and another using an intelligently sampled subset, and applied both models to classify clouds over the ocean. Our preliminary results indicate that reasonably accurate models can be trained with only a fraction of the total training data. Reductions in training data translate directly into gains in energy efficiency.
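As one plausible illustration of intelligent sampling (an assumption for exposition, not necessarily the method used in this study), the sketch below performs diversity sampling: cluster sample embeddings with k-means and keep only the sample nearest each centroid, so that a small subset still spans the variability of the full archive. The embedding array and subset size are hypothetical placeholders.

```python
# Sketch of one possible "smart sampling" strategy: k-means diversity
# sampling over sample embeddings. This is an illustrative stand-in, not
# necessarily the sampling method used in the study described above.
import numpy as np
from sklearn.cluster import KMeans

def diversity_sample(embeddings: np.ndarray, n_keep: int) -> np.ndarray:
    """Return indices of up to n_keep representative samples."""
    km = KMeans(n_clusters=n_keep, n_init=1, random_state=0)  # single init for speed
    km.fit(embeddings)
    dists = km.transform(embeddings)        # (n_samples, n_clusters) distances
    return np.unique(dists.argmin(axis=0))  # closest sample to each centroid

# Hypothetical example: reduce 10,000 image-patch embeddings (e.g., from an
# encoder applied to MODIS ocean scenes) to roughly 500 training samples.
emb = np.random.default_rng(0).normal(size=(10_000, 128)).astype(np.float32)
idx = diversity_sample(emb, n_keep=500)
print(f"kept {idx.size} of {emb.shape[0]} samples")
```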
How to cite: Anantharaj, V., Kurihana, T., Padovani, G., and Fiore, S.: Data efficiency: The master key for unlocking energy efficiency, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-14068, https://doi.org/10.5194/egusphere-egu25-14068, 2025.