A multi-modal semi-supervised model for ocean sediment lithology

John M. Aiken; Dunyu Liu; William Gilpin; Thorsten Becker

doi:https://doi.org/10.5194/egusphere-egu26-1492

EGU26-1492, updated on 13 Mar 2026

EGU General Assembly 2026

© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.

John M. Aiken¹,², Dunyu Liu³, William Gilpin⁴, and Thorsten Becker³,⁵,⁶

¹University of Oslo, Njord Centre, Physics, Utrecht, Norway (johnm.aiken@gmail.com)
²Expert Analytics, Oslo, Norway
³Institute for Geophysics, Jackson School for Geosciences, The University of Texas at Austin, Texas, USA
⁴Department of Physics, The University of Texas at Austin, Austin, Texas, USA
⁵Department of Earth and Planetary Sciences, Jackson School for Geosciences, The University of Texas at Austin, Austin, Texas, USA
⁶Oden Institute for Computational Engineering & Sciences, The University of Texas at Austin, Austin, Texas, USA

Earth science data are typically highly heterogeneous, which leads to mixed-determined inverse problems and makes it difficult to extract process-level information. For example, ocean sediment cores from the International Ocean Discovery Program (IODP) contain hundreds of millions of measurements across multiple geophysical properties, but usable datasets are only 5–10% complete due to missing data. We present a semi-supervised variational autoencoder with masked encoding that simultaneously imputes missing measurements and predicts lithology, enabling more complete use of legacy IODP archives. We train the masked variational autoencoder on the LILY database (89 km of core, 34 million observations, 42 IODP missions) to learn joint distributions across bulk density, magnetic susceptibility, RGB reflectance, and natural gamma ray attenuation. The model uses selective masking during training to learn imputation strategies for missing modalities. Crucially, the learned latent representations are constrained to recover lithological labels from unseen cores without retraining. We demonstrate that the model captures the nonlinearities in the training data, reconstructs the test data (R²_avg = 0.86), and predicts the lithology of those data (AUC_avg = 0.9), while also providing descriptive embedding vectors (ARI = 0.2). The underlying data contain strong nonlinear relationships that simpler models fail to capture on reconstruction (e.g., a typical LASSO-based regression reaches R² = 0.24). Our work represents a step towards scalable cross-modal assimilation and representation of existing Earth science datasets.
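The selective masking step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the modality keys, the `select_mask` and `masked_squared_error` helpers, and the `drop_prob` parameter are all hypothetical names chosen for illustration, and the abstract does not specify how entries are selected for masking. The sketch assumes that, during training, some observed measurements are hidden on purpose so that a reconstruction error can be scored on values whose true measurement is known.

```python
import random

# Hypothetical modality keys, loosely named after the properties in the abstract.
MODALITIES = [
    "bulk_density",
    "magnetic_susceptibility",
    "rgb_reflectance",
    "natural_gamma_ray",
]


def select_mask(row, drop_prob, rng):
    """Hide a random subset of the observed entries of one sample.

    `row` maps modality name -> float, or None where the measurement is missing.
    Returns (model_input, hidden): the input with extra entries set to None, and
    the set of modality names hidden on purpose (the imputation targets).
    """
    observed = [m for m in MODALITIES if row.get(m) is not None]
    hidden = {m for m in observed if rng.random() < drop_prob}
    # Assumption: always leave at least one observed modality visible.
    if observed and len(hidden) == len(observed):
        hidden.discard(rng.choice(observed))
    model_input = {m: (None if m in hidden else row.get(m)) for m in MODALITIES}
    return model_input, hidden


def masked_squared_error(prediction, row, hidden):
    """Mean squared error over the deliberately hidden entries only."""
    if not hidden:
        return 0.0
    return sum((prediction[m] - row[m]) ** 2 for m in sorted(hidden)) / len(hidden)


# Example: one sample in which RGB reflectance was never measured.
rng = random.Random(0)
sample = {
    "bulk_density": 1.8,
    "magnetic_susceptibility": 12.0,
    "rgb_reflectance": None,
    "natural_gamma_ray": 30.0,
}
model_input, hidden = select_mask(sample, drop_prob=0.5, rng=rng)
```

Scoring only the hidden entries keeps genuinely missing values (here, RGB reflectance) out of the loss, since no ground truth exists for them. In a full model, `prediction` would come from the decoder of the variational autoencoder rather than being supplied by hand.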

How to cite: Aiken, J. M., Liu, D., Gilpin, W., and Becker, T.: A multi-modal semi-supervised model for ocean sediment lithology, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-1492, https://doi.org/10.5194/egusphere-egu26-1492, 2026.