- 1University of Oslo, Njord Centre, Physics, Oslo, Norway (johnm.aiken@gmail.com)
- 2Expert Analytics, Oslo, Norway
- 3Institute for Geophysics, Jackson School for Geosciences, The University of Texas at Austin, Texas, USA
- 4Department of Physics, The University of Texas at Austin, Austin, Texas, USA
- 5Department of Earth and Planetary Sciences, Jackson School for Geosciences, The University of Texas at Austin, Austin, Texas, USA
- 6Oden Institute for Computational Engineering & Sciences, The University of Texas at Austin, Austin, Texas, USA
Earth science data are typically highly heterogeneous, which leads to mixed-determined inverse problems and makes it difficult to extract process-level information. For example, ocean sediment cores from the International Ocean Discovery Program (IODP) contain hundreds of millions of measurements across multiple geophysical properties, but usable datasets are only 5-10% complete due to missing data. We present a semi-supervised variational autoencoder with masked encoding that simultaneously imputes missing measurements and predicts lithology, enabling more complete use of legacy IODP archives. We train the model on the LILY database (89 km of core, 34 million observations, 42 IODP missions) to learn joint distributions across bulk density, magnetic susceptibility, RGB reflectance, and natural gamma ray attenuation. Selective masking during training teaches the model to impute missing modalities. Crucially, the learned latent representations are constrained to recover lithological labels from unseen cores without retraining. We demonstrate that the model captures the nonlinearities in the training data, reconstructs the test data (R²_avg=0.86), and predicts the lithology of those data (AUC_avg=0.9), while also providing descriptive embedding vectors (ARI=0.2). The underlying data contain strong nonlinear relationships that simpler models fail to capture in reconstruction (e.g., a typical LASSO-based regression reaches R²=0.24). Our work represents a step towards scalable cross-modal assimilation and representation of existing Earth science datasets.
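The selective masking idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`make_training_mask`, `masked_mse`), the feature ordering, the hide probability, and the zero-fill-plus-flags input encoding are all assumptions chosen for illustration. The sketch hides a random subset of the observed measurements from the model input and scores a reconstruction only on entries that have a ground-truth value.

```python
import math
import random


def make_training_mask(row, hide_prob, rng):
    """Build a masked model input for one sample (hypothetical helper).

    row: list of floats, with math.nan marking a missing measurement.
    Returns (inputs, observed, hidden).
    """
    observed = [not math.isnan(v) for v in row]
    # Hide a random subset of the observed entries so a model would have to impute them.
    hidden = [obs and rng.random() < hide_prob for obs in observed]
    visible = [obs and not hid for obs, hid in zip(observed, hidden)]
    # Zero-fill everything the encoder may not see, then append visibility flags
    # so that a true zero can be told apart from a masked value.
    values = [v if vis else 0.0 for v, vis in zip(row, visible)]
    flags = [float(vis) for vis in visible]
    return values + flags, observed, hidden


def masked_mse(row, reconstruction, observed):
    """Mean squared error over entries that have a ground-truth value."""
    errors = [(r - v) ** 2 for v, r, obs in zip(row, reconstruction, observed) if obs]
    return sum(errors) / len(errors) if errors else 0.0


rng = random.Random(0)
# Hypothetical standardized sample:
# [bulk density, magnetic susceptibility, R, G, B, natural gamma]
sample = [0.3, math.nan, -1.2, -0.9, -1.1, math.nan]
inputs, observed, hidden = make_training_mask(sample, hide_prob=0.5, rng=rng)

# Stand-in for a decoder output; a trained model would supply this.
reconstruction = [0.0] * len(sample)
loss = masked_mse(sample, reconstruction, observed)
print(f"masked reconstruction loss: {loss:.4f}")
```

In a full semi-supervised model, a reconstruction term of this kind would be combined with a KL regularizer on the latent variables and a classification loss on the lithology labels; the weighting between those terms is not specified in the abstract.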
How to cite: Aiken, J. M., Liu, D., Gilpin, W., and Becker, T.: A multi-modal semi-supervised model for ocean sediment lithology, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-1492, https://doi.org/10.5194/egusphere-egu26-1492, 2026.