EGU General Assembly 2023
© Author(s) 2023. This work is distributed under
the Creative Commons Attribution 4.0 License.

Curation of High-level Molecular Atmospheric Data for Machine Learning Purposes

Vitus Besel1, Milica Todorović2, Theo Kurtén1, Patrick Rinke3, and Hanna Vehkamäki1
  • 1Institute for Atmosphere and Earth System Research, University of Helsinki, Helsinki, Finland
  • 2University of Turku, Turku, Finland
  • 3Aalto University, Espoo, Finland

Cloud and aerosol interactions remain a large source of uncertainty in current climate models (IPCC), which makes them of special interest for atmospheric science. It is estimated that more than 70% of all cloud condensation nuclei originate from so-called New Particle Formation, the process in which gaseous precursors cluster together in the atmosphere and subsequently grow into particles and aerosols. After the initial clustering, this growth is driven strongly by the condensation of low-volatility organic compounds (LVOC), that is, molecules with saturation vapor pressures (pSat) below 10^-6 mbar [1]. These originate from organic molecules that are emitted by vegetation and then rapidly oxidized in the air, so-called biogenic LVOC (BLVOC).

We have created a large data set of BLVOC using high-throughput computing and Density Functional Theory (DFT), and we use it to train Machine Learning models that predict the pSat of previously unseen BLVOC. Figure 1 shows some sample molecules from the data.

Figure 1: Sample molecules of small, medium and large size.
Figure 2: Histogram of the calculated saturation vapor pressures.

Initially, the chemical mechanism GECKO-A provides possible BLVOC molecules in the form of SMILES strings. In a first step, the COSMOconf program finds and optimizes the structures of possible conformers and provides their liquid-phase energies at a DFT level of theory. After an additional calculation of the gas-phase energies with Turbomole, COSMOtherm calculates thermodynamic properties, such as the pSat, using the COSMO-RS [2] model. We combined all of these computations into a highly parallelised high-throughput workflow and used it to calculate 32k BLVOC, which comprise over 7 million molecular conformers. Figure 2 shows a histogram of the calculated pSat.
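The order of the stages and the per-molecule parallelism described above can be sketched in Python. This is only an illustration under stated assumptions: `run_stage` and `process_molecule` are hypothetical placeholders that do not call COSMOconf, Turbomole or COSMOtherm, the SMILES strings are generic examples rather than entries of the data set, and a thread pool stands in for whatever scheduler the actual workflow uses.

```python
from concurrent.futures import ThreadPoolExecutor

# Stage order as described in the text; the strings are labels only.
STAGES = (
    "conformer search and liquid-phase DFT energies (COSMOconf)",
    "gas-phase energies (Turbomole)",
    "pSat from COSMO-RS (COSMOtherm)",
)

def run_stage(stage, smiles):
    # Hypothetical placeholder: a real workflow would launch the
    # external program for this stage here and collect its output.
    return f"{stage}: finished for {smiles}"

def process_molecule(smiles):
    # Stages run sequentially for one molecule, because each stage
    # depends on the results of the previous one.
    return smiles, [run_stage(stage, smiles) for stage in STAGES]

# Illustrative SMILES strings (assumed, not taken from the data set).
smiles_batch = ["CCO", "CC(=O)O", "OCC(O)CO"]

# Molecules are independent of each other, so they can be processed
# in parallel; pool.map keeps the input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = dict(pool.map(process_molecule, smiles_batch))

for smiles, log in results.items():
    print(smiles, "->", len(log), "stages completed")
```

Because the molecules do not depend on one another, the same pattern scales from three inputs to a large batch by changing only the input list and the number of workers.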

We use the calculated pSat to train a Gaussian Process Regression (GPR) machine learning model, with the Topological Fingerprint as the descriptor for molecular structures. The GPR accounts for noise in the data and outputs an uncertainty for each pSat prediction. These uncertainties, combined with data clustering techniques, allow us to actively choose which molecules to include in the training data, so-called Active Learning. Further, we explore using SLISEMAP [3] explainable AI methods to relate the Machine Learning predictions and the high-dimensional descriptors to human-readable properties, such as functional groups.
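The uncertainty-driven selection described above can be illustrated with a minimal NumPy sketch of GPR. Everything specific here is an assumption made for illustration: random bit vectors stand in for topological fingerprints, the targets are synthetic log10(pSat) values, the squared-exponential kernel and its hyperparameters are arbitrary choices, and the clustering step is omitted. It is not the model used in this work.

```python
import numpy as np

def rbf_kernel(A, B, length_scale):
    """Squared-exponential kernel between the rows of A and the rows of B."""
    sq_dist = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)

def gpr_predict(X_train, y_train, X_query, noise=1e-2, length_scale=4.0):
    """Return the GPR posterior mean and standard deviation at X_query."""
    y_mean = y_train.mean()
    # The noise term on the diagonal models noisy training targets.
    K = rbf_kernel(X_train, X_train, length_scale) + noise * np.eye(len(X_train))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train - y_mean))
    K_star = rbf_kernel(X_train, X_query, length_scale)
    mean = K_star.T @ alpha + y_mean
    v = np.linalg.solve(L, K_star)
    var = 1.0 - (v**2).sum(axis=0)  # the prior variance of this kernel is 1
    return mean, np.sqrt(np.maximum(var, 0.0))

rng = np.random.default_rng(0)
# Assumed stand-ins for fingerprints: random 64-bit vectors.
X_train = rng.integers(0, 2, size=(20, 64)).astype(float)
X_pool = rng.integers(0, 2, size=(50, 64)).astype(float)
# Synthetic log10(pSat) targets, invented purely for this example.
y_train = -2.0 - 0.15 * X_train.sum(axis=1) + 0.1 * rng.standard_normal(20)

mean_pool, std_pool = gpr_predict(X_train, y_train, X_pool)

# Active learning step: pick the pool molecule the model is least sure about.
next_index = int(np.argmax(std_pool))
print("next molecule to label:", next_index)
```

In an active learning loop, the selected molecule would be passed to the expensive calculation, added to the training set, and the model refitted before the next selection.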

[1] Metzger, A. et al. Evidence for the role of organics in aerosol particle formation under atmospheric conditions. Proc. Natl. Acad. Sci. 107, 6646–6651, 10.1073/pnas.0911330107 (2010).
[2] Klamt, A. & Schüürmann, G. COSMO: a new approach to dielectric screening in solvents with explicit expressions for the screening energy and its gradient. J. Chem. Soc., Perkin Trans. 2 799–805, 10.1039/P29930000799 (1993).
[3] Björklund, A., Mäkelä, J. & Puolamäki, K. SLISEMAP: supervised dimensionality reduction through local explanations. Mach Learn (2022).

How to cite: Besel, V., Todorović, M., Kurtén, T., Rinke, P., and Vehkamäki, H.: Curation of High-level Molecular Atmospheric Data for Machine Learning Purposes, EGU General Assembly 2023, Vienna, Austria, 24–28 Apr 2023, EGU23-1135, 2023.
