EGU23-13462, updated on 10 Jan 2024
https://doi.org/10.5194/egusphere-egu23-13462
EGU General Assembly 2023
© Author(s) 2024. This work is distributed under
the Creative Commons Attribution 4.0 License.

Double machine learning for geosciences

Kai-Hendrik Cohrs1, Gherardo Varando1, Markus Reichstein2, and Gustau Camps-Valls1
Kai-Hendrik Cohrs et al.
  • 1Image Processing Laboratory (IPL), University of València, Valencia, Spain (kai.cohrs@uv.es)
  • 2Department of Biogeochemical Integration, Max Planck Institute for Biogeochemistry, Jena, Germany

Hybrid modeling describes the synergy between parametric models and machine learning [1]. Parts of a parametric equation are substituted by non-parametric machine learning models, which can then represent complex functions. These are inferred together with the parameters of the equation from the data. Hybrid modeling promises to describe complex relationships and to be scientifically interpretable. These promises, however, need to be taken with a grain of salt. With too flexible models, such as deep neural networks, the problem of equifinality arises: There is no identifiable optimal solution. Instead, many outcomes describe the data equally well, and we will obtain one of them by chance. Interpreting the result may lead to erroneous conclusions. Moreover, studies have shown that regularization techniques can introduce a bias on jointly estimated physical parameters [1].

We propose double machine learning (DML) to solve these problems [2]. DML is a theoretically well-founded technique for fitting semi-parametric models, i.e., models consisting of a parametric and a non-parametric component. DML is widely used for debiased treatment effect estimation in economics. We showcase its use for geosciences on two problems related to carbon dioxide fluxes: 

  • Flux partitioning, which aims at separating the net carbon flux (NEE) into its main contributing gross fluxes, namely, RECO and GPP.
  • Estimation of the temperature sensitivity parameter of ecosystem respiration Q10.

First, we show that in the case of synthetic data for Q10 estimation, we can consistently retrieve the true value of Q10 where the naive neural network approach fails. We further apply DML to the carbon flux partitioning problem and find that it is 1) able to retrieve the true fluxes of synthetic data, even in the presence of strong (and more realistic) heteroscedastic noise, 2) retrieves main gross carbon fluxes on real data consistent with established methods, and 3) allows us to causally interpret the retrieved GPP as the direct effect of the photosynthetically active radiation on NEE. This way, the DML approach can be seen as a causally interpretable, semi-parametric version of the established daytime methods. We also investigate the functional relationships inferred with DML and the drivers modulating the obtained light-use efficiency function. In conclusion, DML offers a solid framework to develop hybrid and semiparametric modeling and can be of widespread use in geosciences.

 

[1] Reichstein, Markus, et al. “Combining system modeling and machine learning into hybrid ecosystem modeling.” Knowledge-Guided Machine Learning (2022). https://doi.org/10.1201/9781003143376-14

[2] Chernozhukov, Victor, et al. “Double/debiased machine learning for treatment and structural parameters.” The Econometrics Journal, Volume 21, Issue 1, 1 (2018): C1–C68. https://doi.org/10.1111/ectj.12097

How to cite: Cohrs, K.-H., Varando, G., Reichstein, M., and Camps-Valls, G.: Double machine learning for geosciences, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-13462, https://doi.org/10.5194/egusphere-egu23-13462, 2023.