Double machine learning for geosciences
- 1 Image Processing Laboratory (IPL), University of València, Valencia, Spain (kai.cohrs@uv.es)
- 2 Department of Biogeochemical Integration, Max Planck Institute for Biogeochemistry, Jena, Germany
Hybrid modeling combines parametric models with machine learning [1]. Parts of a parametric equation are replaced by non-parametric machine learning models, which can represent complex functions and are inferred from the data together with the parameters of the equation. Hybrid modeling promises to describe complex relationships while remaining scientifically interpretable. These promises, however, need to be taken with a grain of salt. With overly flexible models, such as deep neural networks, the problem of equifinality arises: there is no identifiable optimal solution. Instead, many solutions describe the data equally well, and which one we obtain is a matter of chance. Interpreting such a result may lead to erroneous conclusions. Moreover, studies have shown that regularization techniques can introduce a bias in jointly estimated physical parameters [1].
We propose double machine learning (DML) to solve these problems [2]. DML is a theoretically well-founded technique for fitting semi-parametric models, i.e., models consisting of a parametric and a non-parametric component. DML is widely used for debiased treatment effect estimation in economics. We showcase its use for geosciences on two problems related to carbon dioxide fluxes:
- Flux partitioning, which aims to separate the net carbon flux (net ecosystem exchange, NEE) into its main contributing gross fluxes, namely ecosystem respiration (RECO) and gross primary production (GPP).
- Estimation of Q10, the temperature sensitivity parameter of ecosystem respiration.
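To make the Q10 problem concrete, the following is a minimal sketch of DML with cross-fitting for a partially linear model, not the implementation used in this work. It assumes the common form RECO = Rb(X) · Q10^((T − Tref)/10), which becomes partially linear after taking logarithms: log RECO = log Rb(X) + log(Q10) · (T − Tref)/10. The synthetic data, the single driver x, the reference temperature of 15 °C, and the kernel smoother standing in for the machine learning models are all illustrative assumptions. The unadjusted regression at the end is only a simple confounded baseline, not the naive neural network approach referred to in this abstract.

```python
import numpy as np


def kernel_regression(x_train, y_train, x_test, bandwidth=0.2):
    """Nadaraya-Watson smoother with a Gaussian kernel.

    Stand-in for any flexible nuisance learner (assumption for this sketch).
    """
    diffs = (x_test[:, None] - x_train[None, :]) / bandwidth
    weights = np.exp(-0.5 * diffs**2)
    return (weights @ y_train) / weights.sum(axis=1)


def dml_partially_linear(y, d, x, n_folds=5, seed=0):
    """Estimate theta in y = theta * d + g(x) + noise with cross-fitted DML."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    y_res = np.empty(len(y))
    d_res = np.empty(len(d))
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        # Nuisance functions E[y|x] and E[d|x] are fitted on the other folds
        # and evaluated on the held-out fold (cross-fitting).
        y_res[test_idx] = y[test_idx] - kernel_regression(x[train], y[train], x[test_idx])
        d_res[test_idx] = d[test_idx] - kernel_regression(x[train], d[train], x[test_idx])
    # Residual-on-residual regression gives the debiased estimate of theta.
    return float(d_res @ y_res / (d_res @ d_res))


# Hypothetical synthetic data: x drives both temperature and base respiration,
# so temperature is confounded with log Rb(x).
rng = np.random.default_rng(42)
n = 2000
q10_true = 1.5
x = rng.uniform(0.0, 2.0 * np.pi, n)
temp = 15.0 + 8.0 * np.sin(x) + 3.0 * rng.normal(size=n)
d = (temp - 15.0) / 10.0                      # (T - Tref) / 10 with Tref = 15
log_rb = 0.5 + 0.8 * np.sin(x)                # log of base respiration Rb(x)
log_reco = log_rb + np.log(q10_true) * d + 0.1 * rng.normal(size=n)

q10_dml = float(np.exp(dml_partially_linear(log_reco, d, x)))

# Baseline that ignores x entirely: ordinary least squares of log RECO on d.
d_c = d - d.mean()
q10_naive = float(np.exp(d_c @ (log_reco - log_reco.mean()) / (d_c @ d_c)))

print(f"true Q10: {q10_true}, DML: {q10_dml:.3f}, unadjusted: {q10_naive:.3f}")
```

In this toy setting the unadjusted estimate absorbs the seasonal variation of Rb(x) into Q10, while the residual-on-residual step removes the part of temperature and respiration that x explains before estimating the sensitivity.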
First, we show that on synthetic data for Q10 estimation, DML consistently retrieves the true value of Q10, whereas the naive neural network approach fails. We further apply DML to the carbon flux partitioning problem and find that it 1) retrieves the true fluxes of synthetic data, even in the presence of strong (and more realistic) heteroscedastic noise, 2) retrieves the main gross carbon fluxes on real data, consistent with established methods, and 3) allows us to causally interpret the retrieved GPP as the direct effect of photosynthetically active radiation on NEE. In this way, the DML approach can be seen as a causally interpretable, semi-parametric version of the established daytime methods. We also investigate the functional relationships inferred with DML and the drivers modulating the obtained light-use efficiency function. In conclusion, DML offers a solid framework for developing hybrid and semi-parametric models and can be of widespread use in the geosciences.
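One way to read the flux partitioning problem as a semi-parametric model is sketched below. This is a hedged illustration under stated assumptions, not necessarily the exact formulation used in this work: the light-use efficiency LUE is assumed to depend only on the drivers X, NEE is taken as RECO minus GPP, and the symbols l, m, and ε are introduced here for illustration. The residualized objective in the last line is the standard extension of the partially linear DML estimator [2] to an effect that varies with X.

```latex
\begin{align*}
\mathrm{NEE} &= -\underbrace{\mathrm{LUE}(X)\cdot\mathrm{PAR}}_{\mathrm{GPP}}
               + \mathrm{RECO}(X) + \varepsilon,
               \qquad \mathbb{E}[\varepsilon \mid X, \mathrm{PAR}] = 0, \\
l(X) &= \mathbb{E}[\mathrm{NEE} \mid X], \qquad
m(X) = \mathbb{E}[\mathrm{PAR} \mid X], \\
\widehat{\mathrm{LUE}} &= \arg\min_{f} \sum_{i}
  \Big( \mathrm{NEE}_i - \hat{l}(X_i)
        + f(X_i)\,\big(\mathrm{PAR}_i - \hat{m}(X_i)\big) \Big)^{2}, \\
\widehat{\mathrm{RECO}}(X) &= \hat{l}(X) + \widehat{\mathrm{LUE}}(X)\,\hat{m}(X).
\end{align*}
```

Under these assumptions, GPP is the part of NEE attributed to variation in PAR after adjusting for X, and RECO follows from the conditional mean of NEE once the light-driven term is added back.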
[1] Reichstein, Markus, et al. “Combining system modeling and machine learning into hybrid ecosystem modeling.” Knowledge-Guided Machine Learning (2022). https://doi.org/10.1201/9781003143376-14
[2] Chernozhukov, Victor, et al. “Double/debiased machine learning for treatment and structural parameters.” The Econometrics Journal, Volume 21, Issue 1 (2018): C1–C68. https://doi.org/10.1111/ectj.12097
How to cite: Cohrs, K.-H., Varando, G., Reichstein, M., and Camps-Valls, G.: Double machine learning for geosciences, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-13462, https://doi.org/10.5194/egusphere-egu23-13462, 2023.