EGU24-1375, updated on 08 Mar 2024
https://doi.org/10.5194/egusphere-egu24-1375
EGU General Assembly 2024
© Author(s) 2024. This work is distributed under
the Creative Commons Attribution 4.0 License.

Acid sulfate soil mapping in western Finland: How to work with imbalanced datasets and machine learning

Virginia Estévez1, Stefan Mattbäck2,3, Anton Boman3, Pauliina Liwata-Kenttälä3, Kaj-Mikael Björk1, and Peter Österholm2
  • 1Arcada University of Applied Sciences, Finland (estevezv@arcada.fi)
  • 2Geology and Mineralogy, Åbo Akademi University, Finland
  • 3Geological Survey of Finland, Finland

One of the main challenges in digital soil mapping is that the datasets used for soil classification are imbalanced. With such datasets, machine learning techniques tend to overestimate the majority classes and underestimate the minority ones. In general, this produces maps with poor precision and unrealistic results. Using these maps for land-use decision-making can have dire consequences. This is the case for acid sulfate (AS) soils, a type of harmful soil that can cause serious environmental damage when drained for agricultural or forestry activities. In the study area, the probability of finding AS soils is very high. Furthermore, some of the most hazardous AS soils in Finland are located there [1]. Therefore, high-precision maps are needed to avoid environmental damage. Since the dataset for this region is highly imbalanced, the first step in creating accurate maps is to balance the dataset. Although most soil class datasets in nature are imbalanced, this problem has hardly been studied. In this work, we analyze different techniques to address the problem of imbalanced datasets. The methods considered to balance the dataset are undersampling, oversampling, and a combination of both. For the oversampling of the minority class, we create artificial samples from the Quaternary geological map. The modeling method is Random Forest, one of the best methods for the classification of AS soils [2,3]. Balancing the dataset improves the performance of the model in all the studied cases, with the values of the metrics for both classes above 80%. Furthermore, we create AS soil probability maps for the four balanced datasets and the imbalanced dataset. A detailed comparison between the maps is made. In addition, the extent of the AS soils obtained in all the cases is compared with the extent of the AS soils in the conventionally produced occurrence map [1].
The modeled probability maps created from the balanced datasets have high precision. The results of this study confirm the importance of balancing the dataset to improve the prediction and classification of AS soils.
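The three balancing strategies named above (undersampling, oversampling, and their combination) can be illustrated with a minimal Python sketch. It is not the procedure used in this work: the toy labels, the class counts, the function name `rebalance_indices`, and the target sizes are all hypothetical, and the oversampling here simply repeats existing minority rows at random, whereas the study creates artificial samples from the Quaternary geological map. The selected row indices would then be used to build the training set for a classifier such as Random Forest.

```python
import random
from collections import Counter, defaultdict


def rebalance_indices(labels, target_per_class, seed=0):
    """Return row indices in which every class appears target_per_class times.

    Classes larger than the target are undersampled (random subset, no repeats);
    classes smaller than the target are oversampled (all rows kept, plus random
    repeats). This is plain random resampling, an assumption of this sketch.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    selected = []
    for indices in by_class.values():
        if len(indices) >= target_per_class:
            # Undersampling: keep a random subset of the class.
            selected.extend(rng.sample(indices, target_per_class))
        else:
            # Oversampling: keep every row, then repeat randomly chosen rows.
            selected.extend(indices)
            extra = target_per_class - len(indices)
            selected.extend(rng.choices(indices, k=extra))
    rng.shuffle(selected)
    return selected


# Hypothetical toy dataset: 90 non-AS sites and 10 AS sites.
labels = ["non-AS"] * 90 + ["AS"] * 10
counts = Counter(labels)
n_min, n_maj = min(counts.values()), max(counts.values())

under = rebalance_indices(labels, n_min)                    # shrink the majority
over = rebalance_indices(labels, n_maj)                     # grow the minority
combined = rebalance_indices(labels, (n_min + n_maj) // 2)  # meet in the middle

for name, idx in [("under", under), ("over", over), ("combined", combined)]:
    print(name, Counter(labels[i] for i in idx))
```

Setting the target to the minority count gives pure undersampling, setting it to the majority count gives pure oversampling, and any value in between combines the two. The midpoint used here is an arbitrary choice for illustration.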

[1] Geological Survey of Finland. Acid Sulfate Soils–map services. http://gtkdata.gtk.fi/hasu/index.html

[2] V. Estévez et al. 2022. “Machine learning techniques for acid sulfate soil mapping in southeastern Finland”. Geoderma 406, 115446.

[3] V. Estévez et al. 2023. “Improving prediction accuracy for acid sulfate soil mapping by means of variable selection”. Front. Environ. Sci. 11:1213069. doi: 10.3389/fenvs.2023.1213069

How to cite: Estévez, V., Mattbäck, S., Boman, A., Liwata-Kenttälä, P., Björk, K.-M., and Österholm, P.: Acid sulfate soil mapping in western Finland: How to work with imbalanced datasets and machine learning, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-1375, https://doi.org/10.5194/egusphere-egu24-1375, 2024.