EGU26-1897, updated on 13 Mar 2026
https://doi.org/10.5194/egusphere-egu26-1897
EGU General Assembly 2026
© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.
Poster | Tuesday, 05 May, 10:45–12:30 (CEST), Display time Tuesday, 05 May, 08:30–12:30
Hall X5, X5.331
The Unreliable Narrator: LSTM Internal States Fluctuate with Software Environments Despite Robust Predictions
Ryosuke Nagumo, Ross Woods, and Miguel Rico-Ramirez
  • University of Bristol, Faculty of Engineering, School of Civil, Aerospace and Design Engineering, Bristol, United Kingdom (ross.woods@bristol.ac.uk)

Since the robust performance of Long Short-Term Memory (LSTM) networks was established, their physics-awareness and interpretability have become central topics in hydrology. Seminal works (e.g., Lees et al. (2022)) have argued that LSTM internal states spontaneously capture hydrological concepts, and suggested that cell states can represent soil moisture dynamics despite not being explicitly trained on such data. Conversely, more recent studies (e.g., Fuente et al. (2024)) demonstrated that mathematical equifinality leads to non-unique LSTM representations under different initialisations.

In this work, we report an arguably more systematic "bug": the software environment itself causes instability in internal states. We initially aimed to investigate how internal states behave differently when trained with or without historical observation data. We encountered this issue while reassembling a computational stack and attempting to replicate the initial results, as the original Docker environment had not been preserved. While random seeds have been shown to lead to different internal state trajectories, we found that the computational backend (e.g., changing CUDA versions, PyTorch releases, or dependent libraries) also produces such differences. Our findings are:

  • In gauged catchments: Discharge predictions remained stable (in one catchment, NSE was 0.88 ± 0.01) across computational environments, yet the internal temporal variations (e.g., silhouette, mean, and std of cell states) fluctuated noticeably.
  • In pseudo-ungauged scenarios: The prediction performance itself became more sensitive to the computational environment (in the same catchment, NSE dropped to 0.31 ± 0.15), yet the internal temporal variations of the cell states fluctuated only as much as they did in the gauged scenario.
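For reference, the Nash–Sutcliffe efficiency (NSE) quoted above measures how much of the observed discharge variance a model explains. A minimal sketch of the standard formula follows; the function name `nse` is ours, and this is not the authors' evaluation code:

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus the ratio of the residual
    sum of squares to the variance of the observations."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)
```

A perfect simulation yields NSE = 1, while simply predicting the mean of the observations yields NSE = 0, so the drop from 0.88 to 0.31 reported above is a substantial loss of skill.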

These findings suggest that instability in the computational environment not only risks altering interpretability (by changing internal states during training) but also casts doubt on reliability in extrapolation (by changing outputs).

It is worth mentioning that we confirmed this is not a replicability issue; completely identical cell states and predictions are produced when the computational environment, seeds, and training data are held constant. We argue that such stability must be established as a standard benchmark before assigning physical meaning to deep learning internals.
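The replicability check described above can be illustrated schematically. The sketch below uses Python's standard-library RNG as a stand-in for a full training run; in an actual PyTorch workflow one would additionally fix `torch.manual_seed` and enable `torch.use_deterministic_algorithms(True)`, and the point of the abstract is that even this does not protect against changes in the backend itself:

```python
import random

def run_experiment(seed, n=5):
    """Stand-in for a training run: with the seed (and, by analogy, the
    software environment) held fixed, the produced 'internal states'
    are bit-identical across repeated runs."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

# Same seed, same environment: exactly reproducible.
assert run_experiment(42) == run_experiment(42)
# Changing the seed (analogous to changing the backend) diverges.
assert run_experiment(42) != run_experiment(43)
```

This distinction matters for the argument: the fluctuations reported here are not random-run noise but a deterministic function of the software stack, which is why the authors propose environment stability as a benchmark before interpreting internals physically.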

How to cite: Nagumo, R., Woods, R., and Rico-Ramirez, M.: The Unreliable Narrator: LSTM Internal States Fluctuate with Software Environments Despite Robust Predictions, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-1897, https://doi.org/10.5194/egusphere-egu26-1897, 2026.