EGU26-15955, updated on 14 Mar 2026
https://doi.org/10.5194/egusphere-egu26-15955
EGU General Assembly 2026
© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.
Poster | Tuesday, 05 May, 16:15–18:00 (CEST), Display time Tuesday, 05 May, 14:00–18:00
Hall X4, X4.32
LLM Workflow for Land-Use Prediction Evidence Synthesis: Efficient Screening, Selective Refusals, Reportable Gaps
Ahmed Derdouri and Yoshifumi Masago
  • Center for Climate Change Adaptation, National Institute for Environmental Studies, Tsukuba, Japan (derdouri.ahmed@nies.go.jp)

Large language models (LLMs) promise to make systematic reviews more scalable and less costly, but the validity of LLM-assisted evidence synthesis depends not only on accuracy, but also on which parts of the literature are effectively visible to a deployed model and how reliably they are interpreted. We report a large-scale, domain-specific evaluation of an end-to-end LLM-assisted workflow for a systematic review of national-scale land use/land cover (LULC) prediction research (11,817 records; 11,688 after de-duplication), using a single hosted LLM deployment (Qwen Max) as a concrete case study. At title–abstract screening, the model behaved as a recall-oriented filter, excluding 9,891/11,688 records (84.7%) and routing 1,797 records for human follow-up; compared with the human baseline, it excluded fewer studies (84.7% vs 91.8%) and shifted more records into OK and POSSIBLE (OK: 4.2% vs 1.5%; POSSIBLE: 10.6% vs 5.5%). For full-text extraction, structured fields showed high agreement with expert coding across 342 benchmark papers (mean scores: 0.84 categorical, 0.85 temporal, 0.87 set-based), whereas free-text summaries were more variable (mean 0.79 overall; cosine similarity 0.51–0.87 across narrative fields despite high BERT-F1). In our case study, the workflow was completed in approximately one day on a single workstation for ~US$106 in API costs. Critically, full-text processing also produced explicit refusals: 7/2,084 candidate papers in deep screening and 2/345 papers targeted for insight extraction were blocked as “sensitive” geopolitical content. Although rare, these refusals were non-random and concentrated in contested regions, illustrating how LLM-specific constraints can introduce structured missingness that systematically removes or misinterprets evidence in precisely those settings where land-use conflict and governance are most salient. LLM-assisted reviews can therefore make previously prohibitive syntheses tractable. 
However, they must be embedded in transparent, human-led workflows that monitor and log model failures, including refusals, omissions, and misreadings, and that apply targeted auditing to detect and correct systematic blind spots.
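The auditable screening loop recommended above could be sketched as follows. This is a minimal illustrative sketch, not the authors' actual pipeline: the label set (OK / POSSIBLE / EXCLUDE, plus an explicit REFUSED state for model refusals) follows the abstract, but the function names and data structures are assumptions.

```python
from dataclasses import dataclass, field

# Screening labels from the abstract (OK / POSSIBLE / EXCLUDE), plus an
# explicit REFUSED state so refusals are logged rather than silently dropped.
LABELS = {"OK", "POSSIBLE", "EXCLUDE", "REFUSED"}

@dataclass
class AuditLog:
    decisions: dict = field(default_factory=dict)  # record_id -> label
    refusals: list = field(default_factory=list)   # record_ids blocked by the model

    def record(self, record_id: str, label: str) -> None:
        assert label in LABELS, f"unexpected label: {label}"
        self.decisions[record_id] = label
        if label == "REFUSED":
            self.refusals.append(record_id)

def screen(records, classify, log: AuditLog):
    """Route each (record_id, text) pair: EXCLUDE drops it; OK, POSSIBLE,
    and REFUSED are routed to human follow-up, keeping structured
    missingness visible instead of letting refusals vanish."""
    follow_up = []
    for rec_id, text in records:
        label = classify(text)  # stand-in for an LLM call returning a label
        log.record(rec_id, label)
        if label in ("OK", "POSSIBLE", "REFUSED"):
            follow_up.append(rec_id)
    return follow_up
```

Keeping refusals in the same log as ordinary decisions is the design point: the 9 refusals reported here were rare but non-random, so an audit needs them enumerated alongside inclusions and exclusions.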

How to cite: Derdouri, A. and Masago, Y.: LLM Workflow for Land-Use Prediction Evidence Synthesis: Efficient Screening, Selective Refusals, Reportable Gaps, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-15955, https://doi.org/10.5194/egusphere-egu26-15955, 2026.