- 1Massachusetts Institute of Technology, Woods Hole Oceanographic Institution, Oceanography, United States of America (eculhane@mit.edu)
- 2Open Atlas (ezekiel@open-atlas.com)
Recent advances in large vision–language models (VLMs) offer new opportunities to extract structured, biologically relevant information from unstructured, image-like data, including satellite Earth observation and marine remote sensing modalities. We present a lightweight pipeline that uses few-shot, instruction-aligned VLMs to convert visual inputs into concise, schema-based JSON records that capture specific yet flexible scene content, such as habitat characteristics, disturbance events, species labels, data-quality attributes, Essential Biodiversity Variables (EBVs), or other user-defined ecological indicators. This system enables rapid generation of consistent annotations and structured data extraction across massive archives without task-specific model training, complex engineering, or human labeling.

We demonstrate the usefulness of this approach through practical applications to automated quality-control tagging, low-level land-use and habitat-type classification, species classification, and numerical feature estimation from 2-D images generated by satellite platforms as well as complementary sensors such as shipboard acoustics (EK60) and S-band radar. We optimize performance, both generally and per task, by incorporating human-like spatial reasoning via grid-referenced subregion analysis and by applying prompt-optimization frameworks such as DSPy for declarative prompt programming and self-improvement. By producing interpretable, reproducible, and harmonized annotations at scale, our approach substantially reduces the manual screening effort required to curate multi-sensor datasets, prioritizes scenes for higher-fidelity processing, and supports sophisticated cross-platform analysis aligned with biodiversity applications.

VLM technologies are rapidly reshaping environmental data management, and our results provide an early, practical demonstration of how VLM-based visual interpretation can enhance the flexibility, scalability, and interoperability of remote-sensing pipelines for biodiversity monitoring. Moreover, these capabilities directly support key reporting needs under the Kunming–Montreal Global Biodiversity Framework, particularly Targets 1, 2, 4, and 19, which require scalable information on ecosystem extent, condition, disturbance, and data accessibility. They also contribute to SDG indicators (e.g., 15.1.1, 15.3.1, 14.2.1) by enabling rapid, harmonized extraction of habitat, land-use, and marine-ecosystem attributes from multi-sensor Earth observation archives.
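As a concrete illustration of the extraction step described above, the following Python sketch shows one plausible shape for the schema-constrained annotation loop: an instruction prompt asks a VLM for a JSON record, which is validated against a fixed key set before entering the archive. The field names, prompt text, and stubbed `call_vlm` client are assumptions for illustration only, not the abstract's exact schema or model interface.

```python
# Minimal sketch of schema-based VLM annotation. The schema fields and the
# stubbed `call_vlm` client are illustrative assumptions, not the authors'
# actual implementation.
import json

REQUIRED_KEYS = {
    "habitat_type",        # e.g., "kelp forest", "salt marsh"
    "disturbance_events",  # list of observed disturbances, [] if none
    "species_labels",      # list of species identifications, [] if none
    "quality_flags",       # data-quality attributes, e.g., "cloud_cover"
}

PROMPT = (
    "You are annotating a remote-sensing scene for a biodiversity archive.\n"
    "Return ONLY a JSON object with exactly these keys: "
    + ", ".join(sorted(REQUIRED_KEYS))
    + ".\nUse [] for list-valued keys when nothing is observed."
)

def call_vlm(image_bytes: bytes, prompt: str) -> str:
    """Hypothetical stand-in for any instruction-following multimodal
    endpoint; a real pipeline would send the image and prompt to the
    provider of choice. A canned reply keeps this sketch runnable."""
    return json.dumps({
        "habitat_type": "seagrass meadow",
        "disturbance_events": [],
        "species_labels": ["Zostera marina"],
        "quality_flags": ["low_sun_glint"],
    })

def annotate_scene(image_bytes: bytes) -> dict:
    """One scene in, one validated schema-based JSON record out."""
    record = json.loads(call_vlm(image_bytes, PROMPT))
    missing = REQUIRED_KEYS - record.keys()
    extra = record.keys() - REQUIRED_KEYS
    if missing or extra:
        raise ValueError(f"schema violation: missing={missing}, extra={extra}")
    return record

if __name__ == "__main__":
    print(json.dumps(annotate_scene(b"<scene pixels>"), indent=2))
```

In this sketch, the strict key check is what keeps annotations consistent and harmonized across an archive; a prompt-optimization framework such as DSPy would then tune the instruction text against a small set of labeled examples rather than relying on a hand-written prompt.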
How to cite: Culhane, E. and Barnett, E.: Scalable AI‑assisted annotation of remote sensing imagery, World Biodiversity Forum 2026, Davos, Switzerland, 14–19 Jun 2026, WBF2026-550, https://doi.org/10.5194/wbf2026-550, 2026.