- University of Glasgow, MVLS, SBOHVM, United Kingdom of Great Britain – Northern Ireland (roderic.page@glasgow.ac.uk)
The published literature on biodiversity spans centuries, from accounts of expeditions to remote parts of the world, spectacular illustrations of new species (labelled with a Latin name), through to modern studies employing the latest technologies to understand how many species are on the planet, where those species live, and what they are doing.
Much of this literature now resides in the open-access Biodiversity Heritage Library (BHL), perhaps the largest corpus of biodiversity literature available. BHL comprises over 63 million pages, spread across 200,000 books and journals in multiple languages. BHL content has been cited over 400,000 times by researchers.
Preliminary text mining suggests that BHL contains the original descriptions for some 400,000 species, with additional information on up to 3 million species. But this only scratches the surface of BHL’s potential. At the most basic level, content in BHL needs to be made discoverable and citable by researchers. Machine learning (ML) tools are speeding up the discovery of individual articles within BHL, making them easier for scholars to find. A second goal, inspired by the success of Pensoft and Plazi in making recent biodiversity knowledge findable, is to convert BHL content from static scans of pages into structured publications, so that centuries of knowledge become as accessible as if they had been published today.
BHL is well known for the glorious colour plates of plants and animals, many created by artists in the 19th and 20th centuries. But BHL is also full of detailed line drawings of species, maps of their distribution, photos of their habitats, and sonograms of their calls. These images are buried within the corpus, but new Large Language Models (LLMs) can be used to identify and extract them, enhancing efforts to build tools to automatically identify species from images, as well as documenting how distributions and habitats have changed over time.
The talk will present examples of using ML to extract images and structured data from BHL at scale, and outline the future role BHL can play in making fundamental biodiversity knowledge vastly more discoverable and accessible.
How to cite: Page, R.: Unlocking Centuries of Biodiversity Knowledge: Machine Learning and the Biodiversity Heritage Library, World Biodiversity Forum 2026, Davos, Switzerland, 14–19 Jun 2026, WBF2026-822, https://doi.org/10.5194/wbf2026-822, 2026.