Automatic Biodiversity Literature Enhancement

We have recently been awarded funding by the Joint Information Systems Committee (JISC) for a feasibility study to improve access to the scanned biodiversity literature. The Automatic Biodiversity Literature Enhancement (ABLE) project is funded as part of the JISC Digitisation Programme, under the Enriching Digital Resources call for proposals. The project is funded for one year, to a value of £73,261. It is a joint project with our collaborators at the Natural History Museum, London.

The overall aim of the project is to establish and extend techniques for extracting information from scanned taxonomic literature in the Biodiversity Heritage Library. Scanned texts contain errors introduced by imperfect OCR and other sources, so the techniques must be robust in the face of such errors.

Page image scan from the Biodiversity Heritage Library

The ABLE project aims to extract mark-up and meta-data from scanned literature in the biodiversity domain. The meta-data we aim to extract includes proper nouns (taxon, people and place names) and dates. We also intend to enhance the searchability of those terms using associative techniques from Natural Language Processing combined with knowledge of likely Optical Character Recognition (OCR) errors. For example, a search for Pica should also recover Pioa, provided the context of Pioa indicates a bird, ideally a magpie. As source data, we will work with volumes of the biodiversity literature that will be scanned as part of the Biodiversity Heritage Library (BHL) project. If fully successful, the software developed in the ABLE project will be applied to the BHL's collection of over 7 million pages. The BHL scanners produce a structural XML output, and a small part of the project will look at the feasibility of developing software to create compatible files starting from plain image scans.
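
To make the Pica/Pioa example concrete, the following is a minimal sketch of one way such a search could work. It is not the ABLE project's actual method. It assumes a weighted edit distance in which substitutions between visually similar characters are cheap, plus a simple check for context words near the candidate. The confusion table, the cost values, the thresholds and the function names (`ocr_distance`, `search`) are all hypothetical choices made for illustration.

```python
import re

# Hypothetical table of character pairs that OCR tends to confuse.
# A real system would estimate these from OCR output, not hard-code them.
CONFUSABLE = {
    frozenset("co"), frozenset("ce"), frozenset("il"),
    frozenset("l1"), frozenset("uv"), frozenset("hb"),
}


def sub_cost(a, b):
    """Substitution cost: free if equal, cheap if a likely OCR confusion."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in CONFUSABLE:
        return 0.25  # assumed weight, chosen only for illustration
    return 1.0


def ocr_distance(s, t):
    """Case-insensitive weighted Levenshtein distance between two words."""
    s, t = s.lower(), t.lower()
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        cur = [float(i)]
        for j, b in enumerate(t, 1):
            cur.append(min(
                prev[j] + 1.0,               # delete a character from s
                cur[j - 1] + 1.0,            # insert a character into s
                prev[j - 1] + sub_cost(a, b) # substitute (possibly cheaply)
            ))
        prev = cur
    return prev[-1]


def search(query, text, context_terms, max_distance=0.5, window=5):
    """Return tokens close to `query` that have a context term nearby."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    lowered = [tok.lower() for tok in tokens]
    hits = []
    for i, tok in enumerate(tokens):
        if ocr_distance(query, tok) > max_distance:
            continue
        # Look at up to `window` tokens on either side of the candidate.
        nearby = lowered[max(0, i - window): i + window + 1]
        if any(term in nearby for term in context_terms):
            hits.append(tok)
    return hits


if __name__ == "__main__":
    bird_words = {"magpie", "bird"}
    print(search("Pica", "The magpie, Pioa pica, is a common bird.", bird_words))
    print(search("Pica", "Pioa is a printing type size.", bird_words))
```

In this sketch the misread Pioa is accepted only when a bird-related word appears within a few tokens of it. That stands in, very crudely, for the associative Natural Language Processing techniques mentioned above.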