About

Optical Character Recognition (OCR) is a key component of large-scale digitisation projects that deal with text-based material. While research centres in the UK (including the collaborators in the proposed project) are building up impressive portfolios of skills in the digitisation of historical and archive material, for OCR they typically rely on closed, proprietary software or on commercial companies. This raises a number of issues: the cost of proprietary software and/or external consultants; the relative lack of flexibility and adaptability of closed, “black box” software; the deskilling of digitisation centre staff, as OCR expertise is concentrated in commercial hands; and the questionable suitability for historical material of software that is typically developed for commercial applications.

The OCRopodium project aims to address these issues by:

  • Trialling an open source approach to Optical Character Recognition, using the OCRopus software.
  • Embedding OCR activities within flexible, semi-automated digitisation workflows for text-based material.

The OCRopus software will be trialled and evaluated using the outputs of a number of completed digitisation projects at CDDA, selected for the variety of the OCR capture processes that they involved. We will not just compare the outputs of OCRopus with those of the original projects; we will also investigate the potential that arises specifically from its status as open source, in particular the ability to investigate, explain and modify the software’s behaviour, and to adapt it to meet specific needs and handle specialised materials.
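As an illustration of what comparing outputs could involve, the sketch below computes a character error rate (CER) against a hand-corrected reference transcription. This is an assumption on our part: the project description does not name an evaluation metric, and CER is simply one common choice. The function names and the sample strings are invented for the example and are not taken from OCRopus or from any project data.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions turning one string into the other."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Toy strings, invented for illustration only.
    reference = "historical"
    outputs = {
        "engine_a": "histor1cal",  # hypothetical output of one OCR process
        "engine_b": "hist0rica1",  # hypothetical output of another
    }
    for name, text in outputs.items():
        print(name, character_error_rate(reference, text))
```

A lower rate means the output is closer to the reference. A measure of this kind only covers the first part of the evaluation; the ability to inspect and modify the software has to be assessed separately.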

We will also implement OCR activities within a collaborative, distributed and semi-automated workflow system that is embedded within institutional practice at the project partners. This workflow will cover the whole digitisation process, from scanning, through OCR and mark-up, to ingest into a digital repository where the content will be managed and preserved. The workflow system will be configurable, allowing the incorporation of different OCRopus plug-ins or other processing modules (such as text mining), as well as allowing for human interaction, such as reviewing and providing feedback on OCR results.
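One way to picture a configurable workflow of this kind is as an ordered list of named steps, where changing the list swaps modules in or out. The sketch below is purely illustrative and is not the project's design: every name in it (Document, register, build_workflow, run_workflow and the step names) is hypothetical, and the steps are stand-ins that do not call OCRopus or any repository software.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    identifier: str
    text: str = ""
    needs_review: bool = False
    history: List[str] = field(default_factory=list)


Step = Callable[[Document], Document]
REGISTRY: Dict[str, Step] = {}


def register(name: str) -> Callable[[Step], Step]:
    """Make a step available to workflow configurations under a name."""
    def decorator(step: Step) -> Step:
        REGISTRY[name] = step
        return step
    return decorator


@register("ocr")
def ocr(doc: Document) -> Document:
    # Stand-in for a real OCR engine call.
    doc.text = "recognised text for " + doc.identifier
    return doc


@register("flag_for_review")
def flag_for_review(doc: Document) -> Document:
    # Human interaction point: mark empty results for manual checking.
    doc.needs_review = not doc.text.strip()
    return doc


@register("markup")
def markup(doc: Document) -> Document:
    # Stand-in for structural mark-up of the recognised text.
    doc.text = "<p>" + doc.text + "</p>"
    return doc


def build_workflow(step_names: List[str]) -> List[Step]:
    """Resolve configured step names; an unknown name raises KeyError."""
    return [REGISTRY[name] for name in step_names]


def run_workflow(doc: Document, step_names: List[str]) -> Document:
    for name, step in zip(step_names, build_workflow(step_names)):
        doc = step(doc)
        doc.history.append(name)  # record which steps have been applied
    return doc


if __name__ == "__main__":
    config = ["ocr", "flag_for_review", "markup"]
    result = run_workflow(Document("page-001"), config)
    print(result.history, result.needs_review)
```

Under these assumptions, adding a text-mining module would mean registering one more step and adding its name to the configuration, without changing the code that runs the workflow.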

The proposed project represents an important step towards embedding a full portfolio of digitisation-related skills within expert digitisation centres, and closing a “skills gap” by reducing dependence on commercial OCR providers in favour of open source OCR technology, which can be adapted and improved through community development.