RESOURCE: Using Kraken to Train Your Own OCR Models

Christine Roughan, PhD student at NYU, has created a guide on how to train and implement OCR models using Kraken. Kraken is open-source command-line software for OCR that offers both pre-trained models and the ability to produce artificial training data from a text provided by the user.

This guide is a basic walkthrough of downloading and running Kraken, preparing and generating artificial training data, training and fine-tuning your model, and performing OCR on your text(s). The author uses an Arabic text as an example, but the guide’s steps are reproducible with any language. Note that the walkthrough does not cover initial preparation of the images to be processed, so if you are starting from a PDF, the pages will have to be separated into individual image files using a tool like pdftoppm or ImageMagick’s convert tool. The author notes that she has been able to use Kraken with PNG, TIFF, and JPG files.
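The PDF-splitting step is not part of the guide itself, so the following is only a minimal sketch of one way it might be done, assuming pdftoppm (from poppler-utils) is installed and on the PATH. The function names, the "page" file prefix, and the 300 dpi default are illustrative assumptions, not anything the guide prescribes.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftoppm_command(pdf_path, output_prefix, dpi=300):
    """Return the pdftoppm argument list that writes one PNG per PDF page.

    -png selects PNG output and -r sets the resolution in dpi.
    The 300 dpi default is an assumption, not a value from the guide.
    """
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(output_prefix)]


def split_pdf_into_pngs(pdf_path, output_dir, dpi=300):
    """Run pdftoppm and return the sorted list of PNG files it produced."""
    if shutil.which("pdftoppm") is None:
        raise RuntimeError("pdftoppm not found on PATH (it ships with poppler-utils)")
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    # pdftoppm appends a page number to this prefix, e.g. page-1.png or page-01.png
    prefix = output_dir / "page"
    subprocess.run(build_pdftoppm_command(pdf_path, prefix, dpi), check=True)
    return sorted(output_dir.glob("page-*.png"))


# Hypothetical usage (file and directory names are placeholders):
# pages = split_pdf_into_pngs("my_text.pdf", "pages")
```

PNG is used here because it is one of the formats the author reports working with Kraken; the resulting page images can then be passed to Kraken as described in the guide.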

This resource is a helpful introduction to using Kraken for performing OCR and creating your own training data. It will be of particular interest to anyone who works with languages written in non-Roman scripts or who would like to train and implement their own OCR models rather than relying on the pre-made models that come packaged with OCR software.

dh+lib Review

This post was produced through a collaboration among Kristina De Voe, Megan Macken, Melissa Patton, Kate Thornhill, Tierney Gleason, Molly Castro, Esther Brandon, Claudia Berger, and Anne Ladyem McDivitt (Editors-at-large for the week), Ian Goodale (Editor for the week), and Caitlin Christian-Lamb, Linsey Ford, Pamella Lach, and Nickoal Eichmann-Kalwara (dh+lib Review Editors).