Gamera Addon: OCR Toolkit
This is a Gamera toolkit for building standard text recognition applications. It is based on the Gamera framework and requires a working Gamera installation.
About the OCR toolkit
The OCR Toolkit is meant to help build optical character recognition (OCR) systems for standard text documents. Even though it can be used as is, it is specifically designed to make the individual steps of the recognition system customizable and replaceable. It provides:
- a flexible mechanism for plugging in custom page segmentation algorithms
- heuristic rules for dealing with diacritics, and for disambiguating commonly confused Roman characters (like comma and apostrophe, or lowercase and uppercase 'W')
- a ready-to-run Python script ocr4gamera.py, which acts as a basic OCR system. Note, however, that the user must train the characters beforehand: the toolkit does not include any training data.
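To illustrate what a heuristic rule for confused characters might look like, here is a minimal sketch in plain Python. It is not the toolkit's actual implementation and does not use the Gamera API: the Glyph class, the function name, and the baseline_y and x_height parameters are all hypothetical and chosen only for this example. The idea is that a comma and an apostrophe have nearly the same shape, so the decision is based on the vertical position of the glyph relative to the text line.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical stand-in for a classified glyph (not a Gamera class)."""
    label: str   # class name assigned by the classifier, e.g. "comma"
    top: int     # y coordinate of the bounding box top (y grows downward)
    bottom: int  # y coordinate of the bounding box bottom


def disambiguate_comma_apostrophe(glyph, baseline_y, x_height):
    """Return a corrected label for glyphs classified as comma or apostrophe.

    Assumption for this sketch: the baseline and the x-height of the
    text line are already known from an earlier line analysis step.
    """
    if glyph.label not in ("comma", "apostrophe"):
        # The rule applies only to the two confusable classes.
        return glyph.label
    # Midline halfway between the baseline and the top of lowercase letters.
    midline_y = baseline_y - x_height / 2.0
    center_y = (glyph.top + glyph.bottom) / 2.0
    # Smaller y means higher on the page: glyphs whose center lies above
    # the midline are treated as apostrophes, all others as commas.
    return "apostrophe" if center_y < midline_y else "comma"
```

With a baseline at y = 100 and an x-height of 20, a glyph labeled "comma" whose bounding box spans y = 70 to 80 would be relabeled as an apostrophe, while one spanning y = 95 to 105 would remain a comma.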
Detailed documentation is included with the source code package in the subdirectory doc/html. A comprehensive overview of the design, usage and customization of the OCR toolkit can be found in the paper
C. Dalitz, R. Baston: Optical Character Recognition with the Gamera Framework. In C. Dalitz (Ed.): "Document Image Analysis with the Gamera Framework." Schriftenreihe des Fachbereichs Elektrotechnik und Informatik, Hochschule Niederrhein, vol. 8, pp. 53-65, Shaker Verlag (2009)
Authors and Acknowledgements
The authors of the OCR toolkit are:
Thanks to Jakub Wilk for providing valuable feedback on this toolkit.
The source code of the OCR toolkit is freely distributed under the terms of the GNU General Public License. Note that the toolkit requires a working installation of Gamera. Available file releases are:
For release notes, see the file CHANGES. For installation and usage instructions, see the file doc/html/index.html in the source package. Once all prerequisites are installed, the toolkit can be installed by typing
python setup.py build && sudo python setup.py install