Gamera classifier API

Last modified: May 11, 2016

Introduction

For manual training of a classifier, you will generally want to use the interactive classifier GUI. This document describes the programming API that is used by scripts that make use of a classifier.

At present, Gamera supports segmentation-based image classification. This means that the page image is first segmented into individual connected components (or glyphs). Each of these glyphs has a number of features generated from it. These features (collectively called a "feature vector") are then used inside a classifier which, using a database of training data, identifies the glyph.

Interactive vs. Noninteractive classifiers

All classifiers in Gamera support the same core Classifier API (interface), so they are interchangeable. There is an important distinction between two families of classifiers, however:

Interactive classifier
An interactive classifier can have training examples added to it in real time, and the results used immediately to classify glyphs. Interactive classifiers are useful during the training process since the classifier can be "bootstrapped" with a few examples and refined interactively.
Noninteractive classifier
Noninteractive classifiers take a complete database of training data and then create an optimised data structure for classification. Because building that data structure can take a considerable amount of time, new training examples cannot be added on the fly. In addition, noninteractive classifiers are serializable to binary classifier-specific file formats, which save and load much faster than the Gamera XML format.

Types of classifiers

Within each of these families, different classifiers are available. These "concrete" classifiers have additional methods specific to the particular classifier type. The currently implemented classifiers are all k-nearest-neighbor, but we plan to add other classifiers as needed.

The classifier interface

This section describes each method of the classifier interface. The base class for all classifiers is Classifier, from which two classes NonInteractiveClassifier and InteractiveClassifier are derived. As noninteractive classifiers are more limited in their interface, this divides the methods into two categories:

Core
Methods available to all classifiers
Interactive
Methods available only to interactive classifiers

Core

The following methods and properties are available to all classifiers.

Properties

Each classifier has the following member variables:

_database
List of glyphs used as training data. This is a private property that can only be accessed through the methods get_glyphs and set_glyphs, or it is set in the constructor or with from_xml_filename. Note that the return value of get_glyphs must be converted to a list with list(classifier.get_glyphs()).
confidence_types
List of confidence types that are to be computed during classification. The confidence types must be from the predefined confidence constants.

Initialization

As the base class Classifier does not have an explicit constructor, the constructor of NonInteractiveClassifier is described here.

__init__

NonInteractiveClassifier (ImageList database = [], bool perform_splits = True)

Creates a new classifier instance.

database

Can come in two forms:

  • When database is a list (or Python iterable), each element is a glyph to use as training data for the classifier.
  • For non-interactive classifiers only, when database is a filename, the classifier will be "unserialized" from the given file.

Any images in the list that were manually classified (have classification_state == MANUAL) will be used as training data for the classifier. Any UNCLASSIFIED or AUTOMATICALLY classified images will be ignored.

When initializing a noninteractive classifier, the database must be non-empty.

perform_splits

If perform_splits is True, glyphs trained with names beginning with _split. are run through a given splitting algorithm. For instance, glyphs that need to be broken into upper and lower halves for further classification of those parts would be trained as _split.splity. When the automatic classifier encounters glyphs that most closely match those trained as _split, it will perform the splitting algorithm and then continue to recursively classify its parts.

The splitting algorithms are documented in the plugin documentation.

New splitting algorithms can be created by writing plugin methods in the category Segmentation.
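
The interplay between _split. names and recursive classification can be sketched in plain Python. This is only an illustration under simplifying assumptions, not Gamera's implementation: strings stand in for glyphs, and classify_recursive, toy_classify and the halves splitter are hypothetical names made up for the example.

```python
SPLIT_PREFIX = "_split."


def classify_recursive(glyph, classify, splitters, max_recursion=10, depth=0):
    """Classify glyph; if it is a split, classify its parts recursively.

    Returns a list of (glyph, id_name) pairs for the glyph and all parts.
    """
    id_name = classify(glyph)
    results = [(glyph, id_name)]
    # Only split while the recursion limit has not been reached.
    if id_name.startswith(SPLIT_PREFIX) and depth < max_recursion:
        split = splitters[id_name[len(SPLIT_PREFIX):]]
        for part in split(glyph):
            results.extend(
                classify_recursive(part, classify, splitters,
                                   max_recursion, depth + 1))
    return results


def toy_classify(glyph):
    # Toy rule (assumption): anything longer than one character needs splitting.
    return "_split.halves" if len(glyph) > 1 else glyph


# Maps the part of the name after "_split." to a splitting function.
SPLITTERS = {"halves": lambda s: [s[:len(s) // 2], s[len(s) // 2:]]}

results = classify_recursive("abcd", toy_classify, SPLITTERS)
leaves = [g for g, name in results if not name.startswith(SPLIT_PREFIX)]
print(leaves)

# With a low limit, parts that would split again are left as they are.
shallow = classify_recursive("abcd", toy_classify, SPLITTERS, max_recursion=1)
print(len(shallow))
```

The depth check is what keeps a glyph whose parts keep classifying as _split from recursing forever.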

Classification

The following methods deal with classifying glyphs on an individual level.

classify_glyph_automatic

classify_glyph_automatic (Image glyph, int max_recursion = 10)

Classifies a glyph and sets its classification_state and id_name. (If you don't want the glyph changed, use guess_glyph_automatic.)

glyph
The glyph to classify.
max_recursion (optional)
Limit the number of split recursions.

Returns a 2-tuple (pair) of lists: (add, remove). add is a list of glyphs that were created by classifying glyph as a split (See Initialization). remove is a list of glyphs that are no longer valid due to reclassifying glyph from a split to something else. Most often, both of these lists will be empty. You will normally want to use these lists to update the collection of glyphs on the current page.

classify_list_automatic

classify_list_automatic (ImageList glyphs, int max_recursion = 10)

Classifies a list of glyphs and sets the classification_state and id_name of each glyph. (If you don't want it set, use guess_glyph_automatic.)

glyphs
A list of glyphs to classify.
max_recursion
The maximum level of recursion to follow when splitting glyphs. Since some glyphs will split into parts that then classify as _split in turn, a maximum depth should be set to avoid infinite recursion. This number can normally be set quite low, depending on the application.

Return type: (add, remove)

The list glyphs is never modified by the function. Instead, it returns a 2-tuple (pair) of lists: (add, remove). add is a list of glyphs that were created by classifying glyphs as a split (See Initialization). remove is a list of glyphs that are no longer valid due to reclassifying glyphs from a split to something else. Most often, both of these lists will be empty. You will normally want to use these lists to update the collection of glyphs on the current page. If you just want a new list returned with these updates already made, use classify_and_update_list_automatic.
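
Updating the page's glyph collection from the returned (add, remove) pair can be sketched as follows. This is a plain-Python illustration, not Gamera code: apply_updates is a hypothetical helper, strings stand in for glyph images, and glyphs are matched by object identity, which is an assumption of the sketch.

```python
def apply_updates(glyphs, add, remove):
    """Return a new list with the remove glyphs dropped and the add glyphs appended."""
    removed_ids = set(id(g) for g in remove)
    updated = [g for g in glyphs if id(g) not in removed_ids]
    updated.extend(add)
    return updated


# Toy page: strings stand in for glyph images.
page = ["a", "old_upper_half", "old_lower_half", "b"]
add = ["new_part"]
remove = [page[1], page[2]]

updated = apply_updates(page, add, remove)
print(updated)
```

Like the classifier methods, the helper leaves the input list untouched and returns a new one.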

classify_and_update_list_automatic

classify_and_update_list_automatic (ImageList glyphs, int max_recursion = 10)

A convenience wrapper around classify_list_automatic that returns a list of glyphs that is already updated based on splitting.

guess_glyph_automatic

(id_name, confidencemap) guess_glyph_automatic (Image glyph)

Classifies the given glyph without setting its classification. The return value is a tuple of the form (id_name, confidencemap), where id_name is a list in the id_name format and confidencemap is a map in the confidence format that lists the confidences of the main id.

classify_with_images

(id_name, confidencemap) classify_with_images (ImageList glyphs, Image glyph, bool cross_validation_mode=False, bool do_confidence=True )

Classifies an unknown image using the given list of images as training data. The glyph is classified without setting its classification. The return value is a tuple of the form (id_name, confidencemap), where id_name is a list in the id_name format and confidencemap is a map in the confidence format that lists the confidences of the main id.

Grouping

Often, characters do not cleanly correspond to connected components. For instance, broken or degraded printing may disconnect parts of a character, or characters such as i may always be made up of two connected components. The grouping algorithm is designed to deal with those cases. It attempts to group connected components with others nearby in order to create groupings that are more like glyphs in the database. Needless to say, this approach is much slower than the "one-connected-component-at-a-time" approach, but can produce considerably better results on certain images.

To train for grouping, images corresponding to the entire character must exist in the database. For instance, in the Gamera GUI, one would select both the dot and stem of a lower-case i and train it as _group.lower.i. This will join the two connected components into a single image and then add it to the database.

The algorithm is described in more detail in our paper on correcting broken characters (PDF).

group_list_automatic

group_list_automatic (ImageList glyphs, Function grouping_function = None, Function evaluate_function = None, int max_parts_per_group = 4, int max_graph_size = 16, string criterion = 'min')

Classifies the given list of glyphs. Adjacent glyphs are joined together if doing so results in a higher global confidence. Each part of a joined glyph is classified as HEURISTIC with the prefix _group.

glyphs
The list of glyphs to group and classify.
grouping_function

A function that determines how glyphs are initially combined. This function must take exactly two arguments, to which the grouping algorithm will pass an arbitrary pair of glyphs from glyphs. If the two glyphs should be considered for grouping, the function should return True, else False.

In gamera.classify, there are two predefined grouping functions:

BoundingBoxGroupingFunction (threshold)
A callable class that returns True when the bounding boxes are at most threshold apart.
ShapedGroupingFunction (threshold)
A callable class that returns True when the closest distance between the black pixels is at most threshold.

When grouping_function is None, BoundingBoxGroupingFunction(4) is used.

evaluate_function

A function that evaluates a grouping of glyphs. This function must take exactly one argument which is a list of glyphs. The function should return a confidence value between 0 and 1 (1 being most confident) representing how confidently the grouping forms a valid character.

If no evaluate_function is provided, a default one will be used that returns the CONFIDENCE_DEFAULT of the knn classification.

max_parts_per_group
The maximum number of connected components that will be grouped together and tested as a group. For performance reasons, this number should be kept relatively small.
max_graph_size
Subgraphs (potentially connected areas of the image) larger than the given number of nodes will be ignored. This is a hack to prevent the runtime of the algorithm from exploding.
criterion
Determines how the grouping candidates (CCs) are evaluated against each other in the optimization step. The default, min, chooses the grouping with the highest minimum confidence; avg chooses the one with the highest average confidence.
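
A custom grouping_function, as described above, only has to be callable with two glyphs and return True or False. The following plain-Python sketch shows the shape of such a callable class. It is not Gamera's BoundingBoxGroupingFunction: the Box record and the class name BoxGapGroupingFunction are hypothetical, and measuring "apart" as the larger of the horizontal and vertical gap between the boxes is an assumption of the sketch.

```python
from collections import namedtuple

# Hypothetical stand-in for a glyph's bounding box (pixel coordinates).
Box = namedtuple("Box", ["ul_x", "ul_y", "lr_x", "lr_y"])


class BoxGapGroupingFunction:
    """Callable that accepts a pair when their boxes are close enough."""

    def __init__(self, threshold):
        self.threshold = threshold

    def __call__(self, a, b):
        # Gap along each axis; 0 when the boxes overlap on that axis.
        gap_x = max(a.ul_x - b.lr_x, b.ul_x - a.lr_x, 0)
        gap_y = max(a.ul_y - b.lr_y, b.ul_y - a.lr_y, 0)
        return max(gap_x, gap_y) <= self.threshold


near = BoxGapGroupingFunction(4)
stem = Box(10, 20, 12, 40)    # stem of a lower-case i
dot = Box(10, 14, 12, 16)     # its dot, 4 pixels above the stem
far = Box(100, 20, 105, 40)   # an unrelated glyph further right

print(near(stem, dot), near(stem, far))
```

Using a class rather than a bare function lets the threshold be chosen once, when the callable is created.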

The function returns a 2-tuple (pair) of lists: (add, remove). add is a list of glyphs that were created by classifying any glyphs as a split (See Initialization) or grouping. remove is a list of glyphs that are no longer valid due to reclassifying glyphs from a split to something else.

The list glyphs is never modified. Instead, detected parts of groups are classified as _group._part.*, where * stands for the class name of the grouped glyph. This means that after calling this function, you must remove from glyphs all CCs listed in remove and all CCs with a class name beginning with _group._part, and you must add all glyphs from add to it. Alternatively, call group_and_update_list_automatic, which does this for you.
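
The manual update described above can be sketched in plain Python. This is an illustration only, not Gamera's code: the Glyph record, the helper update_after_grouping and all class names in the toy data are made up for the example, and glyphs are matched by object identity.

```python
from collections import namedtuple

# Hypothetical minimal glyph record (a real glyph is a Gamera image).
Glyph = namedtuple("Glyph", ["label", "class_name"])

PART_PREFIX = "_group._part"


def update_after_grouping(glyphs, add, remove):
    """Drop removed glyphs and group parts, then append the added glyphs."""
    removed_ids = set(id(g) for g in remove)
    kept = [g for g in glyphs
            if id(g) not in removed_ids
            and not g.class_name.startswith(PART_PREFIX)]
    return kept + list(add)


glyphs = [
    Glyph("stem", "_group._part.lower.i"),
    Glyph("dot", "_group._part.lower.i"),
    Glyph("x", "lower.x"),
]
add = [Glyph("stem+dot", "_group.lower.i")]  # illustrative class name
remove = []

updated = update_after_grouping(glyphs, add, remove)
print([g.label for g in updated])
```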

group_and_update_list_automatic

group_and_update_list_automatic (ImageList glyphs, Function grouping_function = None, Function evaluate_function = None, int max_parts_per_group = 5, int max_graph_size = 16, string criterion = 'min')

A convenience wrapper around group_list_automatic that returns a list of glyphs that is already updated for splitting and grouping.

Saving and loading

These functions deal with saving and loading the training data of the classifier to/from the Gamera XML format.

Note

UNCLASSIFIED glyphs in the training data are ignored (neither saved nor loaded).

to_xml

to_xml (stream stream)

Saves the training data in XML format to the given stream (which could be any object supporting the file protocol, such as a file object or StringIO object).

to_xml_filename

to_xml_filename (FileSave filename)

Saves the training data in XML format to the given filename.

from_xml

from_xml (stream stream)

Loads the training data from the given stream (which could be any object supporting the file protocol, such as a file object or StringIO object).

from_xml_filename

from_xml_filename (FileOpen filename)

Loads the training data from the given filename.

merge_from_xml

merge_from_xml (stream stream)

Loads the training data from the given stream (which could be a file handle or StringIO object) and adds it to the existing training data.

merge_from_xml_filename

merge_from_xml_filename (FileOpen filename)

Loads the training data from the given filename and adds it to the existing training data.

Miscellaneous

is_interactive

Boolean is_interactive ()

Returns True if classifier is interactive, else False.

get_glyphs

ImageList get_glyphs ()

Returns a list of the glyphs in the classifier.

set_glyphs

set_glyphs (ImageList glyphs)

Sets the training data for the classifier to the given list of glyphs.

On some non-interactive classifiers, this operation can be quite expensive.

merge_glyphs

merge_glyphs (ImageList glyphs)

Adds the given glyphs to the current set of training data.

On some non-interactive classifiers, this operation can be quite expensive.

clear_glyphs

clear_glyphs ()

Removes all training data from the classifier.

Interactive classifiers

Classification

classify_glyph_manual

classify_glyph_manual (Image glyph, String id)

Sets the classification of the given glyph to the given id and then adds the glyph to the training data. Call this function when the end user definitively knows the identity of the glyph.

glyph
The glyph to classify.
id
The class name.

Note

Here id is a simple string, not of the id_name format, since the confidence of a manual classification is always 1.0.

classify_list_manual

classify_list_manual (ImageList glyphs, String id)

Sets the classification of the given glyphs to the given id and then adds the glyphs to the training data. Call this function when the end user definitively knows the identity of the glyphs.

If id begins with the special prefix _group, all of the glyphs in glyphs are combined and the result is added to the training data. This is useful for characters that always appear with multiple connected components, such as the lower-case i.

glyphs
The glyphs to classify.
id
The class name.

Note

Here id is a simple string, not of the id_name format, since the confidence of a manual classification is always 1.0.

classify_and_update_list_manual

classify_and_update_list_manual (ImageList glyphs, Function grouping_function = None, Function evaluate_function = None, int max_size = 5)

A convenience wrapper around group_list_automatic that returns a list of glyphs that is already updated for splitting and grouping.

add_to_database

add_to_database (ImageList glyphs)

Adds the given glyph (or list of glyphs) to the classifier training data. Will not add duplicates to the training data. Unlike classify_glyph_manual, no grouping support is performed.

remove_from_database

remove_from_database (ImageList glyphs)

Removes the given glyphs from the classifier training data. Glyphs that are not in the training data are silently ignored.

Display

display

display (ImageList current_database = [], Image context_image = None, List symbol_table = [])

Displays the interactive classifier window, which is where manual training usually takes place.

current_database
A list of glyphs yet to be trained.
context_image
An image of the page where the glyphs in current_database came from.
symbol_table
A set of id names to insert by default into the symbol table.

k Nearest Neighbor classifier

The k Nearest Neighbor classifier is a concrete example of the classifier API. It adds some methods of its own.

kNNNonInteractive has a number of advantages over kNNInteractive:

Note

It is good practice to retain the XML file, since it is portable across platforms and to future versions of Gamera. The binary format is not guaranteed to be portable.

Feature management

The classifier automatically manages the generation of feature vectors from glyphs. When a feature vector is needed because a glyph is being automatically classified or added to the training set, it is generated on the fly.

By default, the feature generation method in kNN is quite simple. The user of the classifier provides a list of feature function names (either in the constructor or through the change_feature_set method), and for each glyph, the results of each feature function in the set are appended together to produce a feature vector.

This basic feature generation method can be overridden and replaced with something more appropriate to other problem domains. See the overriding kNN's feature generation appendix for more information.

Methods on all kNN classes

kNN Initialization

__init__

kNNNonInteractive (ImageList database = [], features = 'all', bool perform_splits = True, int num_k = 1, bool normalize = False)

Creates a new kNN classifier instance.

database

Can be in one of two forms:

  • When database is a list (or Python iterable), each element is a glyph to use as training data for the classifier. (For non-interactive classifiers, this list must be non-empty.)
  • For non-interactive classifiers, database may be a filename, in which case the classifier will be "unserialized" from the given file.

Any images in the list that were manually classified (have classification_state == MANUAL) will be used as training data for the classifier. Any UNCLASSIFIED or AUTOMATICALLY classified images will be ignored.

When initializing a noninteractive classifier, the database must be non-empty.

features
A list of feature function names to use for classification. These feature names correspond to the feature plugin methods. To use all available feature functions, pass in 'all'.
perform_splits

If perform_splits is True, glyphs trained with names beginning with _split. are run through a given splitting algorithm. For instance, glyphs that need to be broken into upper and lower halves for further classification of those parts would be trained as _split.splity. When the automatic classifier encounters glyphs that most closely match those trained as _split, it will perform the splitting algorithm and then continue to recursively classify its parts.

The splitting algorithms are documented in the plugin documentation.

New splitting algorithms can be created by writing plugin methods in the category Segmentation.
normalize
Normalize the feature vectors: x' = (x - mean_x)/stdev_x
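
The normalization formula above is applied to each feature component separately. A minimal plain-Python sketch, not Gamera's implementation: normalize_features is a hypothetical helper, the population standard deviation is used, and components with zero deviation are mapped to 0.0. Both choices are assumptions of the sketch.

```python
from statistics import mean, pstdev


def normalize_features(vectors):
    """Apply x' = (x - mean_x) / stdev_x to every component (column)."""
    columns = list(zip(*vectors))
    means = [mean(c) for c in columns]
    stdevs = [pstdev(c) for c in columns]
    return [
        # A constant component carries no information; map it to 0.0.
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(v, means, stdevs)]
        for v in vectors
    ]


vectors = [[1.0, 10.0], [3.0, 10.0]]
normalized = normalize_features(vectors)
print(normalized)
```

Normalizing keeps a feature with a large numeric range from dominating the distance between two feature vectors.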

Settings

Settings are various parameters that control the behavior of the classifier. While some are only accessible through the methods given below, the following settings are plain properties of all kNN classifier classes:

num_k
the number k of neighbors to be considered
distance_type
the distance measure for neighborhood. Can be one of CITY_BLOCK (default), EUCLIDEAN or FAST_EUCLIDEAN
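
The choice of distance_type can change which training sample counts as the nearest neighbor. The sketch below illustrates this with plain-Python versions of the city block and Euclidean distances. The function names are made up for the example and this is not Gamera's code; FAST_EUCLIDEAN is left out because this document does not define it.

```python
import math


def city_block(a, b):
    """Sum of absolute component differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Square root of the sum of squared component differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [0.0, 0.0]
p1 = [3.0, 3.0]
p2 = [0.0, 5.0]

# The same query has a different nearest neighbor under each measure.
nearest_city_block = min([p1, p2], key=lambda p: city_block(query, p))
nearest_euclidean = min([p1, p2], key=lambda p: euclidean(query, p))
print(nearest_city_block, nearest_euclidean)
```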

change_feature_set

change_feature_set (features)

Changes the set of features used in the classifier to the given list of feature names.

features
These feature names correspond to the feature plugin methods. To use all available feature functions, pass in 'all'.

get_selections_by_features

get_selections_by_features ()

Get the selection vector elements.

This function returns a python dictionary: keys are the feature names, values are lists of zeros and ones, where ones correspond to the selected components.

get_selections_by_feature

get_selections_by_feature (String feature_name)

Convenience wrapper for get_selections_by_features function. This function returns only the selection values list for the given feature name.

get_weights_by_features

get_weights_by_features ()

Get the weighting vector elements.

This function returns a python dictionary: keys are the feature names, values are lists of real values in [0,1], which give the weight of the respective component.

get_weights_by_feature

get_weights_by_feature (String feature_name)

Convenience wrapper for get_weights_by_features function. This function returns only the weighting values list for the given feature name.

set_selections_by_features

set_selections_by_features (Dictionary values)

Set the selection vector elements by the corresponding feature name.

values
Python dictionary with feature names as keys and lists as values, as described in get_selections_by_features.

The dictionary must contain an entry for every feature of the currently active feature set, which was set in the constructor of the classifier or by change_feature_set. Example:

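from gamera import knn

# Use two features; the third positional argument is perform_splits.
classifier = knn.kNNNonInteractive("train.xml",
                                   ["aspect_ratio","moments"], 0)
# One selection flag per feature component.
classifier.set_selections_by_features({"aspect_ratio":[1],
                                       "moments":[0, 1, 1, 1, 1, 1, 1, 1, 0]})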

set_selections_by_feature

set_selections_by_feature (String feature_name, List values)

Set the selection vector elements for one specific feature.

feature_name
The feature name as string.
values
Python list with the selection values for the given feature. Dimension of the list must match with the feature dimension.
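
What a selection vector does to a feature vector can be sketched in plain Python. This is an illustration, not Gamera's implementation: flatten_by_features and apply_selection are hypothetical helpers, and the sketch assumes that per-feature lists are concatenated in the order of the active feature set.

```python
def flatten_by_features(feature_names, values_by_feature):
    """Concatenate the per-feature lists in the order of feature_names."""
    flat = []
    for name in feature_names:
        flat.extend(values_by_feature[name])
    return flat


def apply_selection(vector, mask):
    """Keep only the components whose selection flag is 1."""
    return [x for x, keep in zip(vector, mask) if keep]


features = ["aspect_ratio", "moments"]
selections = {"aspect_ratio": [1],
              "moments": [0, 1, 1, 1, 1, 1, 1, 1, 0]}

mask = flatten_by_features(features, selections)
vector = [float(i) for i in range(10)]  # toy 10-component feature vector
selected = apply_selection(vector, mask)
print(mask)
print(selected)
```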

set_weights_by_features

set_weights_by_features (Dictionary values)

Set the weighting vector elements by the corresponding feature name. The dictionary must contain an entry for every feature of the currently active feature set, which was set in the constructor of the classifier or by change_feature_set. Example:

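from gamera import knn

# Use two features; the third positional argument is perform_splits.
classifier = knn.kNNNonInteractive("train.xml",
                                   ["aspect_ratio","moments"], 0)
# One weight in [0,1] per feature component.
classifier.set_weights_by_features({"aspect_ratio":[0.6],
                                    "moments":[0.1, 1.0, 0.3, 0.5, 1.0, 0.0, 1.0, 0.9, 0.0]})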

set_weights_by_feature

set_weights_by_feature (String feature_name, List values)

Set the weighting vector elements for one specific feature.

feature_name
The feature name as string.
values
Python list with the weighting values for the given feature. Dimension of the list must match with the feature dimension.
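
One plausible reading of the weighting vector is that each weight scales the contribution of its component to the distance. The sketch below shows that reading for the city block distance. It is an assumption made for illustration, not a statement of how Gamera applies the weights, and weighted_city_block is a hypothetical helper.

```python
def weighted_city_block(a, b, weights):
    """City block distance with each component difference scaled by its weight."""
    return sum(w * abs(x - y) for x, y, w in zip(a, b, weights))


features = ["aspect_ratio", "moments"]
weights_by_feature = {
    "aspect_ratio": [0.6],
    "moments": [0.1, 1.0, 0.3, 0.5, 1.0, 0.0, 1.0, 0.9, 0.0],
}
# Concatenate the per-feature weights in feature-set order (assumption).
weights = [w for name in features for w in weights_by_feature[name]]

a = [1.0] * 10
b = [0.0] * 10
distance = weighted_city_block(a, b, weights)

# Changing a component whose weight is 0.0 does not change the distance.
b_changed = b[:-1] + [100.0]
distance_changed = weighted_city_block(a, b_changed, weights)
print(distance, distance_changed)
```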

save_settings

save_settings (FileSave filename)

Save the kNN settings to the given filename. This settings file (which is XML) includes k, distance type, the current selection and weighting. This file is different from the one produced by serialize in that it contains only the settings and no data.

load_settings

load_settings (FileOpen filename)

Load the kNN settings from an XML file. See save_settings.

Serialization

serialize

serialize (FileSave filename)

Saves the classifier-specific settings and data in an optimized and classifier-specific format.

Note

It is good practice to retain the XML file, since it is portable across platforms and to future versions of Gamera. The binary format is not guaranteed to be portable.

unserialize

unserialize (FileOpen filename)

Opens the classifier-specific settings and data from an optimized and classifier-specific format.

Evaluation

evaluate

Float evaluate ()

Evaluate the performance of the kNN classifier using leave-one-out cross-validation. The return value is a floating-point number between 0.0 (0% correct) and 1.0 (100% correct).
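
Leave-one-out cross-validation classifies every training sample using all of the other samples and reports the fraction classified correctly. A minimal plain-Python sketch of the idea, not Gamera's implementation: the helper names, the city block distance and the toy data are all assumptions of the example.

```python
from collections import Counter


def city_block(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_classify(training, vector, k=1):
    """Majority vote among the k training samples closest to vector."""
    nearest = sorted(training, key=lambda s: city_block(s[0], vector))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def leave_one_out(training, k=1):
    """Fraction of samples classified correctly by all other samples."""
    correct = 0
    for i, (vector, label) in enumerate(training):
        rest = training[:i] + training[i + 1:]
        if knn_classify(rest, vector, k) == label:
            correct += 1
    return correct / len(training)


samples = [
    ([0.0, 0.0], "a"),
    ([0.1, 0.0], "a"),
    ([1.0, 1.0], "b"),
    ([1.1, 1.0], "b"),
    ([0.05, 0.1], "b"),  # deliberately mislabelled: lies among the "a" samples
]
accuracy = leave_one_out(samples, k=1)
print(accuracy)
```

Four of the five toy samples are recovered from their neighbors; the mislabelled one is not.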

knndistance_statistics

knndistance_statistics (Int k = 0)

Returns a list of average distances between each training sample and its k nearest neighbors. So, when you have n training samples, n average distance values are returned. This can be useful for distance rejection.

Each item in the returned list is a tuple (d, classname), where d is the average kNN distance and classname is the class name of the training sample. In most cases, the class name is of little interest, but it could be useful if you need class conditional distance statistics. Beware however, that the average distance is computed over neighbors belonging to any class, not just the same class. If you need the latter, you must create a new classifier from training samples belonging only to the specific class.

When k is zero, the property num_k of the knn classifier is used.
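
How such statistics might be computed and then used for distance rejection can be sketched in plain Python. This is an illustration under stated assumptions, not Gamera's code: knn_distance_statistics and is_rejected are hypothetical helpers, the city block distance is used, each sample is excluded from its own neighbors, and taking the largest average as the rejection threshold is only one possible policy.

```python
def city_block(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_distance_statistics(training, k=1):
    """Return (average distance to the k nearest other samples, class name)."""
    stats = []
    for i, (vector, label) in enumerate(training):
        distances = sorted(city_block(vector, other)
                           for j, (other, _) in enumerate(training) if j != i)
        nearest = distances[:k]
        stats.append((sum(nearest) / len(nearest), label))
    return stats


def is_rejected(training, vector, threshold):
    """Reject a glyph whose nearest training sample is further than threshold."""
    return min(city_block(vector, other) for other, _ in training) > threshold


training = [([0.0], "a"), ([1.0], "a"), ([3.0], "b")]
stats = knn_distance_statistics(training, k=1)
threshold = max(d for d, _ in stats)
print(stats, threshold)
print(is_rejected(training, [10.0], threshold))
```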

distance_from_images

distance_from_images (ImageList glyphs, Image glyph, Float max = None)

Compute a list of distances between a list of glyphs and a single glyph. Distances greater than max are not included in the output. The return value is a list of floating-point distances.

distance_between_images

distance_between_images (Image imagea, Image imageb)

Compute the distance between two images using the settings for the kNN object (distance_type, features, weights, etc). This can be used when more control over the distance computations is needed than with any of the other methods that work on multiple images at once.

distance_matrix

distance_matrix (ImageList images, Bool normalize = True)

Create a symmetric FloatImage containing all of the distances between the images in the list passed in. This is useful because it allows you to look up the distance between any pair of images regardless of the order of the pair.

normalize
When true, the features are normalized before performing the distance calculations.

unique_distances

unique_distances (ImageList images, Bool normalize = True)

Return a list of the unique pairs of images in the passed in list and the distances between them. The return list is a list of tuples of (distance, imagea, imageb) so that it is easy to sort.

normalize
When true, the features are normalized before performing the distance calculations.
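
The relationship between a symmetric distance matrix and a list of unique pairs can be sketched in plain Python. This is an illustration only, not Gamera's implementation: feature vectors stand in for images, the helper names are hypothetical, the city block distance is assumed, and the normalize step is left out.

```python
from itertools import combinations


def city_block(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def pairwise_distances(vectors, distance=city_block):
    """Return (distance, a, b) for every unordered pair, sorted by distance."""
    pairs = [(distance(a, b), a, b) for a, b in combinations(vectors, 2)]
    return sorted(pairs, key=lambda t: t[0])


def symmetric_matrix(vectors, distance=city_block):
    """Nested-list matrix with matrix[i][j] == matrix[j][i]."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = distance(vectors[i], vectors[j])
    return matrix


vectors = [(0.0,), (1.0,), (4.0,)]
pairs = pairwise_distances(vectors)
matrix = symmetric_matrix(vectors)
print(pairs)
print(matrix)
```

For n items there are n(n-1)/2 unique pairs, which is why the pair list is smaller than the full n by n matrix.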

kNNInteractive

__init__

kNNInteractive (ImageList database = [], features = 'all', bool perform_splits = True, int num_k = 1)

Creates a new kNN interactive classifier instance.

database

Must be a list (or Python iterable) containing glyphs to use as training data for the classifier.

Any images in the list that were manually classified (have classification_state == MANUAL) will be used as training data for the classifier. Any UNCLASSIFIED or AUTOMATICALLY classified images will be ignored.

When initializing a noninteractive classifier, the database must be non-empty.

features
A list of feature function names to use for classification. These feature names correspond to the feature plugin methods. To use all available feature functions, pass in 'all'.
perform_splits

If perform_splits is True, glyphs trained with names beginning with _split. are run through a given splitting algorithm. For instance, glyphs that need to be broken into upper and lower halves for further classification of those parts would be trained as _split.splity. When the automatic classifier encounters glyphs that most closely match those trained as _split, it will perform the splitting algorithm and then continue to recursively classify its parts.

The splitting algorithms are documented in the plugin documentation.

New splitting algorithms can be created by writing plugin methods in the category Segmentation.

noninteractive_copy

noninteractive_copy ()

Creates a non-interactive copy of the interactive classifier.

Improving kNN Classifiers using Editing

Gamera provides a way to improve kNN classifiers by modifying the underlying set of glyphs (training-set). This class of algorithms either removes bad or redundant glyphs or even creates new optimal glyphs from the training-set.

Besides the graphical user interface in the Classifier Display, it is also possible to invoke the algorithms from your script.

Each editing algorithm is a function that takes at least one parameter, a kNNInteractive classifier, and returns a new, edited kNNInteractive classifier. Any additional parameters depend on the specific algorithm, but are optional by convention.

Currently the following editing algorithms are included with Gamera:

edit_mnn

edit_mnn (kNNInteractive classifier, int k = 0, bool protectRare, int rareThreshold)

Wilson's Modified Nearest Neighbour (MNN, aka Leave-one-out-editing). The algorithm removes 'bad' glyphs from the classifier, i.e. glyphs that are outliers from their class in feature space, usually because they have been manually misclassified or are not representative of their class.

classifier
The classifier from which to create an edited copy
internalK
The k value used internally by the editing algorithm. If 0 is given for this parameter, the original classifier's k is used (recommended).
protect rare classes
The algorithm tends to completely delete the items of rare classes, removing the whole class from the classifier. If this is not desired, these rare classes can be explicitly protected from deletion. Note that enabling this option causes additional computing effort.
rare class threshold
If protect rare classes is enabled, classes with fewer than this number of elements are considered to be rare.

Reference: D. Wilson: 'Asymptotic Properties of NN Rules Using Edited Data'. IEEE Transactions on Systems, Man, and Cybernetics, 2(3):408-421, 1972
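
The core idea of Wilson-style editing can be sketched in plain Python: each sample is classified by its k nearest neighbors among the other samples, and samples that disagree with their own label are dropped. This is a simplified illustration, not the code behind edit_mnn: the helper names, the city block distance and the toy data are assumptions, and the protection of rare classes is not modelled.

```python
from collections import Counter


def city_block(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_classify(training, vector, k):
    nearest = sorted(training, key=lambda s: city_block(s[0], vector))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def wilson_edit(training, k=3):
    """Keep only samples that their k nearest other samples classify correctly."""
    kept = []
    for i, (vector, label) in enumerate(training):
        rest = training[:i] + training[i + 1:]
        if knn_classify(rest, vector, k) == label:
            kept.append((vector, label))
    return kept


outlier = ([0.05, 0.05], "b")  # labelled "b" but located among the "a" samples
training = [
    ([0.0, 0.0], "a"), ([0.1, 0.0], "a"), ([0.0, 0.1], "a"),
    ([1.0, 1.0], "b"), ([1.1, 1.0], "b"), ([1.0, 1.1], "b"),
    outlier,
]
edited = wilson_edit(training, k=3)
print(len(training), len(edited))
```

Every decision is made against the original training set, so the order of the samples does not affect the result.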

edit_cnn

edit_cnn (kNNInteractive classifier, int k = 0, bool randomize)

Hart's Condensed Nearest Neighbour (CNN) editing. This algorithm specializes in removing superfluous glyphs (glyphs that do not influence the recognition rate) from the classifier to improve its classification speed. Typically, glyphs far from the classifier's decision boundaries are removed.

classifier
The classifier from which to create an edited copy
internalK
The k value used internally by the editing algorithm. If 0 is given, the same value as the given classifier's k is used (recommended).
randomize
Because the processing order of the glyphs in the classifier impacts the result of this algorithm, the order will be randomized. If reproducible results are required, turn this option off.

Reference: P.E. Hart: 'The Condensed Nearest Neighbor rule'. IEEE Transactions on Information Theory, 14(3):515-516, 1968
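
The condensing idea can be sketched in plain Python: start with a single stored sample and add a training sample to the store only when the store misclassifies it, repeating until a full pass adds nothing. This is a simplified 1-NN illustration, not the code behind edit_cnn: the helper names, the city block distance and the fixed processing order are assumptions of the sketch.

```python
def city_block(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest_label(store, vector):
    """Label of the stored sample closest to vector (1-NN)."""
    return min(store, key=lambda s: city_block(s[0], vector))[1]


def condense(training):
    """Return a subset that still classifies every training sample correctly."""
    store = [training[0]]
    changed = True
    while changed:
        changed = False
        for sample in training:
            if sample in store:
                continue
            # Only samples the current store gets wrong are worth keeping.
            if nearest_label(store, sample[0]) != sample[1]:
                store.append(sample)
                changed = True
    return store


training = [
    ([0.0], "a"), ([0.1], "a"), ([0.2], "a"),
    ([1.0], "b"), ([1.1], "b"), ([1.2], "b"),
]
store = condense(training)
print(store)
```

Because samples are visited in list order, a different order can yield a different store, which is why the randomize option matters for reproducibility.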

edit_mnn_cnn

edit_mnn_cnn (kNNInteractive classifier, int k = 0, bool protectRare, int rareThreshold, bool randomize)

Combined execution of Wilson's Modified Nearest Neighbour and Hart's Condensed Nearest Neighbour. Combining the algorithms in this order is recommended, because first bad samples are removed to improve the classifier's accuracy, and then the remaining samples are condensed to speed up the classifier.

For documentation of the parameters, see the individual algorithms.

Usage Example

The algorithms are located in the gamera.knn_editing package. So a typical usage example employing Wilson's Modified Nearest Neighbour (edit_mnn) would look like this:

from gamera.core import init_gamera
init_gamera()

from gamera.knn import kNNInteractive
classifier = kNNInteractive()
classifier.from_xml_filename("training-set.xml")

from gamera.knn_editing import edit_mnn
editedClassifier = edit_mnn(classifier)

To display the glyphs removed by the editing algorithm, you can use the following code in the Gamera GUI:

set1 = classifier.get_glyphs()
set2 = editedClassifier.get_glyphs()
imagelist = list(set1.difference(set2))
display_multi(imagelist)

Integrating your own editing algorithm

If you have written your own editing algorithm and want to make it available in the Classifier GUI, refer to the documentation of the class gamera.knn_editing.AlgoRegistry.