Last modified: September 16, 2022
For manual training of a classifier, you will generally want to use the interactive classifier GUI. This document describes the programming API that is used by scripts that make use of a classifier.
At present, Gamera supports segmentation-based image classification. This means that the page image is first segmented into individual connected components (or glyphs). Each of these glyphs has a number of features generated from it. These features (collectively called a "feature vector") are then used inside a classifier which, using a database of training data, identifies the glyph.
All classifiers in Gamera support the same core Classifier API (interface), so they are interchangeable. There is an important distinction between two families of classifiers, however:
Within each of these families, different classifiers are available. These "concrete" classifiers have additional methods specific to the particular classifier type. The currently implemented classifiers are all k-nearest-neighbor classifiers, but we plan to add other classifiers as needed.
- kNNInteractive
- Interactive k nearest neighbor classifier.
- kNNNonInteractive
- Noninteractive k nearest neighbor classifier. The weights of the dimensions can be optimised using a genetic algorithm. To learn more about applying genetic algorithms for feature selection, see the Evolutionary Optimization Module documentation.
This section describes each method of the classifier interface. The base class for all classifiers is Classifier, from which two classes NonInteractiveClassifier and InteractiveClassifier are derived. As noninteractive classifiers are more limited in their interface, this divides the methods into two categories:
The following methods and properties are available to all classifiers.
Each classifier has the following member variables:
- _database
- List of glyphs used as training data. This is a private property that can only be accessed through the methods get_glyphs and set_glyphs, or it is set in the constructor or with from_xml_filename. Note that the return value of get_glyphs must be converted to a list with list(classifier.get_glyphs()).
- confidence_types
- List of confidence types that are to be computed during classification. The confidence types must be from the predefined confidence constants.
As the base class Classifier does not have an explicit constructor, the constructor of NonInteractiveClassifier is described here.
NonInteractiveClassifier (ImageList database = [], bool perform_splits = True)
Creates a new classifier instance.
The database argument can come in two forms:
- When it is a list (or any Python iterable), each element is a glyph to use as training data for the classifier.
- For non-interactive classifiers only, when database is a filename, the classifier will be "unserialized" from the given file.
Any images in the list that were manually classified (have classification_state == MANUAL) will be used as training data for the classifier. Any UNCLASSIFIED or AUTOMATICALLY classified images will be ignored.
When initializing a noninteractive classifier, the database must be non-empty.
If perform_splits is True, glyphs trained with names beginning with _split. are run through a given splitting algorithm. For instance, glyphs that need to be broken into upper and lower halves for further classification of those parts would be trained as _split.splity. When the automatic classifier encounters glyphs that most closely match those trained as _split, it will perform the splitting algorithm and then continue to recursively classify its parts.
The splitting algorithms are documented in the plugin documentation.
New splitting algorithms can be created by writing plugin methods in the category Segmentation.
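The naming convention for split classes can be illustrated with a small helper. This is only a sketch of how a class name such as _split.splity maps to the name of a splitting algorithm; split_algorithm_name is a hypothetical helper and not part of Gamera.

```python
SPLIT_PREFIX = "_split."


def split_algorithm_name(id_name):
    """Return the splitting algorithm encoded in a class name, or None.

    Hypothetical helper for illustration only; not a Gamera function.
    """
    if id_name.startswith(SPLIT_PREFIX):
        return id_name[len(SPLIT_PREFIX):]
    return None


split_name = split_algorithm_name("_split.splity")   # a split class
plain_name = split_algorithm_name("lower.i")         # an ordinary class
```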
The following methods deal with classifying glyphs on an individual level.
classify_glyph_automatic (Image glyph, int max_recursion = 10)
Classifies a glyph and sets its classification_state and id_name. (If you don't want the glyph changed, use guess_glyph_automatic.)
Returns a 2-tuple (pair) of lists: (add, remove). add is a list of glyphs that were created by classifying glyph as a split (See Initialization). remove is a list of glyphs that are no longer valid due to reclassifying glyph from a split to something else. Most often, both of these lists will be empty. You will normally want to use these lists to update the collection of glyphs on the current page.
classify_list_automatic (ImageList glyphs, int max_recursion = 10)
Classifies a list of glyphs and sets the classification_state and id_name of each glyph. (If you don't want it set, use guess_glyph_automatic.)
Return type: (add, remove)
The list glyphs is never modified by the function. Instead, it returns a 2-tuple (pair) of lists: (add, remove). add is a list of glyphs that were created by classifying glyphs as a split (See Initialization). remove is a list of glyphs that are no longer valid due to reclassifying glyphs from a split to something else. Most often, both of these lists will be empty. You will normally want to use these lists to update the collection of glyphs on the current page. If you just want a new list returned with these updates already made, use classify_and_update_list_automatic.
classify_and_update_list_automatic (ImageList glyphs, Int max_recursion = 10)
A convenience wrapper around classify_list_automatic that returns a list of glyphs that is already updated based on splitting.
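How an (add, remove) pair is typically applied to the glyphs of a page can be sketched without Gamera itself. update_glyph_list is a hypothetical helper, and plain objects stand in for glyph images; in a real script, add and remove would come from classify_list_automatic.

```python
def update_glyph_list(glyphs, add, remove):
    """Return a new list: glyphs without those in `remove`, plus `add`.

    Objects are compared by identity, so the helper does not depend on
    how the glyph type defines equality or hashing.
    """
    removed_ids = set(id(g) for g in remove)
    updated = [g for g in glyphs if id(g) not in removed_ids]
    updated.extend(add)
    return updated


# Plain objects stand in for glyph images.
a, b, c, d = object(), object(), object(), object()
page = [a, b, c]
new_page = update_glyph_list(page, add=[d], remove=[b])
```

The input list is left untouched, which mirrors the statement above that glyphs is never modified by the classification call.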
(id_name, confidencemap) guess_glyph_automatic (Image glyph)
Classifies the given glyph without setting its classification. The return value is a tuple of the form (id_name, confidencemap), where id_name is a list in the id_name format and confidencemap is a map in the confidence format that lists the confidences of the main id.
(id_name, confidencemap) classify_with_images (ImageList glyphs, Image glyph, bool cross_validation_mode=False, bool do_confidence=True )
Classifies an unknown image using the given list of images as training data. The glyph is classified without setting its classification. The return value is a tuple of the form (id_name, confidencemap), where id_name is a list in the id_name format and confidencemap is a map in the confidence format that lists the confidences of the main id.
Often, characters do not cleanly correspond to connected components. For instance, broken or degraded printing may disconnect parts of a character, or a character such as i may always be made up of two connected components. The grouping algorithm is designed to deal with those cases. It attempts to group connected components with others nearby in order to create groupings that are more like glyphs in the database. This approach is much slower than the "one-connected-component-at-a-time" approach, but can produce considerably better results on certain images.
To train for grouping, images corresponding to the entire character must exist in the database. For instance, in the Gamera GUI, one would select both the dot and stem of a lower-case i and train it as _group.lower.i. This will join the two connected components into a single image and then add it to the database.
The algorithm is described in more detail in our paper on correcting broken characters (PDF).
group_list_automatic (ImageList glyphs, Function grouping_function = None, Function evaluate_function = None, int max_parts_per_group = 4, int max_graph_size = 16, criterion = 'min')
Classifies the given list of glyphs. Adjacent glyphs are joined together if doing so results in a higher global confidence. Each part of a joined glyph is classified as HEURISTIC with the prefix _group.
- grouping_function
- A function that determines how glyphs are initially combined. It must take exactly two arguments, to which the grouping algorithm passes an arbitrary pair of glyphs from glyphs. If the two glyphs should be considered for grouping, the function should return True, otherwise False.
In gamera.classify, there are two predefined grouping functions:
When grouping_function is None, BoundingBoxGroupingFunction(4) is used.
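A custom grouping function only has to accept two glyphs and return a boolean. The following sketch shows one possible shape for such a callable. It is not the implementation of the predefined grouping functions: Box is a stand-in for a glyph's bounding box, BoxDistanceGrouping is a hypothetical name, and the coordinate attribute names are assumptions chosen for the example.

```python
from collections import namedtuple

# Stand-in for a glyph's bounding box (inclusive pixel coordinates).
Box = namedtuple("Box", "ul_x ul_y lr_x lr_y")


class BoxDistanceGrouping(object):
    """Callable that returns True when two boxes lie within `threshold` pixels."""

    def __init__(self, threshold):
        self.threshold = threshold

    def __call__(self, a, b):
        # Gap along each axis; zero when the boxes overlap on that axis.
        gap_x = max(a.ul_x - b.lr_x, b.ul_x - a.lr_x, 0)
        gap_y = max(a.ul_y - b.lr_y, b.ul_y - a.lr_y, 0)
        return gap_x <= self.threshold and gap_y <= self.threshold


near = BoxDistanceGrouping(4)
stem = Box(10, 20, 14, 40)    # stem of a lower-case i
dot = Box(10, 12, 14, 17)     # its dot, 3 pixels above the stem
far = Box(100, 20, 110, 40)   # an unrelated glyph further right

stem_and_dot = near(stem, dot)
stem_and_far = near(stem, far)
```

Making the function a callable object keeps the threshold configurable while still satisfying the two-argument requirement.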
- evaluate_function
- A function that evaluates a grouping of glyphs. It must take exactly one argument, which is a list of glyphs, and should return a confidence value between 0 and 1 (1 being most confident) representing how confidently the grouping forms a valid character.
If no evaluate_function is provided, a default one will be used that returns the CONFIDENCE_DEFAULT of the knn classification.
The function returns a 2-tuple (pair) of lists: (add, remove). add is a list of glyphs that were created by classifying any glyphs as a split (See Initialization) or grouping. remove is a list of glyphs that are no longer valid due to reclassifying glyphs from a split to something else.
The list glyphs is never modified. Instead, detected parts of groups are classified as _group._part.*, where * stands for the class name of the grouped glyph. This means that after calling this function, you must remove from glyphs the glyphs listed in remove and all CCs with a class name beginning with _group._part, and you must add all glyphs from add to it. Alternatively, call group_and_update_list_automatic, which does this for you.
group_and_update_list_automatic (ImageList glyphs, Function grouping_function = None, Function evaluate_function = None, int max_parts_per_group = 5, int max_graph_size = 16, string criterion = 'min')
A convenience wrapper around group_list_automatic that returns a list of glyphs that is already updated for splitting and grouping.
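The bookkeeping that this wrapper performs after grouping can be sketched roughly as follows. FakeGlyph is a stand-in for a glyph image that exposes only its main class name through get_main_id, and apply_grouping_result is a hypothetical helper; the real wrapper may differ in detail.

```python
PART_PREFIX = "_group._part"


class FakeGlyph(object):
    """Minimal stand-in for a glyph: it only carries a main class name."""

    def __init__(self, name):
        self.name = name

    def get_main_id(self):
        return self.name


def apply_grouping_result(glyphs, add, remove):
    """Drop removed glyphs and group parts, then append the new glyphs."""
    removed_ids = set(id(g) for g in remove)
    kept = [g for g in glyphs
            if id(g) not in removed_ids
            and not g.get_main_id().startswith(PART_PREFIX)]
    return kept + list(add)


dot = FakeGlyph("_group._part.lower.i")
stem = FakeGlyph("_group._part.lower.i")
other = FakeGlyph("lower.a")
joined = FakeGlyph("_group.lower.i")

result = apply_grouping_result([dot, stem, other], add=[joined], remove=[])
names = [g.get_main_id() for g in result]
```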
These functions deal with saving and loading the training data of the classifier to/from the Gamera XML format.
Note
UNCLASSIFIED glyphs in the training data are ignored (neither saved nor loaded).
to_xml (stream stream)
Saves the training data in XML format to the given stream (which could be any object supporting the file protocol, such as a file object or StringIO object).
to_xml_filename (FileSave filename)
Saves the training data in XML format to the given filename.
from_xml (stream stream)
Loads the training data from the given stream (which could be any object supporting the file protocol, such as a file object or StringIO object).
from_xml_filename (FileOpen filename)
Loads the training data from the given filename.
merge_from_xml (stream stream)
Loads the training data from the given stream (which could be a file handle or StringIO object) and adds it to the existing training data.
merge_from_xml_filename (FileOpen filename)
Loads the training data from the given filename and adds it to the existing training data.
set_glyphs (ImageList glyphs)
Sets the training data for the classifier to the given list of glyphs.
On some non-interactive classifiers, this operation can be quite expensive.
merge_glyphs (ImageList glyphs)
Adds the given glyphs to the current set of training data.
On some non-interactive classifiers, this operation can be quite expensive.
classify_glyph_manual (Image glyph, String id)
Sets the classification of the given glyph to the given id and then adds the glyph to the training data. Call this function when the end user definitively knows the identity of the glyph.
Note
Here id is a simple string, not of the id_name format, since the confidence of a manual classification is always 1.0.
classify_list_manual (ImageList glyphs, String id)
Sets the classification of the given glyphs to the given id and then adds the glyphs to the training data. Call this function when the end user definitively knows the identity of the glyphs.
If id begins with the special prefix _group, all of the glyphs in glyphs are combined and the result is added to the training data. This is useful for characters that always appear with multiple connected components, such as the lower-case i.
Note
Here id is a simple string, not of the id_name format, since the confidence of a manual classification is always 1.0.
classify_and_update_list_manual (ImageList glyphs, Function grouping_function = None, Function evaluate_function = None, int max_size = 5)
A convenience wrapper around group_list_automatic that returns a list of glyphs that is already updated for splitting and grouping.
add_to_database (ImageList glyphs)
Adds the given glyph (or list of glyphs) to the classifier training data. Will not add duplicates to the training data. Unlike classify_glyph_manual, no grouping support is performed.
remove_from_database (ImageList glyphs)
Removes the given glyphs from the classifier training data. Glyphs that are not in the training data are silently ignored.
display (ImageList current_database = [], Image context_image = None, List symbol_table = [])
Displays the interactive classifier window, which is where manual training usually takes place.
The k Nearest Neighbor classifier is a concrete example of the classifier API. It adds some methods of its own.
kNNNonInteractive has a number of advantages over kNNInteractive:
Note
It is good practice to retain the XML file, since it is portable across platforms and to future versions of Gamera. The binary format is not guaranteed to be portable.
The classifier automatically manages the generation of feature vectors from glyphs. When a feature vector is needed because a glyph is being automatically classified or added to the training set, it is generated on the fly.
By default, the feature generation method in kNN is quite simple. The user of the classifier provides a list of feature function names (either in the constructor or through the change_feature_set method), and for each glyph, the results of each feature function in the set are appended together to produce a feature vector.
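The concatenation step can be pictured with a few lines of plain Python. The feature functions below are toy stand-ins that operate on a dictionary rather than on an image, and the names ratio, size and feature_vector are hypothetical; the sketch only shows how per-feature results are appended in the order of the feature list.

```python
def ratio(glyph):
    # One-dimensional toy feature: width divided by height.
    return [glyph["ncols"] / float(glyph["nrows"])]


def size(glyph):
    # Two-dimensional toy feature: width and height.
    return [float(glyph["ncols"]), float(glyph["nrows"])]


# Lookup table from feature name to feature function.
FEATURES = {"ratio": ratio, "size": size}


def feature_vector(glyph, feature_names):
    """Append the results of each named feature function, in order."""
    vector = []
    for name in feature_names:
        vector.extend(FEATURES[name](glyph))
    return vector


glyph = {"ncols": 10, "nrows": 20}
vec = feature_vector(glyph, ["ratio", "size"])
```

Because the order of the feature names fixes the layout of the vector, the same list must be used for training data and for unknown glyphs.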
This basic feature generation method can be overridden and replaced with something more appropriate to other problem domains. See the overriding kNN's feature generation appendix for more information.
kNNNonInteractive (ImageList database = [], features = 'all', bool perform_splits = True, int num_k = 1, bool normalize = False)
Creates a new kNN classifier instance.
The database argument can be in one of two forms:
- When it is a list (or any Python iterable), each element is a glyph to use as training data for the classifier. (For non-interactive classifiers, this list must be non-empty.)
- For non-interactive classifiers, database may be a filename, in which case the classifier will be "unserialized" from the given file.
Any images in the list that were manually classified (have classification_state == MANUAL) will be used as training data for the classifier. Any UNCLASSIFIED or AUTOMATICALLY classified images will be ignored.
When initializing a noninteractive classifier, the database must be non-empty.
If perform_splits is True, glyphs trained with names beginning with _split. are run through a given splitting algorithm. For instance, glyphs that need to be broken into upper and lower halves for further classification of those parts would be trained as _split.splity. When the automatic classifier encounters glyphs that most closely match those trained as _split, it will perform the splitting algorithm and then continue to recursively classify its parts.
The splitting algorithms are documented in the plugin documentation.
New splitting algorithms can be created by writing plugin methods in the category Segmentation.
Settings are various parameters that control the behavior of the classifier. While some are only accessible through the methods given below, the following settings are plain properties of all kNN classifier classes:
change_feature_set (features)
Changes the set of features used in the classifier to the given list of feature names.
get_selections_by_features ()
Get the selection vector elements.
This function returns a python dictionary: keys are the feature names, values are lists of zeros and ones, where ones correspond to the selected components.
get_selections_by_feature (String feature_name)
Convenience wrapper for get_selections_by_features function. This function returns only the selection values list for the given feature name.
get_weights_by_features ()
Get the weighting vector elements.
This function returns a python dictionary: keys are the feature names, values are lists of real values in [0,1], which give the weight of the respective component.
get_weights_by_feature (String feature_name)
Convenience wrapper for get_weights_by_features function. This function returns only the weighting values list for the given feature name.
set_selections_by_features (Dictionary values)
Set the selection vector elements by the corresponding feature name.
The dictionary must contain an entry for every feature of the currently active feature set, that has been set in the constructor of the classifier or by change_feature_set. Example:
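from gamera import knn
# Third positional argument (0) is perform_splits.
classifier = knn.kNNNonInteractive("train.xml",
                                   ["aspect_ratio", "moments"], 0)
# One 0/1 entry per feature component.
classifier.set_selections_by_features({"aspect_ratio": [1],
                                       "moments": [0, 1, 1, 1, 1, 1, 1, 1, 0]})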
set_selections_by_feature (String feature_name, List values)
Set the selection vector elements for one specific feature.
set_weights_by_features (Dictionary values)
Set the weighting vector elements by the corresponding feature name. The dictionary must contain an entry for every feature of the currently active feature set, as set in the constructor of the classifier or by change_feature_set. Example:
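from gamera import knn
# Third positional argument (0) is perform_splits.
classifier = knn.kNNNonInteractive("train.xml",
                                   ["aspect_ratio", "moments"], 0)
# One weight in [0, 1] per feature component.
classifier.set_weights_by_features({"aspect_ratio": [0.6],
                                    "moments": [0.1, 1.0, 0.3, 0.5, 1.0, 0.0, 1.0, 0.9, 0.0]})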
set_weights_by_feature (String feature_name, List values)
Set the weighting vector elements for one specific feature.
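How selections and weights might enter a distance computation can be sketched in plain Python. The sketch assumes a city-block metric in which unselected components are skipped and selected ones are scaled by their weight; the metric actually used depends on the classifier's distance type. The helpers flatten and weighted_city_block and the feature names f1 and f2 are hypothetical.

```python
def flatten(per_feature, feature_names):
    """Concatenate per-feature value lists in feature-set order."""
    flat = []
    for name in feature_names:
        flat.extend(per_feature[name])
    return flat


def weighted_city_block(a, b, selections, weights):
    """Sum of weight * |a - b| over the selected components only."""
    return sum(w * abs(x - y)
               for x, y, s, w in zip(a, b, selections, weights) if s)


features = ["f1", "f2"]   # f1 has one component, f2 has three
selections = flatten({"f1": [1], "f2": [0, 1, 1]}, features)
weights = flatten({"f1": [0.5], "f2": [1.0, 1.0, 0.25]}, features)

d = weighted_city_block([1.0, 5.0, 2.0, 4.0],
                        [3.0, 9.0, 2.5, 0.0],
                        selections, weights)
```

In this example the second component is deselected, so its difference of 4.0 does not contribute to the result.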
save_settings (FileSave filename)
Save the kNN settings to the given filename. This settings file (which is XML) includes k, distance type, the current selection and weighting. This file is different from the one produced by serialize in that it contains only the settings and no data.
load_settings (FileOpen filename)
Load the kNN settings from an XML file. See save_settings.
serialize (FileSave filename)
Saves the classifier-specific settings and data in an optimized and classifier-specific format.
Note
It is good practice to retain the XML file, since it is portable across platforms and to future versions of Gamera. The binary format is not guaranteed to be portable.
unserialize (FileOpen filename)
Opens the classifier-specific settings and data from an optimized and classifier-specific format.
Float evaluate ()
Evaluate the performance of the kNN classifier using leave-one-out cross-validation. The return value is a floating-point number between 0.0 (0% correct) and 1.0 (100% correct).
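Leave-one-out cross-validation itself is easy to sketch. The following is a minimal pure-Python illustration for a 1-nearest-neighbor rule with a city-block distance, both of which are assumptions made for the example; it is not Gamera's implementation, and leave_one_out_accuracy is a hypothetical name.

```python
def city_block(a, b):
    """City-block (L1) distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def leave_one_out_accuracy(samples):
    """samples is a list of (feature_vector, class_name) pairs.

    Each sample is classified by its nearest neighbor among all
    other samples; the fraction of correct decisions is returned.
    """
    correct = 0
    for i, (vec, label) in enumerate(samples):
        others = samples[:i] + samples[i + 1:]
        nearest = min(others, key=lambda s: city_block(vec, s[0]))
        if nearest[1] == label:
            correct += 1
    return correct / float(len(samples))


samples = [([0.0], "a"), ([0.1], "a"),
           ([1.0], "b"), ([1.1], "b"),
           ([0.5], "b")]   # the last sample lies closer to class "a"
accuracy = leave_one_out_accuracy(samples)
```

Four of the five samples are recognized correctly here; the sample at 0.5 is closest to a sample of class "a" and counts as an error.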
knndistance_statistics (Int k = 0)
Returns a list of average distances between each training sample and its k nearest neighbors. So, when you have n training samples, n average distance values are returned. This can be useful for distance rejection.
Each item in the returned list is a tuple (d, classname), where d is the average kNN distance and classname is the class name of the training sample. In most cases, the class name is of little interest, but it could be useful if you need class conditional distance statistics. Beware however, that the average distance is computed over neighbors belonging to any class, not just the same class. If you need the latter, you must create a new classifier from training samples belonging only to the specific class.
When k is zero, the property num_k of the knn classifier is used.
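The quantity described above can be reproduced on toy data as follows. This sketch assumes a city-block distance and uses the hypothetical name average_knn_distances. Taking the largest observed average as a rejection threshold is just one possible rule and is not prescribed by Gamera.

```python
def city_block(a, b):
    """City-block (L1) distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def average_knn_distances(samples, k):
    """Return one (average distance, class_name) tuple per sample.

    The average is taken over the k nearest other samples,
    regardless of the class they belong to.
    """
    stats = []
    for i, (vec, label) in enumerate(samples):
        dists = sorted(city_block(vec, other)
                       for j, (other, _) in enumerate(samples) if j != i)
        nearest = dists[:k]
        stats.append((sum(nearest) / float(len(nearest)), label))
    return stats


samples = [([0.0], "a"), ([1.0], "a"), ([3.0], "b")]
stats = average_knn_distances(samples, k=2)

# One possible rejection rule: treat anything further away than the
# largest average seen in the training data as "unknown".
threshold = max(d for d, _ in stats)
```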
distance_from_images (ImageList glyphs, Image glyph, Float max = None)
Compute a list of distances between a list of glyphs and a single glyph. Distances greater than max are not included in the output. The return value is a list of floating-point distances.
distance_between_images (Image imagea, Image imageb)
Compute the distance between two images using the settings of the kNN object (distance_type, features, weights, etc.). This can be used when more control over the distance computations is needed than is offered by the other methods that work on multiple images at once.
distance_matrix (ImageList images, Bool normalize = True)
Create a symmetric FloatImage containing all of the distances between the images in the list passed in. This is useful because it allows you to look up the distance between any pair of images regardless of the order of the pair.
unique_distances (ImageList images, Bool normalize = True)
Return a list of the unique pairs of images in the passed-in list and the distances between them. The return value is a list of tuples of the form (distance, imagea, imageb), so that it is easy to sort.
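The relationship between the unique pair list and the symmetric matrix can be sketched on plain feature vectors. In this illustration, list indices stand in for images, the distance is assumed to be city-block, and the normalize option is not modeled; both function names are hypothetical.

```python
from itertools import combinations


def city_block(a, b):
    """City-block (L1) distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def unique_pair_distances(vectors):
    """Return sorted (distance, index_a, index_b) for every unordered pair."""
    pairs = [(city_block(vectors[i], vectors[j]), i, j)
             for i, j in combinations(range(len(vectors)), 2)]
    pairs.sort()   # distance comes first, so tuples sort by distance
    return pairs


def symmetric_distance_matrix(vectors):
    """Fill a square matrix so that matrix[i][j] == matrix[j][i]."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for d, i, j in unique_pair_distances(vectors):
        matrix[i][j] = matrix[j][i] = d
    return matrix


vectors = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]]
pairs = unique_pair_distances(vectors)
matrix = symmetric_distance_matrix(vectors)
```

Putting the distance first in each tuple is what makes a plain sort order the pairs from most to least similar.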
kNNInteractive (ImageList database = [], features = 'all', bool perform_splits = True, int num_k = 1)
Creates a new kNN interactive classifier instance.
The database argument must be a list (or Python iterable) containing glyphs to use as training data for the classifier.
Any images in the list that were manually classified (have classification_state == MANUAL) will be used as training data for the classifier. Any UNCLASSIFIED or AUTOMATICALLY classified images will be ignored.
When initializing a noninteractive classifier, the database must be non-empty.
If perform_splits is True, glyphs trained with names beginning with _split. are run through a given splitting algorithm. For instance, glyphs that need to be broken into upper and lower halves for further classification of those parts would be trained as _split.splity. When the automatic classifier encounters glyphs that most closely match those trained as _split, it will perform the splitting algorithm and then continue to recursively classify its parts.
The splitting algorithms are documented in the plugin documentation.
New splitting algorithms can be created by writing plugin methods in the category Segmentation.
noninteractive_copy ()
Creates a non-interactive copy of the interactive classifier.
Gamera provides a way to improve kNN classifiers by modifying the underlying set of glyphs (training-set). This class of algorithms either removes bad or redundant glyphs or even creates new optimal glyphs from the training-set.
Besides the graphical user interface in the Classifier Display, it is also possible to invoke the algorithms from your script.
Each editing algorithm is a function that takes at least one parameter, a kNNInteractive classifier, and returns a new, edited kNNInteractive classifier. Any additional parameters depend on the specific algorithm but are optional by convention.
Currently the following editing algorithms are included with Gamera:
edit_mnn (kNNInteractive classifier, int k = 0, bool protectRare, int rareThreshold)
Wilson's Modified Nearest Neighbour (MNN, aka Leave-one-out-editing). The algorithm removes 'bad' glyphs from the classifier, i.e. glyphs that are outliers from their class in feature space, usually because they have been manually misclassified or are not representative of their class.
- classifier
- The classifier from which to create an edited copy
- internalK
- The k value used internally by the editing algorithm. If 0 is given for this parameter, the original classifier's k is used (recommended).
- protect rare classes
- The algorithm tends to completely delete the items of rare classes, removing the whole class from the classifier. If this is not desired, these rare classes can be explicitly protected from deletion. Note that enabling this option requires additional computation.
- rare class threshold
- If protect rare classes is enabled, classes with fewer than this number of elements are considered rare.
Reference: D. Wilson: 'Asymptotic Properties of NN Rules Using Edited Data'. IEEE Transactions on Systems, Man, and Cybernetics, 2(3):408-421, 1972
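The idea behind this kind of editing can be sketched in plain Python: each sample is classified by its k nearest neighbors among all other samples, and it is kept only if that decision agrees with its own label. This is a textbook-style illustration under the assumption of a city-block distance, not the code of edit_mnn; the protection of rare classes is not modeled, and the function names are hypothetical.

```python
from collections import Counter


def city_block(a, b):
    """City-block (L1) distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_label(vec, samples, k):
    """Majority label among the k samples nearest to vec."""
    nearest = sorted(samples, key=lambda s: city_block(vec, s[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def wilson_edit(samples, k=3):
    """Keep only samples that their own neighbors classify correctly.

    Every decision is made against the original, unedited set.
    """
    kept = []
    for i, (vec, label) in enumerate(samples):
        others = samples[:i] + samples[i + 1:]
        if knn_label(vec, others, k) == label:
            kept.append((vec, label))
    return kept


samples = [([0.0], "a"), ([0.2], "a"), ([0.4], "a"),
           ([0.3], "b"),   # a "b" sample in the middle of class "a"
           ([2.0], "b"), ([2.2], "b"), ([2.4], "b")]
edited = wilson_edit(samples, k=3)
```

Only the outlier at 0.3 is dropped; its three nearest neighbors all belong to class "a".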
edit_cnn (kNNInteractive classifier, int k = 0, bool randomize)
Hart's Condensed Nearest Neighbour (CNN) editing. This algorithm specializes in removing superfluous glyphs (glyphs that do not influence the recognition rate) from the classifier to improve its classification speed. Typically, glyphs far from the classifier's decision boundaries are removed.
- classifier
- The classifier from which to create an edited copy
- internalK
- The k value used internally by the editing algorithm. If 0 is given, the k value of the given classifier is used (recommended).
- randomize
- Because the processing order of the glyphs in the classifier impacts the result of this algorithm, the order will be randomized. If reproducible results are required, turn this option off.
Reference: P.E. Hart: 'The Condensed Nearest Neighbor rule'. IEEE Transactions on Information Theory, 14(3):515-516, 1968
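The condensing idea can be sketched as follows: start with a single sample in a store, then repeatedly add every sample that the current store misclassifies, until a full pass adds nothing. The sketch assumes a 1-nearest-neighbor rule, a city-block distance and a fixed processing order (comparable to switching randomize off); it is not the code of edit_cnn, and the function names are hypothetical.

```python
def city_block(a, b):
    """City-block (L1) distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest_label(vec, store):
    """Label of the stored sample closest to vec."""
    return min(store, key=lambda s: city_block(vec, s[0]))[1]


def hart_condense(samples):
    """Return a subset that still classifies every sample correctly."""
    store = [samples[0]]
    changed = True
    while changed:
        changed = False
        for sample in samples:
            if sample in store:
                continue
            # A sample the store gets wrong must be kept.
            if nearest_label(sample[0], store) != sample[1]:
                store.append(sample)
                changed = True
    return store


samples = [([0.0], "a"), ([0.1], "a"), ([0.2], "a"),
           ([1.0], "b"), ([1.1], "b")]
condensed = hart_condense(samples)
```

In this well-separated example one sample per class is enough, so three of the five samples are discarded. A different processing order could keep different samples, which is why the order matters for this algorithm.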
edit_mnn_cnn (kNNInteractive classifier, int k = 0, bool protectRare, int rareThreshold, bool randomize)
Combined execution of Wilson's Modified Nearest Neighbour and Hart's Condensed Nearest Neighbour. Combining the algorithms in this order is recommended, because bad samples are first removed to improve the classifier's accuracy, and the remaining samples are then condensed to speed up the classifier.
For documentation of the parameters, see the individual algorithms.
The algorithms are located in the gamera.knn_editing package. So a typical usage example employing Wilson's Modified Nearest Neighbour (edit_mnn) would look like this:
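from gamera.core import init_gamera
init_gamera()
from gamera.knn import kNNInteractive
# Load the training data into an interactive kNN classifier.
classifier = kNNInteractive()
classifier.from_xml_filename("training-set.xml")
from gamera.knn_editing import edit_mnn
# Create an edited copy of the classifier using the default parameters.
editedClassifier = edit_mnn(classifier)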
To display the glyphs removed by the editing algorithm, you can use the following code in the Gamera GUI:
# Glyphs present in the original classifier but not in the edited one.
set1 = set(classifier.get_glyphs())
set2 = set(editedClassifier.get_glyphs())
imagelist = list(set1.difference(set2))
display_multi(imagelist)
If you have written your own editing algorithm and want to make it available in the Classifier GUI, refer to the documentation of the class gamera.knn_editing.AlgoRegistry.