Gamera classifier API

Last modified: September 16, 2022

Contents

Introduction
- Interactive vs. Noninteractive classifiers
- Types of classifiers
Image properties related to classification
The classifier interface
- Core
- Interactive classifiers
  - Classification
  - Display
    - display
k Nearest Neighbor classifier

Introduction

For manual training of a classifier, you will generally want to use the interactive classifier GUI. This document describes the programming API that is used by scripts that make use of a classifier.

At present, Gamera supports segmentation-based image classification. This means that the page image is first segmented into individual connected components (or glyphs). Each of these glyphs has a number of features generated from it. These features (collectively called a "feature vector") are then used inside a classifier which, using a database of training data, identifies the glyph.

Interactive vs. Noninteractive classifiers

All classifiers in Gamera support the same core Classifier API (interface), so they are interchangeable. There is an important distinction between two families of classifiers, however:

Interactive classifier: An interactive classifier can have training examples added to it in real time, and the results used immediately to classify glyphs. Interactive classifiers are useful during the training process since the classifier can be "bootstrapped" with a few examples and refined interactively.
Noninteractive classifier: Noninteractive classifiers take a complete database of training data and then create an optimised data structure for classification. Because building that data structure can take a considerable amount of time, new training examples cannot be added on the fly. In addition, noninteractive classifiers are serializable to binary classifier-specific file formats, which save and load much faster than the Gamera XML format.

Types of classifiers

Within each of these families, different classifiers are available. These "concrete" classifiers have additional methods specific to the particular classifier type. The currently implemented classifiers are all k-nearest-neighbor, but we plan to add other classifiers as needed.

kNNInteractive

Interactive k nearest neighbor classifier.

kNNNonInteractive

Noninteractive k nearest neighbor classifier. The weights of the dimensions can be optimised using a genetic algorithm. To learn more about applying genetic algorithms for feature selection, see the Evolutionary Optimization Module documentation.

Image properties related to classification

Classification results are stored with the image. Some interface functions for storing and querying the classification result are described in the classification plugin documentation.

`id_name`

The class name of a glyph is stored in the member variable id_name. This variable is actually a list of possible classifications, so that a classifier can return a number of different possibilities with different confidences. Each classification entry is a tuple of the form (float confidence, string name), where the "confidence" measure is of the type CONFIDENCE_DEFAULT (see the confidence property below).

For example, if among the k nearest neighbors of the image there are training samples both from class 'lower.b' and 'lower.d', its id_name variable might be:

[(0.0879, 'lower.b'), (0.0012, 'lower.d')]

The first entry in id_name is always the class decision made by the classifier, so the other entries can (and should) generally be ignored. Due to the simplistic definition of the CONFIDENCE_DEFAULT measure, the decision made by the classifier does not necessarily have the highest "confidence", nor do the values of all "confidences" sum to one.
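
As a minimal sketch of how such a list is read (plain Python with no Gamera import; the values are the ones from the example above, and the variable names are only illustrative):

```python
# id_name as described above: a list of (confidence, class name) tuples.
id_name = [(0.0879, 'lower.b'), (0.0012, 'lower.d')]

# The classifier's decision is always the first entry.
main_confidence, main_class = id_name[0]

# Do not pick the entry with the highest confidence instead: it is not
# guaranteed to be the decision, and the confidences need not sum to one.
total = sum(confidence for confidence, _ in id_name)
```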

`classification_state`

How a glyph was classified is managed by the classification_state member variable. It can be one of the following values:

Color	Constant	Description
(white)	UNCLASSIFIED	The connected component is completely unclassified.
(red)	AUTOMATIC	The connected component was classified by the automatic classifier using training data.
(yellow)	HEURISTIC	The connected component was classified by some heuristic (non-exemplar-based) process.
(green)	MANUAL	The connected component was classified by a human.

`confidence`

Different confidence measures for the main class id are stored in the member variable confidence. This is a map ('dictionary' in python lingo) with a confidence type as key. The possible confidence type constants are defined in gamera.gameracore and have the following meaning:

CONFIDENCE_DEFAULT

(1-dist/max_dist)^10, where dist is the distance to the closest prototype of the main class, and max_dist is the distance to the farthest prototype of the entire training population. This "confidence" is always between 0.0 and 1.0, where 1.0 only occurs for feature values that exactly match a training sample.

CONFIDENCE_KNNFRACTION

Fraction of samples belonging to the main class among the k nearest neighbors. This only makes sense for k > 1.

CONFIDENCE_INVERSEWEIGHT

Dudani's weighted average distance to the main class with each distance weighted inversely. This only makes sense for k > 1.

CONFIDENCE_LINEARWEIGHT

Dudani's weighted average distance to the main class with each distance weighted linearly between the closest neighbor (weight one) and the k-th nearest neighbor (weight zero). This only makes sense for k > 1.

CONFIDENCE_NUN

Dasarathy's "nearest unlike neighbor" measure. Let dnn denote the distance to the closest prototype of the main class (nearest neighbor) and dnun the distance to the closest prototype of the remaining classes (nearest unlike neighbor). Then the NUN confidence measure is (1-dnn/dnun).

CONFIDENCE_NNDISTANCE

Absolute distance to the nearest neighbor. This can take arbitrary values and is in itself meaningless, but can be useful for distance rejection.

CONFIDENCE_AVGDISTANCE

Average distance to the k nearest neighbors. This can take arbitrary values and is in itself meaningless, but can be useful for distance rejection. For k = 1, this is identical to CONFIDENCE_NNDISTANCE.

The confidence map is only filled for automatically classified glyphs. Which confidence types are actually written to the confidence map depends on the confidence_types property of the classifier. So when the list Classifier.confidence_types is empty, no confidence values will be written to Image.confidence.
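
To make some of the definitions above concrete, here is a minimal sketch in plain Python. It is not Gamera's implementation: the prototype distances are made-up values, the function names are hypothetical, and the two Dudani measures are left out.

```python
from collections import Counter

def confidence_default(dist, max_dist):
    """(1 - dist/max_dist)^10, as defined for CONFIDENCE_DEFAULT."""
    return (1.0 - dist / max_dist) ** 10

def confidence_knnfraction(neighbor_classes, main_class):
    """Fraction of the k nearest neighbors that belong to main_class."""
    return neighbor_classes.count(main_class) / len(neighbor_classes)

def confidence_nun(d_nn, d_nun):
    """Nearest-unlike-neighbor measure 1 - dnn/dnun."""
    return 1.0 - d_nn / d_nun

# Made-up distances from one glyph to all prototypes: (distance, class name).
prototypes = [(0.2, 'lower.b'), (0.5, 'lower.d'), (0.6, 'lower.b'), (2.0, 'lower.d')]
k = 3
neighbors = sorted(prototypes)[:k]
main_class = Counter(c for _, c in neighbors).most_common(1)[0][0]

max_dist = max(d for d, _ in prototypes)
d_nn = min(d for d, c in prototypes if c == main_class)
d_nun = min(d for d, c in prototypes if c != main_class)

default = confidence_default(d_nn, max_dist)
fraction = confidence_knnfraction([c for _, c in neighbors], main_class)
nun = confidence_nun(d_nn, d_nun)
nn_distance = neighbors[0][0]
avg_distance = sum(d for d, _ in neighbors) / k
```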

Reference:

C. Dalitz: Reject Options and Confidence Measures for kNN Classifiers. In C. Dalitz (Ed.): "Document Image Analysis with the Gamera Framework." Schriftenreihe des Fachbereichs Elektrotechnik und Informatik, Hochschule Niederrhein, vol. 8, pp. 16-38, Shaker Verlag (2009)

Methods

Images have a number of methods for managing their classification state. Use of these methods is highly recommended over changing the id_name variable directly. These methods are documented in the classification section of the plugin documentation.

The classifier interface

This section describes each method of the classifier interface. The base class for all classifiers is Classifier, from which two classes NonInteractiveClassifier and InteractiveClassifier are derived. As noninteractive classifiers are more limited in their interface, this divides the methods into two categories:

Core: Methods available to all classifiers
Interactive: Methods available only to interactive classifiers

Core

The following methods and properties are available to all classifiers.

Properties

Each classifier has the following member variables:

_database

List of glyphs used as training data. This is a private property that can only be accessed through the methods get_glyphs and set_glyphs, or it is set in the constructor or with from_xml_filename. Note that the return value of get_glyphs must be converted to a list with list(classifier.get_glyphs()).

confidence_types

List of confidence types that are to be computed during classification. The confidence types must be from the predefined confidence constants.

Initialization

As the base class Classifier does not have an explicit constructor, the constructor of NonInteractiveClassifier is described here.

`init`

NonInteractiveClassifier (ImageList database = [], bool perform_splits = True)

Creates a new classifier instance.

database

Can come in two forms:

When a list (or Python iterable), each element is a glyph to use as training data for the classifier.

For non-interactive classifiers only, when database is a filename, the classifier will be "unserialized" from the given file.

Any images in the list that were manually classified (have classification_state == MANUAL) will be used as training data for the classifier. Any UNCLASSIFIED or AUTOMATIC images will be ignored.

When initializing a noninteractive classifier, the database must be non-empty.

perform_splits

If perform_splits is True, glyphs trained with names beginning with _split. are run through a given splitting algorithm. For instance, glyphs that need to be broken into upper and lower halves for further classification of those parts would be trained as _split.splity. When the automatic classifier encounters glyphs that most closely match those trained as _split, it will perform the splitting algorithm and then continue to recursively classify its parts.

The splitting algorithms are documented in the plugin documentation.

New splitting algorithms can be created by writing plugin methods in the category Segmentation.

Classification

The following methods deal with classifying glyphs on an individual level.

`classify_glyph_automatic`

classify_glyph_automatic (Image glyph, int max_recursion = 10)

Classifies a glyph and sets its classification_state and id_name. (If you don't want the glyph changed, use guess_glyph_automatic.)

glyph: The glyph to classify.
max_recursion (optional): Limit the number of split recursions.

Returns a 2-tuple (pair) of lists: (add, remove). add is a list of glyphs that were created by classifying glyph as a split (See Initialization). remove is a list of glyphs that are no longer valid due to reclassifying glyph from a split to something else. Most often, both of these lists will be empty. You will normally want to use these lists to update the collection of glyphs on the current page.

`classify_list_automatic`

classify_list_automatic (ImageList glyphs, int max_recursion = 10)

Classifies a list of glyphs and sets the classification_state and id_name of each glyph. (If you don't want it set, use guess_glyph_automatic.)

glyphs: A list of glyphs to classify.
max_recursion: The maximum level of recursion to follow when splitting glyphs. Since some glyphs will split into parts that then classify as _split in turn, a maximum depth should be set to avoid infinite recursion. This number can normally be set quite low, depending on the application.

Return type: (add, remove)

The list glyphs is never modified by the function. Instead, it returns a 2-tuple (pair) of lists: (add, remove). add is a list of glyphs that were created by classifying glyphs as a split (See Initialization). remove is a list of glyphs that are no longer valid due to reclassifying glyphs from a split to something else. Most often, both of these lists will be empty. You will normally want to use these lists to update the collection of glyphs on the current page. If you just want a new list returned with these updates already made, use classify_and_update_list_automatic.
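
The bookkeeping described above can be sketched in plain Python. This is only an illustration of how the (add, remove) pair is applied, not Gamera's own code; FakeGlyph and update_glyph_list are hypothetical names, and in a real script the pair would come from classify_list_automatic.

```python
class FakeGlyph:
    """Hypothetical stand-in for a Gamera glyph image."""
    def __init__(self, name):
        self.name = name

def update_glyph_list(glyphs, add, remove):
    """Return a new list: glyphs without the `remove` items, plus the `add` items."""
    # Compare by identity so that distinct glyph objects are never confused.
    removed_ids = {id(g) for g in remove}
    updated = [g for g in glyphs if id(g) not in removed_ids]
    updated.extend(add)
    return updated

g1, g2, stale = FakeGlyph('g1'), FakeGlyph('g2'), FakeGlyph('stale_part')
page_glyphs = [g1, g2, stale]
# Made-up result pair: two new parts from a split, one part that became invalid.
add, remove = [FakeGlyph('new_part_1'), FakeGlyph('new_part_2')], [stale]
updated = update_glyph_list(page_glyphs, add, remove)
```

The original list is left untouched, which mirrors the statement that glyphs is never modified.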

`classify_and_update_list_automatic`

classify_and_update_list_automatic (ImageList glyphs, int max_recursion = 10)

A convenience wrapper around classify_list_automatic that returns a list of glyphs that is already updated based on splitting.

`guess_glyph_automatic`

(id_name, confidencemap) guess_glyph_automatic (Image glyph)

Classifies the given glyph without setting its classification. The return value is a tuple of the form (id_name, confidencemap), where id_name is a list in the format described under id_name, and confidencemap is a map in the format described under confidence that lists the confidences of the main id.

`classify_with_images`

(id_name, confidencemap) classify_with_images (ImageList glyphs, Image glyph, bool cross_validation_mode=False, bool do_confidence=True )

Classifies an unknown image using the given list of images as training data. The glyph is classified without setting its classification. The return value is a tuple of the form (id_name, confidencemap), where id_name is a list in the format described under id_name, and confidencemap is a map in the format described under confidence that lists the confidences of the main id.

Grouping

Often, characters do not cleanly correspond to connected components. For instance, broken or degraded printing may disconnect parts of a character, or characters such as i may always be made up of two connected components. The grouping algorithm is designed to deal with those cases. It attempts to group connected components with others nearby in order to create groupings that are more like glyphs in the database. Needless to say, this approach is much slower than the "one-connected-component-at-a-time" approach, but can produce considerably better results on certain images.

To train for grouping, images corresponding to the entire character must exist in the database. For instance, in the Gamera GUI, one would select both the dot and stem of a lower-case i and train it as _group.lower.i. This will join the two connected components into a single image and then add it to the database.

The algorithm is described in more detail in our paper on correcting broken characters (PDF).

`group_list_automatic`

group_list_automatic (ImageList glyphs, Function grouping_function = None, Function evaluate_function = None, int max_parts_per_group = 4, int max_graph_size = 16, string criterion = 'min')

Classifies the given list of glyphs. Adjacent glyphs are joined together if doing so results in a higher global confidence. Each part of a joined glyph is classified as HEURISTIC with the prefix _group.

glyphs

The list of glyphs to group and classify.

grouping_function

A function that determines how glyphs are initially combined. This function must take exactly two arguments, to which the grouping algorithm will pass an arbitrary pair of glyphs from glyphs. If the two glyphs should be considered for grouping, the function should return True, else False.

In gamera.classify, there are two predefined grouping functions:

BoundingBoxGroupingFunction (threshold): A callable class that returns True when the bounding boxes are at most threshold apart.
ShapedGroupingFunction (threshold): A callable class that returns True when the closest distance between the black pixels is at most threshold.

When grouping_function is None, BoundingBoxGroupingFunction(4) is used.

evaluate_function

A function that evaluates a grouping of glyphs. This function must take exactly one argument which is a list of glyphs. The function should return a confidence value between 0 and 1 (1 being most confident) representing how confidently the grouping forms a valid character.

If no evaluate_function is provided, a default one will be used that returns the CONFIDENCE_DEFAULT of the knn classification.

max_parts_per_group

The maximum number of connected components that will be grouped together and tested as a group. For performance reasons, this number should be kept relatively small.

max_graph_size

Subgraphs (potentially connected areas of the image) larger than the given number of nodes will be ignored. This is a hack to prevent the runtime of the algorithm from exploding.

criterion

Determines how the grouping candidates are evaluated against each other in the optimization step. The default, 'min', chooses the grouping with the highest minimum confidence; 'avg' chooses the one with the highest average confidence.

The function returns a 2-tuple (pair) of lists: (add, remove). add is a list of glyphs that were created by classifying any glyphs as a split (See Initialization) or grouping. remove is a list of glyphs that are no longer valid due to reclassifying glyphs from a split to something else.

The list glyphs is never modified. Instead, detected parts of groups are classified as _group._part.*, where * stands for the class name of the grouped glyph. This means that after calling this function, you must remove the CCs in remove and all CCs with a class name beginning with _group._part from glyphs, and you must add all glyphs from add to it. Alternatively, you can call group_and_update_list_automatic, which does this automatically for you.
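
The manual update just described can be sketched as follows. This is a plain-Python illustration under stated assumptions, not Gamera's code: FakeGlyph and update_after_grouping are hypothetical names, and the class name given to the joined glyph is assumed for the sake of the example.

```python
class FakeGlyph:
    """Hypothetical stand-in for a glyph; only carries an id_name list."""
    def __init__(self, class_name):
        self.id_name = [(1.0, class_name)]

def update_after_grouping(glyphs, add, remove):
    """Drop removed glyphs and group parts, then append the new glyphs."""
    removed_ids = {id(g) for g in remove}
    kept = [
        g for g in glyphs
        if id(g) not in removed_ids
        and not g.id_name[0][1].startswith('_group._part')
    ]
    return kept + list(add)

dot = FakeGlyph('_group._part.lower.i')
stem = FakeGlyph('_group._part.lower.i')
other = FakeGlyph('lower.a')
joined = FakeGlyph('lower.i')  # class name of the joined glyph is assumed
updated = update_after_grouping([dot, stem, other], add=[joined], remove=[])
```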

`group_and_update_list_automatic`

group_and_update_list_automatic (ImageList glyphs, Function grouping_function = None, Function evaluate_function = None, int max_parts_per_group = 5, int max_graph_size = 16, string criterion = 'min')

A convenience wrapper around group_list_automatic that returns a list of glyphs that is already updated for splitting and grouping.
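
A custom grouping_function only has to be a callable that takes two glyphs and returns True or False. The sketch below shows one possible shape for such a callable class. It is not BoundingBoxGroupingFunction itself: the class name, the Box stand-in, and the choice of measuring the Euclidean gap between box edges are all assumptions made for illustration.

```python
import math
from collections import namedtuple

# Stand-in for a glyph's bounding box (upper-left and lower-right corners).
Box = namedtuple('Box', 'ul_x ul_y lr_x lr_y')

class BoxGapGroupingFunction:
    """Hypothetical grouping function: True when two boxes are close together."""
    def __init__(self, threshold):
        self.threshold = threshold

    def __call__(self, a, b):
        # Gap between the boxes along each axis (0 when they overlap on that axis).
        dx = max(0, max(a.ul_x, b.ul_x) - min(a.lr_x, b.lr_x))
        dy = max(0, max(a.ul_y, b.ul_y) - min(a.lr_y, b.lr_y))
        return math.hypot(dx, dy) <= self.threshold

dot = Box(10, 0, 12, 2)
stem = Box(10, 5, 12, 20)
far = Box(40, 5, 42, 20)
group = BoxGapGroupingFunction(4)
dot_and_stem = group(dot, stem)   # vertical gap of 3 pixels
stem_and_far = group(stem, far)   # horizontal gap of 28 pixels
```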

Saving and loading

These functions deal with saving and loading the training data of the classifier to/from the Gamera XML format.

Note

UNCLASSIFIED glyphs in the training data are ignored (neither saved nor loaded).

`to_xml`

to_xml (stream stream)

Saves the training data in XML format to the given stream (which could be any object supporting the file protocol, such as a file object or StringIO object).

`to_xml_filename`

to_xml_filename (FileSave filename)

Saves the training data in XML format to the given filename.

`from_xml`

from_xml (stream stream)

Loads the training data from the given stream (which could be any object supporting the file protocol, such as a file object or StringIO object.)

`from_xml_filename`

from_xml_filename (FileOpen filename)

Loads the training data from the given filename.

`merge_from_xml`

merge_from_xml (stream stream)

Loads the training data from the given stream (which could be a file handle or StringIO object) and adds it to the existing training data.

`merge_from_xml_filename`

merge_from_xml_filename (FileOpen filename)

Loads the training data from the given filename and adds it to the existing training data.

Miscellaneous

`is_interactive`

Boolean is_interactive ()

Returns True if classifier is interactive, else False.

`get_glyphs`

ImageList get_glyphs ()

Returns a list of the glyphs in the classifier.

`set_glyphs`

set_glyphs (ImageList glyphs)

Sets the training data for the classifier to the given list of glyphs.

On some non-interactive classifiers, this operation can be quite expensive.

`merge_glyphs`

merge_glyphs (ImageList glyphs)

Adds the given glyphs to the current set of training data.

On some non-interactive classifiers, this operation can be quite expensive.

`clear_glyphs`

clear_glyphs ()

Removes all training data from the classifier.

Interactive classifiers

Classification

`classify_glyph_manual`

classify_glyph_manual (Image glyph, String id)

Sets the classification of the given glyph to the given id and then adds the glyph to the training data. Call this function when the end user definitively knows the identity of the glyph.

glyph: The glyph to classify.
id: The class name.

Note

Here id is a simple string, not of the id_name format, since the confidence of a manual classification is always 1.0.

`classify_list_manual`

classify_list_manual (ImageList glyphs, String id)

Sets the classification of the given glyphs to the given id and then adds the glyphs to the training data. Call this function when the end user definitively knows the identity of the glyphs.

If id begins with the special prefix _group, all of the glyphs in glyphs are combined and the result is added to the training data. This is useful for characters that always appear with multiple connected components, such as the lower-case i.

glyphs: The glyphs to classify.
id: The class name.

Note

Here id is a simple string, not of the id_name format, since the confidence of a manual classification is always 1.0.

`classify_and_update_list_manual`

classify_and_update_list_manual (ImageList glyphs, String id)

A convenience wrapper around classify_list_manual that returns a list of glyphs that is already updated for splitting and grouping.

`add_to_database`

add_to_database (ImageList glyphs)

Adds the given glyph (or list of glyphs) to the classifier training data. Duplicates are not added to the training data. Unlike classify_glyph_manual, no grouping is performed.

`remove_from_database`

remove_from_database (ImageList glyphs)

Removes the given glyphs from the classifier training data. Glyphs that are not in the training data are silently ignored.

Display

`display`

display (ImageList current_database = [], Image context_image = None, List symbol_table = [])

Displays the interactive classifier window, which is where manual training usually takes place.

current_database: A list of glyphs yet to be trained.
context_image: An image of the page where the glyphs in current_database came from.
symbol_table: A set of id names to insert by default into the symbol table.

k Nearest Neighbor classifier

The k Nearest Neighbor classifier is a concrete example of the classifier API. It adds some methods of its own.

kNNNonInteractive has a number of advantages over kNNInteractive:

Each feature can optionally be normalized independently to zero mean and unit variance. This reduces the bias toward features that generate large values, such as area. This normalization may change the classifications that the classifier makes, however.
The selections and weights of the features can be optimized using a genetic algorithm (see the Evolutionary Optimization Module documentation).
The training data can be serialized to a classifier-specific binary file format. This format saves and loads much faster than the Gamera XML file format.

Note

It is good practice to retain the XML file, since it is portable across platforms and to future versions of Gamera. The binary format is not guaranteed to be portable.

Feature management

The classifier automatically manages the generation of feature vectors from glyphs. When a feature vector is needed because a glyph is being automatically classified or added to the training set, it is generated on the fly.

By default, the feature generation method in kNN is quite simple. The user of the classifier provides a list of feature function names (either in the constructor or through the change_feature_set method), and for each glyph, the results of each feature function in the set are appended together to produce a feature vector.
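
The concatenation step can be pictured with a small plain-Python sketch. The two feature functions and the dictionary-based glyph below are hypothetical stand-ins, not Gamera's feature plugins.

```python
def toy_aspect_ratio(glyph):
    """Hypothetical one-dimensional feature."""
    return [glyph['ncols'] / glyph['nrows']]

def toy_size(glyph):
    """Hypothetical two-dimensional feature."""
    return [float(glyph['nrows']), float(glyph['ncols'])]

# Lookup table from feature name to feature function (illustrative only).
FEATURES = {'toy_aspect_ratio': toy_aspect_ratio, 'toy_size': toy_size}

def feature_vector(glyph, feature_names):
    """Append the results of each named feature function, in order."""
    vector = []
    for name in feature_names:
        vector.extend(FEATURES[name](glyph))
    return vector

glyph = {'nrows': 20, 'ncols': 10}
vector = feature_vector(glyph, ['toy_aspect_ratio', 'toy_size'])
```

The resulting vector has one entry per feature dimension, so its length is the sum of the dimensions of the selected features.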

This basic feature generation method can be overridden and replaced with something more appropriate to other problem domains. See the overriding kNN's feature generation appendix for more information.

Methods on all `kNN` classes

kNN Initialization

`init`

kNNNonInteractive (ImageList database = [], features = 'all', bool perform_splits = True, int num_k = 1, bool normalize = False)

Creates a new kNN classifier instance.

database

Can be in one of two forms:

When a list (or Python iterable), each element is a glyph to use as training data for the classifier. (For non-interactive classifiers, this list must be non-empty.)

For non-interactive classifiers, database may be a filename, in which case the classifier will be "unserialized" from the given file.

When initializing a noninteractive classifier, the database must be non-empty.

features

A list of feature function names to use for classification. These feature names correspond to the feature plugin methods. To use all available feature functions, pass in 'all'.

perform_splits

The splitting algorithms are documented in the plugin documentation.

New splitting algorithms can be created by writing plugin methods in the category Segmentation.

normalize: Normalize the feature vectors: x' = (x - mean_x)/stdev_x
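
A minimal sketch of this per-feature normalization in plain Python follows. It assumes the population standard deviation and maps constant features to 0.0; whether Gamera makes the same choices is not stated here, and normalize_features is a hypothetical name.

```python
import statistics

def normalize_features(vectors):
    """Scale every feature dimension to zero mean and unit variance."""
    columns = list(zip(*vectors))
    means = [statistics.mean(col) for col in columns]
    # Assumption: population standard deviation; constant columns map to 0.0.
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(vec, means, stdevs)]
        for vec in vectors
    ]

# Made-up feature vectors: a large-valued feature (like area) and a small one.
vectors = [[100.0, 0.1], [200.0, 0.3], [300.0, 0.2]]
normalized = normalize_features(vectors)
```

After normalization both dimensions have comparable ranges, so the large-valued feature no longer dominates the distance.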

Settings

Settings are various parameters that control the behavior of the classifier. While some are only accessible through the methods given below, the following settings are plain properties of all kNN classifier classes:

num_k: the number k of neighbors to be considered
distance_type: the distance measure for neighborhood. Can be one of CITY_BLOCK (default), EUCLIDEAN or FAST_EUCLIDEAN
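
For orientation, the three distance measures can be sketched in plain Python as follows. Two points are assumptions rather than statements from this document: that FAST_EUCLIDEAN is the Euclidean distance without the final square root, and that the per-dimension weights enter as simple multiplicative factors.

```python
import math

def city_block(a, b, weights):
    """Weighted sum of absolute differences."""
    return sum(w * abs(x - y) for x, y, w in zip(a, b, weights))

def fast_euclidean(a, b, weights):
    """Assumed: squared Euclidean distance (no square root)."""
    return sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights))

def euclidean(a, b, weights):
    """Square root of the weighted squared differences."""
    return math.sqrt(fast_euclidean(a, b, weights))

a, b, weights = [0.0, 0.0], [3.0, 4.0], [1.0, 1.0]
d_city = city_block(a, b, weights)
d_fast = fast_euclidean(a, b, weights)
d_euclid = euclidean(a, b, weights)
```

Because the square root is monotonic, dropping it does not change which neighbors are nearest, only the absolute distance values.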

`change_feature_set`

change_feature_set (features)

Changes the set of features used in the classifier to the given list of feature names.

features: These feature names correspond to the feature plugin methods. To use all available feature functions, pass in 'all'.

`get_selections_by_features`

get_selections_by_features ()

Get the selection vector elements.

This function returns a python dictionary: keys are the feature names, values are lists of zeros and ones, where ones correspond to the selected components.

`get_selections_by_feature`

get_selections_by_feature (String feature_name)

Convenience wrapper for get_selections_by_features function. This function returns only the selection values list for the given feature name.

`get_weights_by_features`

get_weights_by_features ()

Get the weighting vector elements.

This function returns a python dictionary: keys are the feature names, values are lists of real values in [0,1], which give the weight of the respective component.

`get_weights_by_feature`

get_weights_by_feature (String feature_name)

Convenience wrapper for get_weights_by_features function. This function returns only the weighting values list for the given feature name.

`set_selections_by_features`

set_selections_by_features (Dictionary values)

Set the selection vector elements by the corresponding feature name.

values: Python dictionary with feature names as keys and lists as values, as described in get_selections_by_features.

The dictionary must contain an entry for every feature of the currently active feature set, that has been set in the constructor of the classifier or by change_feature_set. Example:

from gamera import knn

# The third positional argument is perform_splits (disabled here).
classifier = knn.kNNNonInteractive("train.xml",
                                   ["aspect_ratio","moments"], 0)
classifier.set_selections_by_features({"aspect_ratio":[1],
                                       "moments":[0, 1, 1, 1, 1, 1, 1, 1, 0]})

`set_selections_by_feature`

set_selections_by_feature (String feature_name, List values)

Set the selection vector elements for one specific feature.

feature_name: The feature name as string.
values: Python list with the selection values for the given feature. Dimension of the list must match with the feature dimension.
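
The two requirements above (an entry for every active feature, and a list length that matches the feature dimension) can be checked with a small plain-Python helper. This is a hypothetical sketch, not part of the Gamera API; flatten_by_features and the feature_dims list are illustrative names, and the dimensions are taken from the example lists shown earlier.

```python
def flatten_by_features(values, feature_dims):
    """Concatenate per-feature lists in feature order, checking dimensions.

    feature_dims is an ordered list of (feature name, dimension) pairs.
    """
    flat = []
    for name, dim in feature_dims:
        if name not in values:
            raise KeyError('missing entry for feature %r' % name)
        if len(values[name]) != dim:
            raise ValueError('feature %r needs %d values' % (name, dim))
        flat.extend(values[name])
    return flat

feature_dims = [('aspect_ratio', 1), ('moments', 9)]
selections = {'aspect_ratio': [1], 'moments': [0, 1, 1, 1, 1, 1, 1, 1, 0]}
flat = flatten_by_features(selections, feature_dims)
```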

`set_weights_by_features`

set_weights_by_features (Dictionary values)

Set the weighting vector elements by the corresponding feature name. The dictionary must contain an entry for every feature of the currently active feature set, that has been set in the constructor of the classifier or by change_feature_set. Example:

from gamera import knn

# The third positional argument is perform_splits (disabled here).
classifier = knn.kNNNonInteractive("train.xml",
                                   ["aspect_ratio","moments"], 0)
classifier.set_weights_by_features({"aspect_ratio":[0.6],
                                    "moments":[0.1, 1.0, 0.3, 0.5, 1.0, 0.0, 1.0, 0.9, 0.0]})

`set_weights_by_feature`

set_weights_by_feature (String feature_name, List values)

Set the weighting vector elements for one specific feature.

feature_name: The feature name as string.
values: Python list with the weighting values for the given feature. Dimension of the list must match with the feature dimension.

`save_settings`

save_settings (FileSave filename)

Save the kNN settings to the given filename. This settings file (which is XML) includes k, distance type, the current selection and weighting. This file is different from the one produced by serialize in that it contains only the settings and no data.

`load_settings`

load_settings (FileOpen filename)

Load the kNN settings from an XML file. See save_settings.

Serialization

`serialize`

serialize (FileSave filename)

Saves the classifier-specific settings and data in an optimized and classifier-specific format.

Note

It is good practice to retain the XML file, since it is portable across platforms and to future versions of Gamera. The binary format is not guaranteed to be portable.

`unserialize`

unserialize (FileOpen filename)

Opens the classifier-specific settings and data from an optimized and classifier-specific format.

Evaluation

`evaluate`

Float evaluate ()

Evaluate the performance of the kNN classifier using leave-one-out cross-validation. The return value is a floating-point number between 0.0 (0% correct) and 1.0 (100% correct).
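
Leave-one-out cross-validation classifies each training sample against all the others and counts how often the result matches the sample's own class. The sketch below illustrates the idea in plain Python with k = 1 and an unweighted city block distance; the function name and the toy data are hypothetical, and it is not the code behind evaluate.

```python
def leave_one_out_accuracy(samples):
    """samples: list of (feature vector, class name) pairs; uses 1-NN."""
    correct = 0
    for i, (vec, label) in enumerate(samples):
        # Classify sample i using every other sample as training data.
        others = samples[:i] + samples[i + 1:]
        nearest = min(
            others,
            key=lambda s: sum(abs(x - y) for x, y in zip(vec, s[0])),
        )
        if nearest[1] == label:
            correct += 1
    return correct / len(samples)

# Made-up one-dimensional training data; the last sample lies closer to class 'b'.
samples = [([0.0], 'a'), ([0.1], 'a'), ([1.0], 'b'), ([1.1], 'b'), ([0.6], 'a')]
accuracy = leave_one_out_accuracy(samples)
```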

`knndistance_statistics`

knndistance_statistics (Int k = 0)

Returns a list of average distances between each training sample and its k nearest neighbors. So, when you have n training samples, n average distance values are returned. This can be useful for distance rejection.

Each item in the returned list is a tuple (d, classname), where d is the average kNN distance and classname is the class name of the training sample. In most cases, the class name is of little interest, but it could be useful if you need class conditional distance statistics. Beware however, that the average distance is computed over neighbors belonging to any class, not just the same class. If you need the latter, you must create a new classifier from training samples belonging only to the specific class.

When k is zero, the property num_k of the knn classifier is used.
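
The shape of the returned list can be illustrated with a plain-Python sketch. It assumes that a sample is not counted as its own neighbor and uses an unweighted city block distance; both are assumptions, and the names are hypothetical. The last line shows one possible (not prescribed) way to derive a rejection threshold from the statistics.

```python
def knn_distance_statistics(samples, k):
    """Return (average distance to the k nearest neighbors, class name) per sample."""
    stats = []
    for i, (vec, label) in enumerate(samples):
        # Distances to all other samples, regardless of their class.
        distances = sorted(
            sum(abs(x - y) for x, y in zip(vec, other))
            for j, (other, _) in enumerate(samples)
            if j != i
        )
        stats.append((sum(distances[:k]) / k, label))
    return stats

samples = [([0.0], 'a'), ([1.0], 'a'), ([3.0], 'b')]
stats = knn_distance_statistics(samples, k=2)

# One possible rejection threshold: the largest average distance seen in training.
threshold = max(d for d, _ in stats)
```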

`distance_from_images`

distance_from_images (ImageList glyphs, Image glyph, Float max = None)

Compute a list of distances between a list of glyphs and a single glyph. Distances greater than max are not included in the output. The return value is a list of floating-point distances.

`distance_between_images`

distance_between_images (Image imagea, Image imageb)

Compute the distance between two images using the settings for the kNN object (distance_type, features, weights, etc). This can be used when more control over the distance computations is needed than with any of the other methods that work on multiple images at once.

`distance_matrix`

distance_matrix (ImageList images, Bool normalize = True)

Create a symmetric FloatImage containing all of the distances between the images in the list passed in. This is useful because it allows you to look up the distance between any pair of images regardless of the order of the pair.

normalize: When true, the features are normalized before performing the distance calculations.

`unique_distances`

unique_distances (ImageList images, Bool normalize = True)

Return a list of the unique pairs of images in the passed in list and the distances between them. The return list is a list of tuples of (distance, imagea, imageb) so that it is easy to sort.

normalize: When true, the features are normalized before performing the distance calculations.
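
The (distance, imagea, imageb) layout can be sketched in plain Python with numbers standing in for images. The helper name and the distance function are hypothetical; the point is only that n items yield n(n-1)/2 unique pairs, which can then be sorted by distance.

```python
from itertools import combinations

def unique_pair_distances(items, distance):
    """Return (distance, a, b) for every unordered pair, sorted by distance."""
    pairs = [(distance(a, b), a, b) for a, b in combinations(items, 2)]
    # Sort on the distance only, since real images may not be comparable.
    pairs.sort(key=lambda entry: entry[0])
    return pairs

items = [0.0, 1.0, 4.0]
pairs = unique_pair_distances(items, lambda a, b: abs(a - b))
closest_distance, closest_a, closest_b = pairs[0]
```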

`kNNInteractive`

`init`

kNNInteractive (ImageList database = [], features = 'all', bool perform_splits = True, int num_k = 1)

Creates a new kNN interactive classifier instance.

database

Must be a list (or Python iterable) containing glyphs to use as training data for the classifier.

When initializing a noninteractive classifier, the database must be non-empty.

features

A list of feature function names to use for classification. These feature names correspond to the feature plugin methods. To use all available feature functions, pass in 'all'.

perform_splits

The splitting algorithms are documented in the plugin documentation.

New splitting algorithms can be created by writing plugin methods in the category Segmentation.

`noninteractive_copy`

noninteractive_copy ()

Creates a non-interactive copy of the interactive classifier.

Improving kNN Classifiers using Editing

Gamera provides a way to improve kNN classifiers by modifying the underlying set of glyphs (training-set). This class of algorithms either removes bad or redundant glyphs or even creates new optimal glyphs from the training-set.

Besides the graphical user interface in the Classifier Display, it is also possible to invoke the algorithms from your script.

Each editing algorithm is a function that takes at least one parameter, a kNNInteractive classifier, and returns a new, edited kNNInteractive classifier. Any additional parameters depend on the specific algorithm, but are optional by convention.

Currently the following editing algorithms are included with Gamera:

`edit_mnn`

edit_mnn (kNNInteractive classifier, int k = 0, bool protectRare, int rareThreshold)

Wilson's Modified Nearest Neighbour (MNN, aka Leave-one-out-editing). The algorithm removes 'bad' glyphs from the classifier, i.e. glyphs that are outliers from their class in feature space, usually because they have been manually misclassified or are not representative of their class.

classifier

The classifier from which to create an edited copy

internalK

The k value used internally by the editing algorithm. If 0 is given for this parameter, the original classifier's k is used (recommended).

protect rare classes

The algorithm tends to delete all items of rare classes, which removes those classes from the classifier entirely. If this is not desired, rare classes can be explicitly protected from deletion. Note that enabling this option requires additional computing effort.

rare class threshold

If protect rare classes is enabled, classes with fewer than this number of elements are considered to be rare.

Reference: D. Wilson: 'Asymptotic Properties of NN Rules Using Edited Data'. IEEE Transactions on Systems, Man, and Cybernetics, 2(3):408-421, 1972
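The leave-one-out idea behind this kind of editing can be sketched in plain Python: each sample is classified by its k nearest neighbours among all the other samples, and samples whose label disagrees with that vote are dropped. This is a simplified illustration, not Gamera's implementation. The function names, the (features, label) tuple layout, the default values and the way rare classes are protected (they are simply never removed) are assumptions made for the example.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_label(query, samples, k):
    """Majority label among the k samples nearest to query."""
    nearest = sorted(samples, key=lambda s: euclidean(query, s[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def wilson_edit(samples, k=3, protect_rare=True, rare_threshold=3):
    """Return the samples that survive leave-one-out editing."""
    class_sizes = Counter(label for _, label in samples)
    kept = []
    for i, (features, label) in enumerate(samples):
        if protect_rare and class_sizes[label] < rare_threshold:
            kept.append((features, label))  # rare class: never removed
            continue
        others = samples[:i] + samples[i + 1:]  # leave the sample itself out
        if knn_label(features, others, k) == label:
            kept.append((features, label))
    return kept

training = [
    ((0.0, 0.0), "a"), ((0.1, 0.0), "a"), ((0.0, 0.1), "a"),
    ((5.0, 5.0), "b"), ((5.1, 5.0), "b"), ((5.0, 5.1), "b"),
    ((5.05, 5.05), "a"),  # mislabelled glyph sitting inside class "b"
]
edited = wilson_edit(training, k=3)
# With a high threshold both classes count as rare, so nothing is removed.
protected = wilson_edit(training, k=3, rare_threshold=5)
```

In this toy data set only the mislabelled sample is removed. Every decision is made against the original, unedited set, so removing one sample does not influence the verdict on another.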

`edit_cnn`

edit_cnn (kNNInteractive classifier, int k = 0, bool randomize)

Hart's Condensed Nearest Neighbour (CNN) editing. This algorithm specializes in removing superfluous glyphs (glyphs that do not influence the recognition rate) from the classifier to improve its classification speed. Typically, glyphs far from the classifier's decision boundaries are removed.

classifier

The classifier from which to create an edited copy

internalK

The k value used internally by the editing algorithm. If 0 is given for this parameter, the original classifier's k is used (recommended).

randomize

The order in which the glyphs in the classifier are processed affects the result of this algorithm. When this option is on, the order is randomized. If reproducible results are required, turn this option off.

Reference: P.E. Hart: 'The Condensed Nearest Neighbor rule'. IEEE Transactions on Information Theory, 14(3):515-516, 1968
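Condensing can be sketched in plain Python as follows: start with a store holding a single sample, then repeatedly add every sample that the store misclassifies until a full pass adds nothing. This is an illustrative sketch and not Gamera's implementation. It uses a fixed 1-nearest-neighbour rule instead of a configurable k, and the names hart_condense, nearest_label and the seed argument are assumptions made for the example.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_label(query, store):
    """Label of the stored sample closest to query (1-NN)."""
    return min(store, key=lambda s: euclidean(query, s[0]))[1]

def hart_condense(samples, randomize=True, seed=None):
    """Return a reduced set that still classifies all samples correctly."""
    order = list(samples)
    if randomize:
        random.Random(seed).shuffle(order)  # processing order affects the result
    store = [order[0]]
    remaining = order[1:]
    changed = True
    while changed:  # repeat until a full pass adds nothing to the store
        changed = False
        still_remaining = []
        for features, label in remaining:
            if nearest_label(features, store) != label:
                store.append((features, label))  # needed to get this one right
                changed = True
            else:
                still_remaining.append((features, label))
        remaining = still_remaining
    return store

training = [
    ((0.0, 0.0), "a"), ((0.1, 0.0), "a"), ((0.0, 0.1), "a"),
    ((5.0, 5.0), "b"), ((5.1, 5.0), "b"), ((5.0, 5.1), "b"),
]
condensed = hart_condense(training, randomize=False)
shuffled = hart_condense(training, randomize=True, seed=1)
```

For two well-separated classes a single representative per class is enough. Which representative is kept depends on the processing order, which is why turning randomization off (or, in this sketch, fixing the seed) gives reproducible results.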

`edit_mnn_cnn`

edit_mnn_cnn (kNNInteractive classifier, int k = 0, bool protectRare, int rareThreshold, bool randomize)

Combined execution of Wilson's Modified Nearest Neighbour and Hart's Condensed Nearest Neighbour. Combining the algorithms in this order is recommended, because bad samples are removed first to improve the classifier's accuracy, and the remaining samples are then condensed to speed up the classifier.

For documentation of the parameters, see the individual algorithms above.

Usage Example

The algorithms are located in the gamera.knn_editing package. So a typical usage example employing Wilson's Modified Nearest Neighbour (edit_mnn) would look like this:

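# Initialize Gamera before using any of its modules.
from gamera.core import init_gamera
init_gamera()

# Create an interactive kNN classifier and load the training data.
from gamera.knn import kNNInteractive
classifier = kNNInteractive()
classifier.from_xml_filename("training-set.xml")

# Run Wilson's editing; the result is a new, edited classifier.
from gamera.knn_editing import edit_mnn
editedClassifier = edit_mnn(classifier)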

To display the glyphs removed by the editing algorithm, you can use the following code in the Gamera GUI:

set1 = classifier.get_glyphs()
set2 = editedClassifier.get_glyphs()
imagelist = list(set1.difference(set2))
display_multi(imagelist)

Integrating your own editing algorithm

If you have written your own editing algorithm and want to make it available in the Classifier GUI, refer to the documentation of the class gamera.knn_editing.AlgoRegistry.