Overriding kNN's feature generation

Last modified: September 16, 2022

Introduction

The classifier automatically manages the generation of feature vectors from glyphs. When a feature vector is needed, because a glyph is being classified or added to the training set, it is generated on the fly.

By default, the feature generation method in kNN is quite simple. The user of the classifier provides a list of feature function names (either in the constructor or through the change_feature_set method), and for each glyph, the results of each feature function in the set are appended together to produce a feature vector.
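
The default behavior described above can be sketched in plain Python. This is only an illustration of the idea, not Gamera's implementation: aspect_ratio, size and concatenate_features are hypothetical names, and the glyph is a plain dictionary standing in for a real image.

```python
import array

# Hypothetical stand-ins for feature functions. Each returns a fixed-length
# list of floats for a glyph. These are not Gamera's real feature functions.
def aspect_ratio(glyph):
    return [glyph["width"] / glyph["height"]]

def size(glyph):
    return [float(glyph["width"]), float(glyph["height"])]

def concatenate_features(glyph, feature_functions):
    """Append the result of each feature function, in order, into one vector."""
    vector = array.array('d')
    for function in feature_functions:
        vector.extend(function(glyph))
    return vector

# A dictionary stands in for a glyph here (an assumption for illustration).
glyph = {"width": 4, "height": 8}
features = concatenate_features(glyph, [aspect_ratio, size])
```

The length of the resulting vector is the sum of the lengths returned by the individual feature functions, which is why the default classifier can calculate it from the list of names alone.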

This simple feature generation method can be overridden and replaced with something more appropriate to specific problem domains. The only restriction (which is an inherent limitation of the k-nearest neighbor algorithm) is that the length of the feature vectors must be the same for all glyphs within a single classifier. If your problem requires variable-length feature vectors, you will need to use another kind of classifier or use multiple kNN classifiers.
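
To see why equal lengths are required, consider how a nearest-neighbor comparison measures the distance between two feature vectors. The sketch below assumes a plain Euclidean distance purely for illustration; euclidean_distance is a hypothetical helper, not part of Gamera.

```python
import array
import math

def euclidean_distance(a, b):
    """Distance between two feature vectors; only defined for equal lengths."""
    if len(a) != len(b):
        raise ValueError(
            "feature vectors differ in length: %d vs %d" % (len(a), len(b)))
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two vectors of the same length can be compared.
same = euclidean_distance(array.array('d', [0, 3]), array.array('d', [4, 0]))

# Vectors of different lengths have no meaningful distance.
try:
    euclidean_distance(array.array('d', [0, 3]), array.array('d', [4, 0, 1]))
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```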

Creating a subclass of kNN

Changing how feature generation works is as simple as creating a new class that inherits from either kNNInteractive or kNNNonInteractive and overriding the generate_features method. __init__ may also be overridden in order to accept parameters specific to the class.

from gamera import knn, classify
import array

class MykNN(knn.kNNInteractive):
   def __init__(self, database, num_features, perform_splits=True):
      ...  # described below

   def generate_features(self, glyph):
      ...  # described below

__init__

The __init__ function must do two things:

  • initialize the low-level kNN classifier and set the length of the feature vectors.
  • initialize the high-level classifier interface with the initial database of glyphs (or filename).

In the example below, the length of the feature vector is set using an argument to the constructor. Note that this is in contrast to the default behavior, where a list of feature functions is passed in and the length of the feature vectors is calculated from that.

def __init__(self, database, num_features, perform_splits=True):
   knn._kNNBase.__init__(self, num_features)
   classify.InteractiveClassifier.__init__(self, database, perform_splits=perform_splits)

generate_features

The generate_features function is where the feature generation actually happens and is called every time the classifier needs a feature vector for a particular glyph.

The feature vector must be stored in the glyph's features member. It must be a Python array of doubles (array.array with type code 'd'), and its length must equal the feature vector length of the whole classifier. For efficiency reasons, the type checking within kNN is very weak, so it is up to generate_features to get this right.

In practice, the content of the feature vector is computed by whatever process suits the problem domain (which is the whole point of this document).

As a trivial example, the following simply generates a feature vector full of zeros:

def generate_features(self, glyph):
  glyph.features = array.array('d', [0] * self.num_features)
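
Because the type checking within kNN is weak, a custom generate_features may want to verify its own output before storing it. The following is a minimal sketch of such a check; check_feature_vector is a hypothetical helper, not part of Gamera.

```python
import array

def check_feature_vector(features, expected_length):
    """Raise if `features` is not an array of doubles of the expected length."""
    if not isinstance(features, array.array) or features.typecode != 'd':
        raise TypeError("features must be an array.array with type code 'd'")
    if len(features) != expected_length:
        raise ValueError(
            "expected %d features, got %d" % (expected_length, len(features)))
    return features

# A correctly typed vector of the right length passes through unchanged.
ok = check_feature_vector(array.array('d', [0.0, 1.0, 2.0]), 3)
```

Inside generate_features, the helper could wrap the computed vector just before it is assigned to glyph.features, with self.num_features as the expected length.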

Pitfalls and performance issues

When you start to put this into a real system, things get more complex. generate_features may be called multiple times on the same glyph, for instance when the same glyph is classified more than once. More dangerously, it can happen when glyphs are moved between different classifiers. If those classifiers use feature vectors of different lengths (or even different sets or orderings of features), things won't work as expected, because the classifiers will be sharing incompatible feature vectors. It is therefore a good idea either to keep the sets of glyphs disjoint between all active classifiers, or to copy the glyph instances when moving them from one classifier to another.
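
The copying approach can be sketched with stand-in classes. StandInGlyph and StandInClassifier are hypothetical and only mimic the features member and the generate_features method; real glyphs would be copied with whatever copy facility the image class provides.

```python
import array
import copy

class StandInGlyph:
    """Hypothetical stand-in for a glyph; it only carries a features array."""
    def __init__(self):
        self.features = array.array('d')

class StandInClassifier:
    """Hypothetical stand-in that fills glyph.features with zeros."""
    def __init__(self, num_features):
        self.num_features = num_features

    def generate_features(self, glyph):
        glyph.features = array.array('d', [0.0] * self.num_features)

small = StandInClassifier(3)
large = StandInClassifier(5)

glyph = StandInGlyph()
small.generate_features(glyph)

# Give the second classifier its own copy so the two never share a vector.
glyph_for_large = copy.deepcopy(glyph)
large.generate_features(glyph_for_large)
```

After this, each classifier holds a glyph whose feature vector has the length it expects, and neither can overwrite the other's vector.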

Calling generate_features multiple times on the same glyph may also have performance implications. In the default implementation, when a feature vector is generated, the feature functions that were used to generate it are also stored in a member of the glyph. The next time generate_features is called for the glyph, the feature vector is regenerated only if the set of feature functions in use differs from the set used the last time. This avoids a redundant, time-consuming feature generation.
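
The caching idea behind the default implementation can be sketched as follows. This is a simplified model, not Gamera's code: StandInGlyph, its feature_names member, and the two feature functions are hypothetical, and the calls list exists only to show when real work is done.

```python
import array

class StandInGlyph:
    """Hypothetical stand-in for a glyph."""
    def __init__(self, width, height):
        self.width = width
        self.height = height
        self.features = array.array('d')
        self.feature_names = None  # hypothetical member: names used last time

calls = []  # records every real feature computation

def width_feature(glyph):
    calls.append("width_feature")
    return [float(glyph.width)]

def height_feature(glyph):
    calls.append("height_feature")
    return [float(glyph.height)]

def generate_features(glyph, feature_functions):
    """Regenerate only if the feature functions differ from the last call."""
    names = [function.__name__ for function in feature_functions]
    if glyph.feature_names == names:
        return  # vector is already up to date
    vector = array.array('d')
    for function in feature_functions:
        vector.extend(function(glyph))
    glyph.features = vector
    glyph.feature_names = names

glyph = StandInGlyph(4, 8)
generate_features(glyph, [width_feature, height_feature])  # computed
generate_features(glyph, [width_feature, height_feature])  # skipped
generate_features(glyph, [height_feature])                 # recomputed
```

Comparing the ordered list of names, rather than only the vector length, also catches the case where two feature sets happen to produce vectors of the same length.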

It is impossible for the classifier system to know what external forces will require feature regeneration, so it is the responsibility of the custom generate_features method to determine whether regeneration is needed.

In the following example, the feature vector is regenerated if the feature vector length has changed since the last time it was generated.

def generate_features(self, glyph):
  if len(glyph.features) != self.num_features:
    glyph.features = array.array('d', [0] * self.num_features)