Keras Data Augmentation for Scikit-learn

Here is a method to integrate a preprocessing utility from Keras with a model from Scikit-learn. The complete Jupyter notebook is in the reference section below.

We will be working with the MNIST data set:

  • 60,000 Examples for the training set.
  • 10,000 Examples for the test set.
  • 784 Features (Images of handwritten digits, 28 * 28 pixels).

We will use a KNeighborsClassifier, a lazy learner! It memories the data during training time and learns during runtime.

Here we will use the ImageDataGenerator class from Keras. First, we create an object:

This object will:

  • Shift the image by 2 pixels vertically and horizontally.
  • Rotates the image by 20 degrees.

Calling the fit method:

The fit method expects a rank 4 sample as an input (Examples, width, height, channels {1 for grayscale, 3 for RGB…}).

Then we call the flow method to generate the new data set:

We need to preserve the order of the data so that it matches the labels. The batch size should be 1 to get the whole data set.

The new data is an iterator, ideal for incremental learners, but our model expects an array-like list.

We have to materialize the iterator to using the comprehension list:

This step will turn the iterator into a list. The 4 is a multiplier for how many sets to generate (4 * 60k = 240k).

The last step is to reshape the data from rank 4 to its original shape of rank 2:

When we plot, we get this.

Now the data is ready for a second run!

Programming and Statistics for now!