Build Your Own Scikit-learn Transformer

Placing a preprocessing step inside a pipeline lets you treat that step as a tunable hyperparameter during grid search. To do this, we write the step as a custom transformer class.
The class should inherit from both TransformerMixin and BaseEstimator: BaseEstimator supplies get_params and set_params, which grid search relies on, and TransformerMixin supplies fit_transform.
In this example, we use the Titanic data set. The goal is to create a new feature, family size, by adding the siblings/spouses column (sibsp) to the parents/children column (parch).
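
Before wrapping anything in a class, it may help to see the feature on its own. The sketch below uses a tiny DataFrame whose values are invented for illustration (only the column names sibsp and parch follow the data set) and adds the two columns with plain pandas.

```python
import pandas as pd

# Toy rows invented for illustration; only the column names follow the data set.
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0],
    "sibsp": [1, 1, 0],
    "parch": [0, 2, 0],
})

# Family size as described above: siblings/spouses plus parents/children.
family_size = toy["sibsp"] + toy["parch"]
print(family_size.tolist())  # [1, 3, 0]
```

The transformer class does the same addition, but in a form that a pipeline can switch on or off.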

Let’s look at the following class:

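import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class CustomTransformer(TransformerMixin, BaseEstimator):
    def __init__(self, combine_sibsp_parch=True):
        # Flag that switches the new feature on or off
        self.combine_sibsp_parch = combine_sibsp_parch

    def fit(self, X, y=None):
        # Nothing to learn from the data
        return self

    def transform(self, X):
        if self.combine_sibsp_parch:
            # Append sibsp + parch as an extra last column
            return np.c_[X, X.sibsp + X.parch]
        else:
            return X
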
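  • def __init__()
    The initialization method takes one parameter, combine_sibsp_parch, and stores it unchanged on the instance so that it can be set from outside, for example by grid search.
  • def fit()
    The fit method has nothing to learn here, so it simply returns the instance (self).
  • def transform()
    The transform method is where the transformation happens. When combine_sibsp_parch is True, we add the two columns and append the sum to the rest of the columns as a new last column. When it is False, we return the data unaltered. The expression that does the appending is: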
np.c_[X, X.sibsp + X.parch]
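
To make the three methods concrete, here is a minimal sketch that repeats the class so it runs on its own and tries it on a two-row DataFrame. The toy values are invented for illustration; the names params, combined, and unchanged are just local variables for this example.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class CustomTransformer(TransformerMixin, BaseEstimator):
    """Same class as above, repeated so this snippet is self-contained."""

    def __init__(self, combine_sibsp_parch=True):
        self.combine_sibsp_parch = combine_sibsp_parch

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if self.combine_sibsp_parch:
            return np.c_[X, X.sibsp + X.parch]
        return X


# Toy data invented for illustration.
toy = pd.DataFrame({"age": [22.0, 38.0], "sibsp": [1, 1], "parch": [0, 2]})

# get_params comes from BaseEstimator and reads the __init__ signature.
params = CustomTransformer().get_params()

# fit_transform comes from TransformerMixin: it calls fit, then transform.
combined = CustomTransformer(combine_sibsp_parch=True).fit_transform(toy)
unchanged = CustomTransformer(combine_sibsp_parch=False).fit_transform(toy)

print(params)           # {'combine_sibsp_parch': True}
print(combined.shape)   # (2, 4): three original columns plus the sum
print(unchanged.shape)  # (2, 3): data returned as-is
```

Note that the combined result is a NumPy array, while the unaltered result is still the original DataFrame.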

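Now we create a pipeline and call fit_transform on X_train. Fitting first is the safer habit, since newer scikit-learn releases expect a pipeline to be fitted before it is used to transform. The number of columns goes from 13 to 14.

from sklearn.pipeline import make_pipeline

pipe = make_pipeline(CustomTransformer(combine_sibsp_parch=True))
titanic_tr = pipe.fit_transform(X_train)
X_train.shape       # => (1047, 13)
titanic_tr.shape    # => (1047, 14)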

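Finally, we verify the result: the sum of columns 4 and 5 (the positions of sibsp and parch here) should equal the new last column.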

check_res = titanic_tr[:, 4] + titanic_tr[:, 5] == titanic_tr[:, -1]

check_res is a Boolean vector of length 1047. Using alien technology (functools.reduce), we can collapse the vector into a single value that is True only if every element is True.

from functools import reduce

reduce(lambda a, b: a and b, check_res)    # => True
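
The opening claim, that the preprocessing step becomes a tunable hyperparameter, can be sketched as follows. Everything about the data is an assumption made for illustration: the values are random stand-ins for the Titanic columns, the labels are arbitrary, and LogisticRegression is chosen only to give the pipeline a final estimator. None of these come from the original code.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline


class CustomTransformer(TransformerMixin, BaseEstimator):
    """Same class as above, repeated so this snippet is self-contained."""

    def __init__(self, combine_sibsp_parch=True):
        self.combine_sibsp_parch = combine_sibsp_parch

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if self.combine_sibsp_parch:
            return np.c_[X, X.sibsp + X.parch]
        return X


# Synthetic stand-in for the Titanic data (assumption, not the real data set).
rng = np.random.default_rng(0)
n = 60
X = pd.DataFrame({
    "age": rng.uniform(1, 80, n),
    "sibsp": rng.integers(0, 4, n),
    "parch": rng.integers(0, 3, n),
})
y = np.tile([0, 1], n // 2)  # arbitrary labels, alternating so both classes appear

pipe = make_pipeline(CustomTransformer(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name,
# so the flag is addressed as "customtransformer__combine_sibsp_parch".
param_grid = {"customtransformer__combine_sibsp_parch": [True, False]}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)

# Which setting wins depends on the data; with random labels it is not meaningful.
print(search.best_params_)
```

Grid search fits the pipeline once per fold with the feature switched on and once with it switched off, then reports which setting scored better.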

You can find the complete code here:

https://github.com/booletic

Mansoor Aldosari
