Build Your Own Scikit-learn Transformer

Mansoor Aldosari
2 min read · Aug 15, 2021

Wrapping a preprocessing step in a pipeline lets you treat that step as a tunable hyperparameter during grid search. To do this, we write the step as a class.
The custom transformer class should inherit from both BaseEstimator and TransformerMixin.
In this example, we will use the Titanic data set. We aim to create a new feature, family size, by adding the siblings/spouses column (sibsp) to the parents/children column (parch).

Let’s look at the following class:

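import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class CustomTransformer(TransformerMixin, BaseEstimator):
    def __init__(self, combine_sibsp_parch=True):
        # Flag that switches the family-size feature on or off
        self.combine_sibsp_parch = combine_sibsp_parch

    def fit(self, X, y=None):
        # Nothing is learned from the data, so just return the instance
        return self

    def transform(self, X):
        if self.combine_sibsp_parch:
            # Append sibsp + parch as a new last column
            return np.c_[X, X.sibsp + X.parch]
        else:
            return X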
  • def __init__()
    The initialization method takes one parameter, combine_sibsp_parch, and stores it on the instance under the same name.
  • def fit()
    The fit method returns the instance unchanged, because this transformer has nothing to learn from the data.
  • def transform()
    The transform method is where the transformation happens. In this case, we add the two columns and append the sum as a new last column, which returns a NumPy array. However, if combine_sibsp_parch = False, we return the data unaltered.
np.c_[X, X.sibsp + X.parch]
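To make the three methods concrete, here is a minimal sketch that runs the transformer on a tiny hand-made DataFrame. The three-row sample and the names toy, with_family and without_family are assumptions for illustration, not part of the Titanic data. It also shows what the two base classes contribute: fit_transform comes from TransformerMixin and get_params comes from BaseEstimator.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class CustomTransformer(TransformerMixin, BaseEstimator):
    def __init__(self, combine_sibsp_parch=True):
        self.combine_sibsp_parch = combine_sibsp_parch

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        if self.combine_sibsp_parch:
            return np.c_[X, X.sibsp + X.parch]
        return X


# Hypothetical three-passenger sample (assumption, not real Titanic rows).
toy = pd.DataFrame({"age": [22, 38, 4], "sibsp": [1, 1, 0], "parch": [0, 2, 1]})

# fit_transform is inherited from TransformerMixin: it calls fit, then transform.
with_family = CustomTransformer().fit_transform(toy)
without_family = CustomTransformer(combine_sibsp_parch=False).fit_transform(toy)

print(with_family.shape)       # one more column than toy
print(with_family[:, -1])      # sibsp + parch for each row
print(without_family.shape)    # unchanged

# get_params is inherited from BaseEstimator and reads the __init__ arguments.
print(CustomTransformer().get_params())
```

Because get_params discovers parameters from the __init__ signature, each argument should be stored under an attribute of the same name, as the class does here.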

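Now, we create a pipeline and call fit_transform on X_train. Fitting does nothing for this transformer, but scikit-learn expects a pipeline to be fitted before it transforms. We can see that the number of columns went from 13 to 14.

from sklearn.pipeline import make_pipeline

pipe = make_pipeline(CustomTransformer(combine_sibsp_parch=True))
titanic_tr = pipe.fit_transform(X_train)
X_train.shape       # => (1047, 13)
titanic_tr.shape    # => (1047, 14)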

Finally, we check the result: the new last column should equal the sum of columns 4 and 5.

check_res = titanic_tr[:, 4] + titanic_tr[:, 5] == titanic_tr[:, -1]

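The length of check_res is 1047. Using alien technology, we can reduce the vector to a single representative value. The reducer has to be a logical AND; comparing neighbors with == would also return True for a pair of False values. Calling check_res.all() gives the same answer.

from functools import reduce

reduce(lambda a, b: a and b, check_res)    # => True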
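Coming back to the opening point about grid search, the sketch below shows one way the flag could be tuned like any other hyperparameter. Everything apart from CustomTransformer is an assumption for illustration: the synthetic X_toy and y_toy stand in for the Titanic data, and LogisticRegression is just a convenient final estimator.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline


class CustomTransformer(TransformerMixin, BaseEstimator):
    def __init__(self, combine_sibsp_parch=True):
        self.combine_sibsp_parch = combine_sibsp_parch

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if self.combine_sibsp_parch:
            return np.c_[X, X.sibsp + X.parch]
        return X


# Synthetic stand-in data (assumption): random counts, arbitrary balanced labels.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "sibsp": rng.integers(0, 4, size=60),
    "parch": rng.integers(0, 3, size=60),
})
y_toy = np.tile([0, 1], 30)

pipe = make_pipeline(CustomTransformer(), LogisticRegression())

# make_pipeline names each step after its lowercased class name, so the
# flag is addressed as "customtransformer__combine_sibsp_parch".
param_grid = {"customtransformer__combine_sibsp_parch": [True, False]}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_toy, y_toy)

print(search.best_params_)  # which setting of the flag scored best
```

With random labels the winning setting is meaningless; the point is only that the grid search fits the pipeline once per fold with the feature switched on and once with it switched off.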

You can find the complete code here:
