Build Your Own Scikit-learn Transformer


Wrapping a preprocessing step in a pipeline lets you treat that step as a tunable hyperparameter during grid search. To do this, we write the step as a custom transformer class.
The custom transformer class should inherit from the BaseEstimator and TransformerMixin classes. BaseEstimator provides get_params and set_params, which grid search relies on, and TransformerMixin provides fit_transform.
In this example, we will use the Titanic data set. We aim to create a new feature, family size, by adding the siblings/spouses column to the parents/children column.

Let’s look at the following class:

  • def __init__()
    The initialization method has the parameter combine_sibsp_parch.
  • def fit()
    The fit method returns the instance.
  • def transform()
    The transform method is where the transformation happens. In this case, we sum the two columns and append the result to the data as a new column. However, if combine_sibsp_parch = False, we return the data unaltered.
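The three methods above could look roughly like the sketch below. This is not necessarily the original code: the class name FamilySizeAdder, the column names sibsp and parch, the new column name family_size, and the toy data are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FamilySizeAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: appends family_size = sibsp + parch."""

    def __init__(self, combine_sibsp_parch=True):
        # Store the argument under the same name so get_params/set_params work.
        self.combine_sibsp_parch = combine_sibsp_parch

    def fit(self, X, y=None):
        # Nothing is learned from the data, so just return the instance.
        return self

    def transform(self, X):
        if not self.combine_sibsp_parch:
            # Switch is off: hand the data back unaltered.
            return X
        X = X.copy()  # avoid mutating the caller's DataFrame
        X["family_size"] = X["sibsp"] + X["parch"]
        return X


# Toy data with made-up values (assumption: pandas DataFrame input).
toy = pd.DataFrame(
    {"sibsp": [1, 0, 3], "parch": [0, 2, 1], "fare": [7.25, 71.28, 21.08]}
)
with_feature = FamilySizeAdder().fit_transform(toy)
without_feature = FamilySizeAdder(combine_sibsp_parch=False).fit_transform(toy)
```

Storing the constructor argument unchanged under its own name is what lets BaseEstimator expose it as a parameter that grid search can set.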

Now, we create a pipeline and call the transform method on X_train. We can see that the number of columns goes from 13 to 14.
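Here is a minimal sketch of the pipeline step. It uses a small made-up X_train rather than the real Titanic data, so the column count goes from 3 to 4 instead of 13 to 14. The class name FamilySizeAdder, the step name family_size, and the column names are assumptions. The sketch calls fit_transform because recent scikit-learn versions may complain when transform is called on an unfitted pipeline.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class FamilySizeAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: appends family_size = sibsp + parch."""

    def __init__(self, combine_sibsp_parch=True):
        self.combine_sibsp_parch = combine_sibsp_parch

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if not self.combine_sibsp_parch:
            return X
        X = X.copy()
        X["family_size"] = X["sibsp"] + X["parch"]
        return X


# Made-up stand-in for the training split.
X_train = pd.DataFrame(
    {"sibsp": [1, 0, 3, 0], "parch": [0, 2, 1, 0], "age": [22.0, 38.0, 4.0, 35.0]}
)

pipe = Pipeline([("family_size", FamilySizeAdder())])
X_out = pipe.fit_transform(X_train)
cols_before, cols_after = X_train.shape[1], X_out.shape[1]

# The switch is reachable through the pipeline as <step name>__<parameter>.
pipe.set_params(family_size__combine_sibsp_parch=False)
X_unchanged = pipe.fit_transform(X_train)
```

The same family_size__combine_sibsp_parch key can be listed in a grid search parameter grid with the values True and False, which is what makes the preprocessing step tunable.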

Finally, we assert that the result is correct.

The length of check_res is 1047. Because an assertion needs a single truth value, we reduce the vector to a single representative value.
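One way to do that reduction, sketched on made-up data: compare the new column with the sum of the two source columns, then collapse the boolean vector with .all(). The use of .all(), the class name FamilySizeAdder, and the column names are assumptions. Only the name check_res comes from the text above, and here it has 4 entries rather than 1047.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FamilySizeAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: appends family_size = sibsp + parch."""

    def __init__(self, combine_sibsp_parch=True):
        self.combine_sibsp_parch = combine_sibsp_parch

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if not self.combine_sibsp_parch:
            return X
        X = X.copy()
        X["family_size"] = X["sibsp"] + X["parch"]
        return X


# Made-up stand-in for the training split.
X_train = pd.DataFrame({"sibsp": [1, 0, 3, 0], "parch": [0, 2, 1, 0]})
X_out = FamilySizeAdder().fit_transform(X_train)

# Element-wise comparison: one boolean per row.
check_res = X_out["family_size"] == X_train["sibsp"] + X_train["parch"]

# Asserting on the Series directly would raise a ValueError (ambiguous truth
# value), so collapse it to a single boolean first.
all_rows_match = bool(check_res.all())
assert all_rows_match
```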

You can find the complete code here:

Programming and Statistics for now!