Build Your Own Scikit-learn Transformer
Wrapping a preprocessing step in a custom transformer lets you put it in a pipeline and treat its options as tunable hyperparameters during grid search. We do this by writing the step as a class.
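The custom transformer class should inherit from the BaseEstimator and TransformerMixin classes. BaseEstimator provides get_params and set_params, which grid search relies on, and TransformerMixin provides fit_transform.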
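To make the role of the two base classes concrete, here is a minimal sketch. AddConstant is a hypothetical toy transformer invented for this illustration; it is not part of the Titanic example.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class AddConstant(TransformerMixin, BaseEstimator):
    """Hypothetical toy transformer, used only to show the inherited methods."""

    def __init__(self, constant=1.0):
        # Store constructor arguments under the same name so get_params can find them
        self.constant = constant

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data
        return self

    def transform(self, X):
        return np.asarray(X) + self.constant


t = AddConstant(constant=2.0)
params = t.get_params()                  # inherited from BaseEstimator
t.set_params(constant=5.0)               # inherited from BaseEstimator
out = t.fit_transform(np.zeros((3, 2)))  # inherited from TransformerMixin
print(params, out.shape)
```

Only __init__, fit, and transform are written by hand; the parameter handling and fit_transform come from the base classes.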
In this example, we will use the Titanic data set. We aim to extract a new feature, family size, by adding the siblings/spouses column (sibsp) to the parents/children column (parch).
Let’s look at the following class:
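import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class CustomTransformer(TransformerMixin, BaseEstimator):
    def __init__(self, combine_sibsp_parch=True):
        self.combine_sibsp_parch = combine_sibsp_parch

    def fit(self, X, y=None):
        # Nothing to learn from the data, so just return the instance
        return self

    def transform(self, X):
        if self.combine_sibsp_parch:
            # Append family size (sibsp + parch) as an extra last column
            return np.c_[X, X.sibsp + X.parch]
        else:
            return X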
- def __init__()
The initialization method takes the parameter combine_sibsp_parch and stores it on the instance.
- def fit()
The fit method returns the instance unchanged, because there is nothing to learn from the data.
- def transform()
The transform method is where the transformation happens. In this case, we add the two columns together and append the result to the rest of the columns. However, if combine_sibsp_parch = False, we return the data unaltered.
np.c_[X, X.sibsp + X.parch]
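To see what this expression does in isolation, here is a small sketch on a made-up three-row DataFrame (the values and the fare column are assumptions for illustration, not rows from the Titanic data). np.c_ stacks its arguments column-wise, so the summed Series becomes a new last column.

```python
import numpy as np
import pandas as pd

# Made-up stand-in data; only the column names sibsp and parch match the article
X = pd.DataFrame({
    "sibsp": [1, 0, 3],
    "parch": [0, 2, 1],
    "fare": [7.25, 71.28, 8.05],
})

# Column-wise concatenation: the original columns plus family size at the end
combined = np.c_[X, X.sibsp + X.parch]

print(type(combined).__name__)  # a NumPy array, not a DataFrame
print(combined.shape)           # one more column than X
print(combined[:, -1])          # the family-size column
```

Note that the result is a plain NumPy array, so the column names are lost and later steps have to address columns by position.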
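Now, we create a pipeline and call its fit_transform method on X_train. Fitting does nothing for this transformer, but it avoids calling transform on an unfitted pipeline. We can see that the number of columns went from 13 to 14.
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(CustomTransformer(combine_sibsp_parch=True))
titanic_tr = pipe.fit_transform(X_train)
X_train.shape     # => (1047, 13)
titanic_tr.shape  # => (1047, 14)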
Finally, we verify the result by comparing the sum of columns 4 and 5 with the new last column, element by element.
check_res = titanic_tr[:, 4] + titanic_tr[:, 5] == titanic_tr[:, -1]
The length of check_res is 1047. To reduce the vector to a single value, we check that every element is True.
check_res.all() # => True
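To tie this back to the grid-search motivation, the following is a minimal sketch of tuning combine_sibsp_parch as a hyperparameter. The synthetic data, the LogisticRegression classifier, and the cv=3 setting are assumptions made for illustration; they are not taken from the original example. make_pipeline names each step after its lowercased class name, which is where the customtransformer__ prefix comes from.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline


class CustomTransformer(TransformerMixin, BaseEstimator):
    def __init__(self, combine_sibsp_parch=True):
        self.combine_sibsp_parch = combine_sibsp_parch

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if self.combine_sibsp_parch:
            return np.c_[X, X.sibsp + X.parch]
        return X


# Synthetic stand-in for the Titanic data (assumption: random values, random labels)
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "sibsp": rng.integers(0, 4, size=60),
    "parch": rng.integers(0, 3, size=60),
    "fare": rng.uniform(5, 100, size=60),
})
y = np.tile([0, 1], 30)  # balanced dummy labels

pipe = make_pipeline(CustomTransformer(), LogisticRegression(max_iter=1000))

# Step name is the lowercased class name, followed by "__" and the parameter name
param_grid = {"customtransformer__combine_sibsp_parch": [True, False]}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
```

Grid search clones the pipeline for each candidate and sets the flag through set_params, so both the "with family size" and "without family size" variants are cross-validated with the same downstream model.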
You can find the complete code here: