Advanced columnwise operations
The single column transformer
When we want to apply a custom transformation to a single series, we need ApplyToCols
to handle multiple columns. If the transformation must also be able to reject certain
columns and communicate this to ApplyToCols, we must create a transformer from scratch
that raises core.RejectColumn when appropriate: this can be done with the core.SingleColumnTransformer class.
For instance, we might want to create a custom transformer specialized in parsing zip codes:
in this example, the zip codes need to have the format AB123, that is two letters
followed by three digits.
>>> import pandas as pd
>>> df = pd.DataFrame({'sent': ["AB123", "BD601", "HS014"], 'received': ["AB1C45", "DU3K93", "WB9M88"]})
>>> df
    sent  received
0  AB123    AB1C45
1  BD601    DU3K93
2  HS014    WB9M88
We would like to be able to “unpack” the zip code so that we have a column for the letters and one for the digits; the transformer should also be able to “reject” a column if it does not satisfy the format we specify. A “rejected” column should be passed through unchanged, as it cannot be handled by this particular transformer.
We can therefore define a custom class that inherits from core.SingleColumnTransformer
and that raises core.RejectColumn if a column cannot be handled:
>>> from skrub.core import RejectColumn, SingleColumnTransformer
>>> class ZipcodeParser(SingleColumnTransformer):
...     def __init__(self):
...         # This transformer takes no parameters.
...         pass
...     def fit_transform(self, X, y=None):
...         # Reject the column if any value does not have exactly 5 characters.
...         if (X.map(len) != 5).any():
...             raise RejectColumn('This transformer only takes zip codes of length 5.')
...         letters = X.map(lambda s: s[:2])
...         try:
...             numbers = X.map(lambda s: int(s[2:]))
...         except ValueError:
...             # The last three characters could not be converted to an integer.
...             raise RejectColumn('Input zip codes must consist of two letters followed by three numbers.')
...         return pd.DataFrame({'letters': letters, 'numbers': numbers})
>>> ZipcodeParser().fit_transform(df["sent"])
   letters  numbers
0       AB      123
1       BD      601
2       HS       14
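Note that the parser above only checks the length of each value and that the last three characters convert to an integer, so a value such as "12345" would be accepted. If a stricter check of the "two letters followed by three digits" format is wanted, one option is a regular expression. The following is a minimal sketch using only the standard library; `split_zipcode` and `ZIPCODE_PATTERN` are hypothetical names, not part of skrub, and the helper raises a plain ValueError where a transformer would raise RejectColumn:

```python
import re

# Two ASCII letters followed by exactly three digits, e.g. "AB123".
ZIPCODE_PATTERN = re.compile(r"[A-Za-z]{2}[0-9]{3}")


def split_zipcode(value):
    """Split a zip code into (letters, number). Hypothetical helper."""
    if ZIPCODE_PATTERN.fullmatch(value) is None:
        # Inside a transformer, this is where RejectColumn could be raised.
        raise ValueError(f"{value!r} is not two letters followed by three digits")
    return value[:2], int(value[2:])


print(split_zipcode("HS014"))  # ('HS', 14)
```

Such a check could be applied to every value of the column before the letters and digits are split.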
We can use ApplyToCols to apply this transformer to the entire dataframe at once,
and set allow_reject=True to let rejected columns through without changes:
>>> from skrub import ApplyToCols
>>> ApplyToCols(ZipcodeParser(), allow_reject=True).fit_transform(df)
   letters  numbers  received
0       AB      123    AB1C45
1       BD      601    DU3K93
2       HS       14    WB9M88
Note how the "received" column has been “rejected” and passed through unmodified.
Rejection handling with ApplyToCols and core.RejectColumn
Combining ApplyToCols with core.RejectColumn allows flexible manipulation
and error checking of dataframes. In the previous example, we decided to ignore the
malformed "received" column by setting allow_reject=True. If, however,
we want our transformer to fail if it encounters a column that it cannot parse,
we can keep the default value of allow_reject=False, so that the transform
fails as soon as a malformed column is encountered:
>>> ApplyToCols(ZipcodeParser()).fit_transform(df)
Traceback (most recent call last):
...
ValueError: Transformer ZipcodeParser.fit_transform failed on column 'received'. ...
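The difference between the two settings can be pictured with a small, library-free sketch. This is not how skrub implements ApplyToCols; the names `Rejected`, `parse_zipcodes` and `apply_to_columns` are hypothetical, and a dict of lists stands in for a dataframe:

```python
class Rejected(Exception):
    """Stand-in for skrub.core.RejectColumn in this sketch."""


def parse_zipcodes(values):
    """Toy column transformer: rejects the column unless every value has length 5."""
    if any(len(v) != 5 for v in values):
        raise Rejected("zip codes must have length 5")
    return {
        "letters": [v[:2] for v in values],
        "numbers": [int(v[2:]) for v in values],
    }


def apply_to_columns(table, transform, allow_reject=False):
    """Apply `transform` to each column of `table` (a dict of lists)."""
    result = {}
    for name, values in table.items():
        try:
            result.update(transform(values))
        except Rejected as err:
            if not allow_reject:
                # Strict mode: a rejected column is reported as an error.
                raise ValueError(f"transform failed on column {name!r}") from err
            # Lenient mode: the rejected column is passed through unchanged.
            result[name] = values
    return result


table = {"sent": ["AB123", "BD601"], "received": ["AB1C45", "DU3K93"]}
print(apply_to_columns(table, parse_zipcodes, allow_reject=True))
```

In this sketch the rejection is a dedicated exception type, so the caller can tell "this column is not for me" apart from a genuine bug in the transformer.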
Letting rejected columns through is useful when we do not know the content of a column in advance, for example when we try to convert the columns of a dataframe to datetime without knowing which ones actually contain dates.
>>> from skrub import ToDatetime
>>> df = pd.DataFrame(dict(birthday=["29/01/2024"], city=["London"]))
>>> df
     birthday    city
0  29/01/2024  London
>>> df.dtypes
birthday ...
city ...
dtype: object
Converting a column that contains dates works:
>>> ToDatetime().fit_transform(df["birthday"])
0 2024-01-29
Name: birthday, dtype: datetime64[...]
A column that does not contain dates raises core.RejectColumn:
>>> ToDatetime().fit_transform(df["city"])
Traceback (most recent call last):
...
skrub.core.RejectColumn: Could not find a datetime format for column 'city'.
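To illustrate the idea of rejecting a column when no datetime format fits, here is a minimal standard-library sketch. It is not ToDatetime's actual algorithm: the list of candidate formats and the name `parse_dates` are assumptions made for this example, and a plain ValueError stands in for RejectColumn:

```python
from datetime import datetime

# Candidate formats are an assumption made for this sketch.
CANDIDATE_FORMATS = ("%d/%m/%Y", "%Y-%m-%d")


def parse_dates(values):
    """Parse all values with the first candidate format that fits every one."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return [datetime.strptime(v, fmt) for v in values]
        except ValueError:
            continue  # this format does not fit; try the next one
    # No format fits the whole column: signal that it cannot be handled.
    raise ValueError("could not find a datetime format")


print(parse_dates(["29/01/2024"]))
```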
The allow_reject parameter of ApplyToCols makes it possible to apply the same transformer
to all columns without having to worry about which columns will actually be converted:
here, ToDatetime is applied only to the “birthday” column, while “city” is passed
through unchanged and no exception is raised.
>>> to_datetime = ApplyToCols(ToDatetime(), allow_reject=True)
>>> transformed = to_datetime.fit_transform(df)
>>> transformed
    birthday    city
0 2024-01-29  London
We can see that the only column that has a transformer is “birthday”:
>>> to_datetime.transformers_
{'birthday': ToDatetime()}