.. currentmodule:: skrub

.. |ApplyToCols| replace:: :class:`ApplyToCols`
.. |RejectColumn| replace:: :class:`core.RejectColumn`
.. |SingleColumnTranformer| replace:: :class:`core.SingleColumnTransformer`
.. |ToDatetime| replace:: :class:`ToDatetime`

.. _user_guide_single_column_transformer:

Advanced columnwise operations
------------------------------

The single column transformer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When a custom transformation operates on a single column (a series), |ApplyToCols|
is needed to apply it to several columns of a dataframe. If the transformation must
also be able to reject columns it cannot handle and signal this to |ApplyToCols|,
we need to write a transformer that raises |RejectColumn| when appropriate.
This can be done by subclassing |SingleColumnTranformer|.

For instance, we might want a custom transformer that parses zip codes.
In this example, zip codes must have the format ``AB123``, that is, two letters
followed by three digits.

>>> import pandas as pd
>>> df = pd.DataFrame({'sent': ["AB123", "BD601", "HS014"], 'received': ["AB1C45", "DU3K93", "WB9M88"]})
>>> df
    sent received
0  AB123   AB1C45
1  BD601   DU3K93
2  HS014   WB9M88

We would like to be able to "unpack" the zip code so that we have a column for the
letters and one for the digits; the transformer should also be able to "reject" a column
if it does not satisfy the format we specify. A "rejected" column should be passed
through unchanged, as it cannot be handled by this particular transformer.

We can therefore define a custom class that inherits from |SingleColumnTranformer|
and that raises |RejectColumn| if a column cannot be handled:

>>> from skrub.core import RejectColumn, SingleColumnTransformer
>>> class ZipcodeParser(SingleColumnTransformer):
...     def __init__(self):
...         return
...     def fit_transform(self, X, y=None):
...         if any(X.map(len) != 5):
...             raise RejectColumn('This transformer only takes zip codes of length 5.')
...         else:
...             letters = X.map(lambda s: s[:2])
...             try:
...                 numbers = X.map(lambda s: int(s[2:]))
...             except ValueError:
...                 raise RejectColumn('Input zip codes must consist of two letters followed by three digits.')
...             return pd.DataFrame({'letters': letters, 'numbers': numbers})
>>> ZipcodeParser().fit_transform(df["sent"])
  letters  numbers
0      AB      123
1      BD      601
2      HS       14

We can use |ApplyToCols| to apply this transformer to the entire dataframe at once,
and set ``allow_reject=True`` to let rejected columns through without changes:

>>> from skrub import ApplyToCols
>>> ApplyToCols(ZipcodeParser(), allow_reject=True).fit_transform(df)
  letters  numbers received
0      AB      123   AB1C45
1      BD      601   DU3K93
2      HS       14   WB9M88

Note how the ``"received"`` column has been "rejected" and passed through unmodified.
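
The length and integer checks in ``ZipcodeParser`` do not verify that the first two
characters are letters, so a value such as ``"12345"`` would be accepted. As a minimal
sketch of a stricter check, assuming plain pandas only, the format can be validated
with a regular expression. The names ``ZIPCODE_PATTERN`` and ``split_zipcodes`` are
hypothetical and not part of skrub; inside a transformer, the ``ValueError`` would be
replaced by |RejectColumn|.

```python
import pandas as pd

# Hypothetical pattern (not part of skrub): two letters followed by three digits.
ZIPCODE_PATTERN = r"([A-Za-z]{2})(\d{3})"


def split_zipcodes(column: pd.Series) -> pd.DataFrame:
    """Split zip codes such as "AB123" into a letters and a numbers column."""
    # fullmatch is True only when the whole string matches the pattern;
    # na=False treats missing values as non-matching.
    if not column.str.fullmatch(ZIPCODE_PATTERN, na=False).all():
        raise ValueError("Expected two letters followed by three digits.")
    # extract returns one column per capture group, labelled 0 and 1.
    parts = column.str.extract(ZIPCODE_PATTERN)
    return pd.DataFrame({"letters": parts[0], "numbers": parts[1].astype(int)})


result = split_zipcodes(pd.Series(["AB123", "BD601", "HS014"]))
print(result)
```

With this check, both ``"AB1C45"`` and ``"12345"`` are refused, whereas the
length-based check only refuses the former.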


Rejection handling with |ApplyToCols| and |RejectColumn|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Combining |ApplyToCols| and |RejectColumn| allows flexible manipulation
and error checking of dataframes. In the previous example, we chose to ignore the
malformed ``"received"`` column by setting ``allow_reject=True``. If, instead,
we want the operation to fail when the transformer encounters a column it cannot
parse, we can keep the default ``allow_reject=False``, so that an error is raised
as soon as a malformed column is encountered:

>>> ApplyToCols(ZipcodeParser()).fit_transform(df)
Traceback (most recent call last):
    ...
ValueError: Transformer ZipcodeParser.fit_transform failed on column 'received'. ...

Letting rejected columns through is useful when we do not know the content of the
columns in advance, for example when converting the columns of a dataframe to
datetimes without knowing which ones actually contain dates.

>>> from skrub import ToDatetime
>>> df = pd.DataFrame(dict(birthday=["29/01/2024"], city=["London"]))
>>> df
     birthday    city
0  29/01/2024  London
>>> df.dtypes
birthday    ...
city        ...
dtype: object

Converting a column that contains dates works:

>>> ToDatetime().fit_transform(df["birthday"])
0   2024-01-29
Name: birthday, dtype: datetime64[...]

A column that does not contain dates raises |RejectColumn|:

>>> ToDatetime().fit_transform(df["city"])
Traceback (most recent call last):
    ...
skrub.core.RejectColumn: Could not find a datetime format for column 'city'.

The ``allow_reject`` parameter of |ApplyToCols| makes it possible to apply the same
transformer to all columns without worrying about which ones will actually be converted.
Here, |ToDatetime| is applied only to the ``"birthday"`` column, while ``"city"`` is
passed through unchanged and no exception is raised.

>>> to_datetime = ApplyToCols(ToDatetime(), allow_reject=True)
>>> transformed = to_datetime.fit_transform(df)
>>> transformed
    birthday    city
0 2024-01-29  London

We can see that the only column with a fitted transformer is ``"birthday"``:

>>> to_datetime.transformers_
{'birthday': ToDatetime()}
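
The rejection mechanism described in this section can be pictured as a loop over
columns that catches the rejection signal. The following is only an illustrative
sketch in plain pandas and is not how skrub implements |ApplyToCols|: ``Rejected``,
``parse_dates`` and ``apply_columnwise`` are hypothetical names, the date format is
assumed to be day/month/year, and each transformation is assumed to return a single
column.

```python
import pandas as pd


class Rejected(Exception):
    """Hypothetical stand-in for skrub's RejectColumn."""


def parse_dates(column: pd.Series) -> pd.Series:
    """Hypothetical transformation: parse day/month/year strings or reject."""
    try:
        return pd.to_datetime(column, format="%d/%m/%Y")
    except (ValueError, TypeError) as exc:
        raise Rejected(f"Could not parse column {column.name!r}.") from exc


def apply_columnwise(df, func, allow_reject=False):
    """Apply func to each column, optionally passing rejected columns through."""
    results = {}
    for name in df.columns:
        try:
            results[name] = func(df[name])
        except Rejected as exc:
            if not allow_reject:
                # Strict mode: a rejection becomes a hard failure.
                raise ValueError(f"Failed on column {name!r}.") from exc
            # Lenient mode: keep the rejected column unchanged.
            results[name] = df[name]
    return pd.DataFrame(results)


df = pd.DataFrame({"birthday": ["29/01/2024"], "city": ["London"]})
out = apply_columnwise(df, parse_dates, allow_reject=True)
print(out.dtypes)
```

In this sketch, calling ``apply_columnwise(df, parse_dates)`` without
``allow_reject=True`` raises a ``ValueError`` on the ``"city"`` column, which mirrors
the strict default described above.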