Transforming selected columns with ApplyToCols#

Very often and for various reasons, transformers must be applied only to some of the columns in a dataframe. For example, all numeric columns in a dataframe may need to be scaled at the same time, while string columns should be left alone. While the heuristics used by the TableVectorizer are usually good enough to apply the proper transformers to different datatypes, using it may not be an option in all cases. In scikit-learn pipelines, the column selection operation can be done with the ColumnTransformer.

Skrub provides the ApplyToCols transformer to achieve the same results with a larger degree of control over which columns are being transformed. ApplyToCols maps a transformer to columns in a dataframe, so that all columns that satisfy a certain condition are transformed, while the others are left untouched.

Tip

If a skrub transformer has a cols parameter to specify a column list, that can be a selector as well. Selectors give more control over which columns are being transformed: they are discussed at length in the selectors user guide.

ApplyToCols can be used to transform a subset of columns in a dataframe, while leaving the non-selected columns unchanged. In this example, we want to apply an OrdinalEncoder only on the text column, and a StandardScaler on the numeric column. Columns that aren’t selected are passed through unchanged, and this allows to concatenate ApplyToCols transformers with make_pipeline.

>>> import pandas as pd
>>> df = pd.DataFrame({"text": ["foo", "bar", "baz"], "number": [1, 2, 3]})

We use the string() selector to choose only the text column, and numeric() to select only the numeric column:

>>> import skrub.selectors as s
>>> from skrub import ApplyToCols
>>> from sklearn.preprocessing import OrdinalEncoder, StandardScaler
>>>
>>> numeric = ApplyToCols(StandardScaler(), cols=s.numeric())
>>> string = ApplyToCols(OrdinalEncoder(), cols=s.string())

We then concatenate the two with make_pipeline:

>>> from sklearn.pipeline import make_pipeline
>>> transformed = make_pipeline(numeric, string).fit_transform(df)
>>> transformed
     number  text
0 -1.224745   2.0
1  0.000000   0.0
2  1.224745   1.0

If ApplyToCols is used with a transformer that inherits from SingleColumnTransformer, or one that has the __single_column_transformer__ attribute, then the transformer will be cloned and applied separately to each column. Most skrub transformers belong to this category.

Here we want to apply ToDatetime to each of the datetime columns to convert them to datetime dtype. ApplyToCols automatically detects that ToDatetime should be applied to each column separately:

>>> from skrub._to_datetime import ToDatetime
>>> df = pd.DataFrame({
...     'date_1': ['2024-01-15', '2024-02-20', '2024-03-10'],
...     'date_2': ['2023-12-01', '2024-01-05', '2024-02-28']
... })
>>> df_enc = ApplyToCols(ToDatetime()).fit_transform(df)
>>> df_enc
      date_1     date_2
0 2024-01-15 2023-12-01
1 2024-02-20 2024-01-05
2 2024-03-10 2024-02-28
>>> df_enc.dtypes
date_1    datetime64[...]
date_2    datetime64[...]
dtype: ...

We can also combine ApplyToCols with TableVectorizer to only vectorize columns specific columns and avoid others, like ID columns:

>>> from skrub import TableVectorizer
>>> df = pd.DataFrame({
...     'id': ["c1", "c2", "c3"],
...     'city': ['Paris', 'Rome', 'Madrid'],
...     'date': ['2023-01-15', '2023-02-20', '2023-03-10']
... })
>>> ApplyToCols(TableVectorizer(), cols=s.all() - "id").fit_transform(df)
id  city_Madrid  city_Paris  city_Rome  date_year  date_month  date_day  date_total_seconds
0  c1          0.0         1.0        0.0     2023.0         1.0      15.0        1.673741e+09
1  c2          0.0         0.0        1.0     2023.0         2.0      20.0        1.676851e+09
2  c3          1.0         0.0        0.0     2023.0         3.0      10.0        1.678406e+09

Note that the column “id” was not encoded and was instead left as-is.

Dealing with columns that cannot be handled by a transformer#

ApplyToCols can allow the underlying encoder to decide which columns it can be applied to. For example, if we do not know in advance which columns can be transformed to datetime, we can use ApplyToCols to map ToDatetime to all columns in a dataframe and pass allow_reject=True. In that case, non-datetime columns. By default, all columns in cols must be transformed, and if one of them cannot be transformed an exception will be raised and the transformation will fail.

It is possible to change how rejected columns are handled through the allow_reject parameter. By default, no special handling is performed and rejections are considered to be errors:

>>> from skrub._to_datetime import ToDatetime
>>> df = pd.DataFrame(dict(birthday=["29/01/2024"], city=["London"]))
>>> df
    birthday    city
0  29/01/2024  London
>>> to_datetime = ApplyToCols(ToDatetime())
>>> to_datetime.fit_transform(df)
Traceback (most recent call last):
    ...
ValueError: Transformer ToDatetime.fit_transform failed on column 'city'. See above for the full traceback.

However, setting allow_reject=True gives the transformer itself some control over which columns it should be applied to. For example, we can try to parse all columns but allow the transformer to reject those that, upon inspection, do not contain dates.

>>> to_datetime = ApplyToCols(ToDatetime(), allow_reject=True)
>>> transformed = to_datetime.fit_transform(df)
>>> transformed
    birthday    city
0 2024-01-29  London
>>> transformed.dtypes
birthday    datetime64[...]
city                ...
dtype: ...

Advanced usage of ApplyToCols#

For more advanced use cases, refer to the examples section of the ApplyToCols docstring, and to this user guide section.