6  Applying transformers to columns

6.1 Introduction

Often, transformers need to be applied only to a subset of columns, rather than the entire dataframe.

As an example, it does not make sense to apply a StandardScaler to a column that contains strings, and indeed doing so would raise an exception. In other cases, specific columns may need particular treatment, and should therefore be ignored by the Cleaner.

Scikit-learn provides the ColumnTransformer to deal with this:

import pandas as pd
from sklearn.compose import make_column_selector as selector
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({"text": ["foo", "bar", "baz"], "number": [1, 2, 3]})

categorical_columns = selector(dtype_include=object)(df)
numerical_columns = selector(dtype_exclude=object)(df)

ct = make_column_transformer(
    (StandardScaler(), numerical_columns),
    (OneHotEncoder(handle_unknown="ignore"), categorical_columns),
)
transformed = ct.fit_transform(df)
transformed
array([[-1.22474487,  0.        ,  0.        ,  1.        ],
       [ 0.        ,  1.        ,  0.        ,  0.        ],
       [ 1.22474487,  0.        ,  1.        ,  0.        ]])

make_column_selector allows choosing columns based on their datatype, or filtering column names with a regular expression. In some cases, this degree of control is not sufficient.
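For instance, the name-based selection works by passing a regex pattern to make_column_selector (the dataframe and its column names here are made up for illustration):

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical dataframe mixing measurement columns and a label column
df = pd.DataFrame({
    "metric_0": [1.0, 2.0],
    "metric_1": [3.0, 4.0],
    "label": ["a", "b"],
})

# pattern= keeps the columns whose names match the regular expression
metric_columns = selector(pattern="^metric_")(df)
print(metric_columns)  # ['metric_0', 'metric_1']
```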

To address such situations, skrub implements several transformers that can modify columns from within scikit-learn pipelines. Additionally, the selectors API makes it possible to implement powerful, custom column selection filters.

SelectCols and DropCols are transformers that can be used as part of a pipeline to filter columns according to the selectors API, while ApplyToCols and ApplyToFrame replicate the ColumnTransformer behavior with a different syntax and access to the selectors.

6.2 Selection operations in a scikit-learn pipeline

SelectCols and DropCols allow selecting or removing specific columns in a dataframe according to user-provided rules: for example, removing columns that contain null values, or keeping only columns with a specific dtype.

SelectCols and DropCols take a cols parameter to choose which columns to select or drop, respectively.

df = pd.DataFrame({
    "date": ["03 January 2023", "04 February 2023", "05 March 2023"],
    "values": [10, 20, 30]
})
df
date values
0 03 January 2023 10
1 04 February 2023 20
2 05 March 2023 30

We can selectively choose or drop columns based on names, or more complex rules (see the next chapter).

from skrub import SelectCols
SelectCols("date").fit_transform(df)
date
0 03 January 2023
1 04 February 2023
2 05 March 2023
from skrub import DropCols
DropCols("date").fit_transform(df)
values
0 10
1 20
2 30

6.3 ApplyToCols and ApplyToFrame

Besides selecting and dropping columns, pre-processing pipelines are intended to transform specific columns in specific ways. To make this process easier, skrub provides the ApplyToCols and ApplyToFrame transformers.

6.3.1 Applying a transformer to separate columns: ApplyToCols

In many cases, ApplyToCols can be a direct replacement for the ColumnTransformer, as in the following example:

import skrub.selectors as s
from sklearn.pipeline import make_pipeline
from skrub import ApplyToCols

numeric = ApplyToCols(StandardScaler(), cols=s.numeric())
string = ApplyToCols(OneHotEncoder(sparse_output=False), cols=s.string())

transformed = make_pipeline(numeric, string).fit_transform(df)
transformed
date_03 January 2023 date_04 February 2023 date_05 March 2023 values
0 1.0 0.0 0.0 -1.224745
1 0.0 1.0 0.0 0.000000
2 0.0 0.0 1.0 1.224745

In this case, we apply the StandardScaler only to numeric columns, selected with s.numeric(), and the OneHotEncoder only to string columns, selected with s.string().

Under the hood, ApplyToCols selects all columns that satisfy the condition specified in cols (in this case, that the dtype is numeric), then clones the specified transformer (StandardScaler) and applies a separate clone to each selected column.

Important

Columns that are not selected are passed through without any change, thus string columns are not touched by the numeric transformer.

Because unselected columns are passed through unchanged, several ApplyToCols transformers can be chained by putting them in a scikit-learn pipeline.

Important

ApplyToCols is intended to work on dataframes, which are dense. As a result, transformers that produce sparse output by default (like the OneHotEncoder) must be configured to produce dense output instead.

6.3.2 Applying the same transformer to multiple columns at once: ApplyToFrame

In some cases, there may be a need to apply the same transformer only to a subset of columns in a dataframe.

Consider this example dataframe, which contains some patient information and some metric columns.

import pandas as pd
import numpy as np

n_patients = 20
np.random.seed(42)
df = pd.DataFrame({
    "patient_id": [f"P{i:03d}" for i in range(n_patients)],
    "age": np.random.randint(18, 80, size=n_patients),
    "sex": np.random.choice(["M", "F"], size=n_patients),
})

for i in range(5):
    df[f"metric_{i}"] = np.random.normal(loc=50, scale=10, size=n_patients)

df["diagnosis"] = np.random.choice(["A", "B", "C"], size=n_patients)
df.head()
patient_id age sex metric_0 metric_1 metric_2 metric_3 metric_4 diagnosis
0 P000 56 F 39.871689 52.088636 41.607825 50.870471 52.961203 B
1 P001 69 M 53.142473 30.403299 46.907876 47.009926 52.610553 A
2 P002 46 F 40.919759 36.718140 53.312634 50.917608 50.051135 B
3 P003 32 F 35.876963 51.968612 59.755451 30.124311 47.654129 B
4 P004 60 F 64.656488 57.384666 45.208258 47.803281 35.846293 C

With ApplyToFrame, it is easy to apply a decomposition algorithm such as PCA to condense the metric_* columns into a smaller number of features:

from skrub import ApplyToFrame
from sklearn.decomposition import PCA

reduce = ApplyToFrame(PCA(n_components=2), cols=s.glob("metric_*"))

df_reduced = reduce.fit_transform(df)
df_reduced.head()
patient_id age sex diagnosis pca0 pca1
0 P000 56 F B -2.647377 7.025046
1 P001 69 M A -2.480564 -11.246997
2 P002 46 F B 4.274840 -5.039065
3 P003 32 F B 14.116747 15.620615
4 P004 60 F C -19.073862 1.186541

6.3.3 The allow_reject parameter

When ApplyToCols or ApplyToFrame wrap a skrub transformer, they can use the allow_reject parameter for more flexibility. By setting allow_reject to True, columns that cannot be handled by the given transformer are ignored rather than raising an exception.

Consider this example. By default, ToDatetime raises a RejectColumn exception when it finds a column it cannot convert to datetime.

from skrub import ToDatetime
df = pd.DataFrame({
    "date": ["03 January 2023", "04 February 2023", "05 March 2023"],
    "values": [10, 20, 30]
})
df
date values
0 03 January 2023 10
1 04 February 2023 20
2 05 March 2023 30

By setting allow_reject=True, the datetime column is converted properly and the other column is passed through without issues.

with_reject = ApplyToCols(ToDatetime(), allow_reject=True)
with_reject.fit_transform(df)
date values
0 2023-01-03 10
1 2023-02-04 20
2 2023-03-05 30

6.4 Concatenating the skrub column transformers

Skrub column transformers can be concatenated using scikit-learn pipelines. In the following example, we first select only the patient_id column, then encode it using the OneHotEncoder, and finally use PCA to reduce the number of dimensions.

This is done by wrapping the latter two steps in ApplyToCols and ApplyToFrame respectively, and then putting all transformers in order in a scikit-learn pipeline using make_pipeline.

from sklearn.pipeline import make_pipeline
from skrub import SelectCols

df = pd.DataFrame({
    "patient_id": [f"P{i:03d}" for i in range(n_patients)],
    "age": np.random.randint(18, 80, size=n_patients),
    "sex": np.random.choice(["M", "F"], size=n_patients),
})

select = SelectCols("patient_id")
encode = ApplyToCols(OneHotEncoder(sparse_output=False))
reduce = ApplyToFrame(PCA(n_components=2))

transform = make_pipeline(select, encode, reduce)
dft = transform.fit_transform(df)
dft.head(5)
pca0 pca1
0 1.451188e-17 9.393890e-18
1 -2.405452e-02 9.397337e-01
2 -2.305851e-01 9.374222e-03
3 -5.287468e-02 9.374222e-03
4 -7.954573e-02 -6.807817e-03

6.4.1 The order of column transformations is important

Some care must be taken when concatenating column transformers, in particular when selection is based on datatypes. Consider this case:

encode = ApplyToCols(OneHotEncoder(sparse_output=False), cols=s.string())
scale = ApplyToCols(StandardScaler(), cols=s.numeric())

In the first case, we encode and then scale; in the second, we scale first and then encode.

transform_1 = make_pipeline(encode, scale)
dft = transform_1.fit_transform(df)
dft.head(5)
patient_id_P000 patient_id_P001 patient_id_P002 patient_id_P003 patient_id_P004 patient_id_P005 patient_id_P006 patient_id_P007 patient_id_P008 patient_id_P009 ... patient_id_P013 patient_id_P014 patient_id_P015 patient_id_P016 patient_id_P017 patient_id_P018 patient_id_P019 age sex_F sex_M
0 4.358899 -0.229416 -0.229416 -0.229416 -0.229416 -0.229416 -0.229416 -0.229416 -0.229416 -0.229416 ... -0.229416 -0.229416 -0.229416 -0.229416 -0.229416 -0.229416 -0.229416 -1.301570 0.904534 -0.904534
1 -0.229416 4.358899 -0.229416 -0.229416 -0.229416 -0.229416 -0.229416 -0.229416 -0.229416 -0.229416 ... -0.229416 -0.229416 -0.229416 -0.229416 -0.229416 -0.229416 -0.229416 -0.709947 0.904534 -0.904534
2 -0.229416 -0.229416 4.358899 -0.229416 -0.229416 -0.229416 -0.229416 -0.229416 -0.229416 -0.229416 ... -0.229416 -0.229416 -0.229416 -0.229416 -0.229416 -0.229416 -0.229416 0.059162 -1.105542 1.105542
3 -0.229416 -0.229416 -0.229416 4.358899 -0.229416 -0.229416 -0.229416 -0.229416 -0.229416 -0.229416 ... -0.229416 -0.229416 -0.229416 -0.229416 -0.229416 -0.229416 -0.229416 -1.479057 -1.105542 1.105542
4 -0.229416 -0.229416 -0.229416 -0.229416 4.358899 -0.229416 -0.229416 -0.229416 -0.229416 -0.229416 ... -0.229416 -0.229416 -0.229416 -0.229416 -0.229416 -0.229416 -0.229416 -0.473298 0.904534 -0.904534

5 rows × 23 columns

transform_2 = make_pipeline(scale, encode)
dft = transform_2.fit_transform(df)
dft.head(5)
patient_id_P000 patient_id_P001 patient_id_P002 patient_id_P003 patient_id_P004 patient_id_P005 patient_id_P006 patient_id_P007 patient_id_P008 patient_id_P009 ... patient_id_P013 patient_id_P014 patient_id_P015 patient_id_P016 patient_id_P017 patient_id_P018 patient_id_P019 age sex_F sex_M
0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -1.301570 1.0 0.0
1 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -0.709947 1.0 0.0
2 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.059162 0.0 1.0
3 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -1.479057 0.0 1.0
4 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -0.473298 1.0 0.0

5 rows × 23 columns

The result of transform_1 is that the features generated by the OneHotEncoder are then scaled by the StandardScaler, because the new features are numeric and are therefore selected by the next step.

In many cases, this behavior is not desired: while some model types (such as tree-based models) may be unaffected by the ordering, linear models and neural networks may produce worse results.

6.5 Conclusions

In this chapter we explored how skrub helps with selecting and transforming specific columns using various transformers. While these transformers can work with simple lists of columns, they become far more flexible and powerful when combined with the skrub selectors, which are the subject of the next chapter.