---
title: "Applying transformers to columns"
format:
  html:
    toc: true
  revealjs:
    slide-number: true
    toc: false
    code-fold: false
    code-tools: true
---
## Introduction
Often, transformers need to be applied only to a subset of columns, rather than
the entire dataframe.
As an example, it does not make sense to apply a `StandardScaler` to a column
that contains strings, and indeed doing so would raise an exception.
In other cases, specific columns may need particular treatment, and should therefore
be ignored by the `Cleaner`.
Scikit-learn provides the `ColumnTransformer` to deal with this:
```{python}
#| echo: true
import pandas as pd
from sklearn.compose import make_column_selector as selector
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
df = pd.DataFrame({"text": ["foo", "bar", "baz"], "number": [1, 2, 3]})
categorical_columns = selector(dtype_include=object)(df)
numerical_columns = selector(dtype_exclude=object)(df)
ct = make_column_transformer(
    (StandardScaler(), numerical_columns),
    (OneHotEncoder(handle_unknown="ignore"), categorical_columns),
)
transformed = ct.fit_transform(df)
transformed
```
`make_column_selector` makes it possible to choose columns based on their datatype,
or to filter column names with a regular expression. In some cases, this degree of
control is not sufficient.
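For reference, here is a sketch of the regex-based option, using the `pattern`
parameter; the dataframe and its column names are made up for illustration:
```{python}
# Made-up example: select columns whose name starts with "num_".
import pandas as pd
from sklearn.compose import make_column_selector as selector

toy = pd.DataFrame({"num_a": [1.0], "num_b": [2.0], "label": ["x"]})
num_cols = selector(pattern="^num_")(toy)
num_cols
```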
To address such situations, skrub implements several transformers that can modify
columns from within scikit-learn pipelines. Additionally, the selectors API makes
it possible to build powerful, custom column selection filters.
`SelectCols` and `DropCols` are transformers that can be used as part of a
pipeline to filter columns according to the selectors API, while `ApplyToCols` and
`ApplyToFrame` replicate the behavior of the `ColumnTransformer` with a different
syntax and access to the selectors.
## Selection operations in a scikit-learn pipeline
`SelectCols` and `DropCols` allow selecting or removing specific columns in a
dataframe according to user-provided rules: for example, removing columns that
contain null values, or keeping only columns with a specific dtype.
`SelectCols` and `DropCols` take a `cols` parameter to choose which columns to
select or drop respectively.
```{python}
df = pd.DataFrame({
    "date": ["03 January 2023", "04 February 2023", "05 March 2023"],
    "values": [10, 20, 30]
})
df
```
We can select or drop columns by name, or with more complex rules
(see the next chapter).
```{python}
from skrub import SelectCols
SelectCols("date").fit_transform(df)
```
```{python}
from skrub import DropCols
DropCols("date").fit_transform(df)
```
## `ApplyToCols` and `ApplyToFrame`
Besides selecting and dropping columns, pre-processing pipelines are intended to
_transform_ specific columns in specific ways. To make this process easier, skrub
provides the `ApplyToCols` and `ApplyToFrame` transformers.
### Applying a transformer to separate columns: `ApplyToCols`
In many cases, `ApplyToCols` can be a direct replacement for the `ColumnTransformer`,
as in the following example:
```{python}
#| echo: true
import skrub.selectors as s
from sklearn.pipeline import make_pipeline
from skrub import ApplyToCols
numeric = ApplyToCols(StandardScaler(), cols=s.numeric())
string = ApplyToCols(OneHotEncoder(sparse_output=False), cols=s.string())
transformed = make_pipeline(numeric, string).fit_transform(df)
transformed
```
In this case, we are applying the `StandardScaler` only to numeric features using
`s.numeric()`, and `OneHotEncoder` with `s.string()`.
Under the hood, `ApplyToCols` selects all columns that satisfy the condition specified
in `cols` (in this case, that the dtype is numeric), then clones and applies the
specified transformer (`StandardScaler`) to each column _separately_.
::: {.callout-important}
Columns that are not selected are passed through without any change, thus string
columns are not touched by the `numeric` transformer.
:::
Because unselected columns are passed through unchanged, several `ApplyToCols`
instances can be chained by placing them in a scikit-learn pipeline.
::: {.callout-important}
`ApplyToCols` is intended to work on dataframes, which are **dense**. As a result,
transformers that produce sparse output (like the `OneHotEncoder` by default) must
be configured to produce dense output instead.
:::
### Applying the same transformer to multiple columns at once: `ApplyToFrame`
In some cases, a single transformer should be applied to several columns at once,
rather than to each column separately.
Consider this example dataframe, which contains some patient information and some
metrics.
```{python}
import pandas as pd
import numpy as np
n_patients = 20
np.random.seed(42)
df = pd.DataFrame({
    "patient_id": [f"P{i:03d}" for i in range(n_patients)],
    "age": np.random.randint(18, 80, size=n_patients),
    "sex": np.random.choice(["M", "F"], size=n_patients),
})
for i in range(5):
    df[f"metric_{i}"] = np.random.normal(loc=50, scale=10, size=n_patients)
df["diagnosis"] = np.random.choice(["A", "B", "C"], size=n_patients)
df.head()
```
With `ApplyToFrame`, it is easy to apply a decomposition algorithm such as `PCA`
to condense the `metric_*` columns into a smaller number of features:
```{python}
from skrub import ApplyToFrame
from sklearn.decomposition import PCA
reduce = ApplyToFrame(PCA(n_components=2), cols=s.glob("metric_*"))
df_reduced = reduce.fit_transform(df)
df_reduced.head()
```
### The `allow_reject` parameter
When `ApplyToCols` or `ApplyToFrame` wrap a skrub transformer, the `allow_reject`
parameter provides extra flexibility: with `allow_reject=True`, columns that the
transformer cannot handle are ignored rather than causing an exception.
Consider this example. By default, `ToDatetime` raises a `RejectColumn` exception
when it finds a column it cannot convert to datetime.
```{python}
from skrub import ToDatetime
df = pd.DataFrame({
    "date": ["03 January 2023", "04 February 2023", "05 March 2023"],
    "values": [10, 20, 30]
})
df
```
By setting `allow_reject=True`, the datetime column is converted properly and
the other column is passed through without issues.
```{python}
with_reject = ApplyToCols(ToDatetime(), allow_reject=True)
with_reject.fit_transform(df)
```
## Concatenating the skrub column transformers
Skrub column transformers can be concatenated by using scikit-learn pipelines.
In the following example, we first select only the column `patient_id`, then encode
it using `OneHotEncoder` and finally use `PCA` to reduce the number of dimensions.
This is done by wrapping the latter two steps in `ApplyToCols` and `ApplyToFrame`
respectively, and then putting all transformers in order in a scikit-learn pipeline
using `make_pipeline`.
```{python}
from sklearn.pipeline import make_pipeline
from skrub import SelectCols
df = pd.DataFrame({
    "patient_id": [f"P{i:03d}" for i in range(n_patients)],
    "age": np.random.randint(18, 80, size=n_patients),
    "sex": np.random.choice(["M", "F"], size=n_patients),
})
select = SelectCols("patient_id")
encode = ApplyToCols(OneHotEncoder(sparse_output=False))
reduce = ApplyToFrame(PCA(n_components=2))
transform = make_pipeline(select, encode, reduce)
dft = transform.fit_transform(df)
dft.head(5)
```
### The order of column transformations is important
Some care must be taken when concatenating column transformers, in particular
when selection is based on datatypes. Consider this case:
```{python}
encode = ApplyToCols(OneHotEncoder(sparse_output=False), cols=s.string())
scale = ApplyToCols(StandardScaler(), cols=s.numeric())
```
In the first pipeline we encode and then scale; in the second we scale first and
then encode.
```{python}
transform_1 = make_pipeline(encode, scale)
dft = transform_1.fit_transform(df)
dft.head(5)
```
```{python}
transform_2 = make_pipeline(scale, encode)
dft = transform_2.fit_transform(df)
dft.head(5)
```
The result of `transform_1` is that the features that have been generated by
the `OneHotEncoder` are then scaled by the `StandardScaler`, because the new
features are numeric and are therefore selected in the next step.
In many cases, this behavior is not desired: while some models (such as tree-based
models) may not be affected by the ordering, linear models and neural networks may
produce worse results.
## Conclusions
In this chapter we explored how skrub helps with selecting and transforming
specific columns. While these transformers can operate on simple lists of column
names, they become far more flexible and powerful when combined with the skrub
selectors, which are the subject of the next chapter.