Often, transformers need to be applied only to a subset of columns, rather than the entire dataframe.
As an example, it does not make sense to apply a StandardScaler to a column that contains strings, and indeed doing so would raise an exception. In other cases, specific columns may need particular treatment, and should therefore be ignored by the Cleaner.
Scikit-learn provides the ColumnTransformer to deal with this:
make_column_selector allows choosing columns based on their datatype, or filtering column names with a regular expression. In some cases, this degree of control is not sufficient.
To address such situations, skrub implements several transformers for modifying columns from within scikit-learn pipelines. Additionally, the selectors API makes it possible to implement powerful, custom column-selection filters.
SelectCols and DropCols are transformers that can be used as part of a pipeline to filter columns according to the selectors API, while ApplyToCols and ApplyToFrame replicate the ColumnTransformer behavior with a different syntax and access to the selectors.
6.2 Selection operations in a scikit-learn pipeline
SelectCols and DropCols allow selecting or removing specific columns of a dataframe according to user-provided rules: for example, removing columns that contain null values, or selecting only columns of a specific dtype.
Both take a cols parameter that specifies which columns to select or drop, respectively.
```python
import pandas as pd

df = pd.DataFrame({
    "date": ["03 January 2023", "04 February 2023", "05 March 2023"],
    "values": [10, 20, 30],
})
df
```

```
               date  values
0   03 January 2023      10
1  04 February 2023      20
2     05 March 2023      30
```
We can selectively choose or drop columns based on names, or more complex rules (see the next chapter).
```python
from skrub import SelectCols

SelectCols("date").fit_transform(df)
```

```
               date
0   03 January 2023
1  04 February 2023
2     05 March 2023
```
```python
from skrub import DropCols

DropCols("date").fit_transform(df)
```

```
   values
0      10
1      20
2      30
```
6.3 ApplyToCols and ApplyToFrame
Besides selecting and dropping columns, pre-processing pipelines are intended to transform specific columns in specific ways. To make this process easier, skrub provides the ApplyToCols and ApplyToFrame transformers.
6.3.1 Applying a transformer to separate columns: ApplyToCols
In many cases, ApplyToCols can be a direct replacement for the ColumnTransformer, as in the following example:
In this case, we are applying the StandardScaler only to numeric columns, selected with s.numeric(), and the OneHotEncoder only to string columns, selected with s.string().
Under the hood, ApplyToCols selects all columns that satisfy the condition specified in cols (in this case, that the dtype is numeric), then clones the specified transformer (here, the StandardScaler) and applies a separate clone to each selected column.
Important
Columns that are not selected are passed through without any change, thus string columns are not touched by the numeric transformer.
Because unselected columns are passed through unchanged, several ApplyToCols transformers can be chained by putting them in a scikit-learn pipeline.
Important
ApplyToCols is intended to work on dataframes, which are dense. As a result, transformers that produce sparse outputs (like the OneHotEncoder) must be configured to produce dense output (for the OneHotEncoder, by setting sparse_output=False).
6.3.2 Applying the same transformer to multiple columns at once: ApplyToFrame
In some cases, there may be a need to apply the same transformer only to a subset of columns in a dataframe.
Consider this example dataframe, which contains some patient information along with several numeric metrics.
```python
import pandas as pd
import numpy as np

n_patients = 20
np.random.seed(42)
df = pd.DataFrame({
    "patient_id": [f"P{i:03d}" for i in range(n_patients)],
    "age": np.random.randint(18, 80, size=n_patients),
    "sex": np.random.choice(["M", "F"], size=n_patients),
})
for i in range(5):
    df[f"metric_{i}"] = np.random.normal(loc=50, scale=10, size=n_patients)
df["diagnosis"] = np.random.choice(["A", "B", "C"], size=n_patients)
df.head()
```

```
  patient_id  age sex   metric_0   metric_1   metric_2   metric_3   metric_4 diagnosis
0       P000   56   F  39.871689  52.088636  41.607825  50.870471  52.961203         B
1       P001   69   M  53.142473  30.403299  46.907876  47.009926  52.610553         A
2       P002   46   F  40.919759  36.718140  53.312634  50.917608  50.051135         B
3       P003   32   F  35.876963  51.968612  59.755451  30.124311  47.654129         B
4       P004   60   F  64.656488  57.384666  45.208258  47.803281  35.846293         C
```
With ApplyToFrame, it is easy to apply a decomposition algorithm such as PCA to condense the metric_* columns into a smaller number of features:
```python
from sklearn.decomposition import PCA
from skrub import ApplyToFrame
from skrub import selectors as s

reduce = ApplyToFrame(PCA(n_components=2), cols=s.glob("metric_*"))
df_reduced = reduce.fit_transform(df)
df_reduced.head()
```

```
  patient_id  age sex diagnosis       pca0       pca1
0       P000   56   F         B  -2.647377   7.025046
1       P001   69   M         A  -2.480564 -11.246997
2       P002   46   F         B   4.274840  -5.039065
3       P003   32   F         B  14.116747  15.620615
4       P004   60   F         C -19.073862   1.186541
```
6.3.3 The allow_reject parameter
When ApplyToCols or ApplyToFrame wrap a skrub transformer, the allow_reject parameter provides extra flexibility. By setting allow_reject to True, columns that cannot be handled by the transformer are ignored rather than raising an exception.
Consider this example. By default, ToDatetime raises a RejectColumn exception when it finds a column it cannot convert to datetime.
```python
from skrub import ToDatetime

df = pd.DataFrame({
    "date": ["03 January 2023", "04 February 2023", "05 March 2023"],
    "values": [10, 20, 30],
})
df
```

```
               date  values
0   03 January 2023      10
1  04 February 2023      20
2     05 March 2023      30
```
By setting allow_reject=True, the datetime column is converted properly and the other column is passed through without issues.
6.4 Combining column transformers in a pipeline
Skrub column transformers can be chained by using scikit-learn pipelines. In the following example, we first select only the patient_id column, then encode it using the OneHotEncoder, and finally use PCA to reduce the number of dimensions.
This is done by wrapping the latter two steps in ApplyToCols and ApplyToFrame respectively, and then putting all transformers in order in a scikit-learn pipeline using make_pipeline.
When chaining such steps, order matters: features generated by the OneHotEncoder are numeric, so a subsequent step that selects numeric columns (for example, an ApplyToCols wrapping a StandardScaler) will scale them as well.
In many cases, this behavior is not desired: while some model types (such as tree-based models) may not be affected by the different ordering, linear models and NN-based models may produce worse results.
6.5 Conclusions
In this chapter we explored how skrub helps with selecting and transforming specific columns using various transformers. While these transformers can accept simple lists of column names, they become far more flexible and powerful when combined with the skrub selectors, which are the subject of the next chapter.