Operating over multiple columns at once

Transformers often need to be applied to several columns at once: for example, all numeric columns in a dataframe may need to be scaled together. The heuristics used by the TableVectorizer are usually good enough to apply the proper transformers to different datatypes, but using it may not be an option in all cases. In scikit-learn pipelines, column selection is done with the sklearn.compose.ColumnTransformer:

>>> import pandas as pd
>>> from sklearn.compose import make_column_selector as selector
>>> from sklearn.compose import make_column_transformer
>>> from sklearn.preprocessing import StandardScaler, OneHotEncoder
>>>
>>> df = pd.DataFrame({"text": ["foo", "bar", "baz"], "number": [1, 2, 3]})
>>>
>>> categorical_columns = selector(dtype_include=object)(df)
>>> numerical_columns = selector(dtype_exclude=object)(df)
>>>
>>> ct = make_column_transformer(
...       (StandardScaler(),
...        numerical_columns),
...       (OneHotEncoder(handle_unknown="ignore"),
...        categorical_columns))
>>> transformed = ct.fit_transform(df)
>>> transformed
array([[-1.22474487,  0.        ,  0.        ,  1.        ],
       [ 0.        ,  1.        ,  0.        ,  0.        ],
       [ 1.22474487,  0.        ,  1.        ,  0.        ]])

Skrub provides alternative transformers that can achieve the same results:

  • ApplyToCols maps a transformer to columns in a dataframe, so that all columns that satisfy a certain condition are transformed, while the others are left untouched.

  • ApplyToFrame applies a transformer to a collection of columns at once. This is different from ApplyToCols, which instead transforms each column one at a time.

  • SelectCols allows specifying which columns should be kept.

  • DropCols allows specifying the columns we want to discard.

SelectCols and DropCols can be combined with the skrub DataOps to perform complex tasks such as feature selection: refer to Feature selection with skrub SelectCols and DropCols for more details.

All multi-column transformers provided by skrub can take skrub selectors as parameters to have more control over the columns that are being transformed. Skrub selectors are discussed at length in Skrub Selectors: helpers for selecting columns in a dataframe.

Applying transformations to the columns with ApplyToCols and ApplyToFrame

ApplyToCols can be used to transform a subset of columns in a dataframe, while leaving the remaining columns unchanged. It simplifies operations such as the example above, which can be rewritten with ApplyToCols as follows:

>>> import skrub.selectors as s
>>> from sklearn.pipeline import make_pipeline
>>> from skrub import ApplyToCols
>>>
>>> numeric = ApplyToCols(StandardScaler(), cols=s.numeric())
>>> string = ApplyToCols(OneHotEncoder(sparse_output=False), cols=s.string())
>>>
>>> transformed = make_pipeline(numeric, string).fit_transform(df)
>>> transformed
   text_bar  text_baz  text_foo    number
0       0.0       0.0       1.0 -1.224745
1       1.0       0.0       0.0  0.000000
2       0.0       1.0       0.0  1.224745

A transformer wrapped by ApplyToCols can raise a RejectColumn exception if it cannot handle a specific column:

>>> from skrub import ToDatetime
>>> df = pd.DataFrame(dict(birthday=["29/01/2024"], city=["London"]))
>>> df
    birthday    city
0  29/01/2024  London
>>> df.dtypes
birthday    object
city        object
dtype: object
>>> ToDatetime().fit_transform(df["birthday"])
0   2024-01-29
Name: birthday, dtype: datetime64[...]
>>> ToDatetime().fit_transform(df["city"])
Traceback (most recent call last):
    ...
skrub._apply_to_cols.RejectColumn: Could not find a datetime format for column 'city'.

It is possible to change how rejected columns are handled through the allow_reject parameter. By default, no special handling is performed and rejections are considered to be errors:

>>> to_datetime = ApplyToCols(ToDatetime())
>>> to_datetime.fit_transform(df)
Traceback (most recent call last):
    ...
ValueError: Transformer ToDatetime.fit_transform failed on column 'city'. See above for the full traceback.

However, setting allow_reject=True gives the transformer itself some control over which columns it should be applied to. For example, whether a string column contains dates is only known once we try to parse them. Therefore it might be sensible to try to parse all string columns but allow the transformer to reject those that, upon inspection, do not contain dates.

>>> to_datetime = ApplyToCols(ToDatetime(), allow_reject=True)
>>> transformed = to_datetime.fit_transform(df)
>>> transformed
    birthday    city
0 2024-01-29  London

Here, the column ‘city’ was rejected without being treated as an error: it was passed through unchanged, and only ‘birthday’ was converted to a datetime column.

>>> transformed.dtypes
birthday    datetime64[...]
city                object
dtype: object

ApplyToFrame, in contrast, is used when the transformer expects several columns together as a single input, e.g., to perform dimensionality reduction:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.eye(4) * np.logspace(0, 3, 4), columns=list("abcd"))
>>> df
     a     b      c       d
0  1.0   0.0    0.0     0.0
1  0.0  10.0    0.0     0.0
2  0.0   0.0  100.0     0.0
3  0.0   0.0    0.0  1000.0
>>> from sklearn.decomposition import PCA
>>> from skrub import ApplyToFrame

As with the other transformers described here, it is possible to limit the transformation to a subset of columns:

>>> pca = ApplyToFrame(PCA(n_components=2), cols=["a", "b"])
>>> pca.fit_transform(df).round(2)
       c       d  pca0  pca1
0    0.0     0.0 -2.52  0.67
1    0.0     0.0  7.50  0.00
2  100.0     0.0 -2.49 -0.33
3    0.0  1000.0 -2.49 -0.33

By default, ApplyToCols and ApplyToFrame rename the transformed columns and remove the original features from the data. The output names can be controlled by providing a formatting string to the rename_columns parameter, in which {} is replaced by the column name:

>>> from sklearn.preprocessing import StandardScaler
>>> df = pd.DataFrame(dict(A=[-10., 10.], B=[0., 100.]))
>>> scaler = ApplyToCols(StandardScaler(), rename_columns='{}_scaled')
>>> scaler.fit_transform(df)
    A_scaled  B_scaled
0      -1.0      -1.0
1       1.0       1.0

With keep_original=True, the original columns are kept in the transformed dataframe alongside the transformed ones. The rename_columns parameter can then be used to avoid name collisions:

>>> scaler = ApplyToCols(
...     StandardScaler(), keep_original=True, rename_columns="{}_scaled"
... )
>>> scaler.fit_transform(df)
        A  A_scaled      B  B_scaled
0 -10.0      -1.0    0.0      -1.0
1  10.0       1.0  100.0       1.0