Operating over multiple columns at once#
Transformers often need to be applied to several columns at the same time. For
example, all numeric columns in a dataframe may need to be scaled.
While the heuristics used by the TableVectorizer are usually good enough
to apply the proper transformers to different datatypes, using it may not be an
option in all cases. In scikit-learn pipelines, column selection
is done with the sklearn.compose.ColumnTransformer:
>>> import pandas as pd
>>> from sklearn.compose import make_column_selector as selector
>>> from sklearn.compose import make_column_transformer
>>> from sklearn.preprocessing import StandardScaler, OneHotEncoder
>>>
>>> df = pd.DataFrame({"text": ["foo", "bar", "baz"], "number": [1, 2, 3]})
>>>
>>> categorical_columns = selector(dtype_include=object)(df)
>>> numerical_columns = selector(dtype_exclude=object)(df)
>>>
>>> ct = make_column_transformer(
...     (StandardScaler(),
...      numerical_columns),
...     (OneHotEncoder(handle_unknown="ignore"),
...      categorical_columns))
>>> transformed = ct.fit_transform(df)
>>> transformed
array([[-1.22474487,  0.        ,  0.        ,  1.        ],
       [ 0.        ,  1.        ,  0.        ,  0.        ],
       [ 1.22474487,  0.        ,  1.        ,  0.        ]])
Skrub provides alternative transformers that can achieve the same results:
- ApplyToCols maps a transformer to columns in a dataframe, so that all columns that satisfy a certain condition are transformed, while the others are left untouched.
- ApplyToFrame applies a transformer to a collection of columns at once. This is different from ApplyToCols, which instead transforms each column one at a time.
- SelectCols allows specifying which columns should be kept.
- DropCols allows specifying the columns we want to discard.
SelectCols and DropCols can be combined with the skrub DataOps to perform complex tasks
such as feature selection: refer to Feature selection with skrub SelectCols and DropCols
for more details.
All multi-column transformers provided by skrub can take skrub selectors as parameters to have more control over the columns that are being transformed. Skrub selectors are discussed at length in Skrub Selectors: helpers for selecting columns in a dataframe.
Applying transformations to the columns with ApplyToCols and ApplyToFrame#
ApplyToCols
can be used to transform a subset of columns in a dataframe, while
leaving the remaining columns unchanged. It simplifies operations such as the
example above, which can be rewritten with ApplyToCols
as follows:
>>> import skrub.selectors as s
>>> from sklearn.pipeline import make_pipeline
>>> from skrub import ApplyToCols
>>>
>>> numeric = ApplyToCols(StandardScaler(), cols=s.numeric())
>>> string = ApplyToCols(OneHotEncoder(sparse_output=False), cols=s.string())
>>>
>>> transformed = make_pipeline(numeric, string).fit_transform(df)
>>> transformed
text_bar text_baz text_foo number
0 0.0 0.0 1.0 -1.224745
1 1.0 0.0 0.0 0.000000
2 0.0 1.0 0.0 1.224745
A transformer wrapped by ApplyToCols can raise a RejectColumn
exception if it cannot handle a specific
column:
>>> from skrub import ToDatetime
>>> df = pd.DataFrame(dict(birthday=["29/01/2024"], city=["London"]))
>>> df
birthday city
0 29/01/2024 London
>>> df.dtypes
birthday object
city object
dtype: object
>>> ToDatetime().fit_transform(df["birthday"])
0 2024-01-29
Name: birthday, dtype: datetime64[...]
>>> ToDatetime().fit_transform(df["city"])
Traceback (most recent call last):
...
skrub._apply_to_cols.RejectColumn: Could not find a datetime format for column 'city'.
It is possible to change how rejected columns are handled through the allow_reject
parameter.
By default, no special handling is performed and rejections are considered
to be errors:
>>> to_datetime = ApplyToCols(ToDatetime())
>>> to_datetime.fit_transform(df)
Traceback (most recent call last):
...
ValueError: Transformer ToDatetime.fit_transform failed on column 'city'. See above for the full traceback.
However, setting allow_reject=True
gives the transformer itself some
control over which columns it should be applied to. For example, whether a
string column contains dates is only known once we try to parse them.
Therefore it might be sensible to try to parse all string columns but allow
the transformer to reject those that, upon inspection, do not contain dates.
>>> to_datetime = ApplyToCols(ToDatetime(), allow_reject=True)
>>> transformed = to_datetime.fit_transform(df)
>>> transformed
birthday city
0 2024-01-29 London
Here, the city column was rejected without being treated as an error: it
was passed through unchanged and only birthday was converted to a
datetime column.
>>> transformed.dtypes
birthday datetime64[...]
city object
dtype: object
ApplyToFrame is instead used when the transformer needs to see several columns
at once, e.g., to perform dimensionality reduction:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.eye(4) * np.logspace(0, 3, 4), columns=list("abcd"))
>>> df
a b c d
0 1.0 0.0 0.0 0.0
1 0.0 10.0 0.0 0.0
2 0.0 0.0 100.0 0.0
3 0.0 0.0 0.0 1000.0
>>> from sklearn.decomposition import PCA
>>> from skrub import ApplyToFrame
As with the other transformers described here, the transformation can be limited to a subset of columns:
>>> pca = ApplyToFrame(PCA(n_components=2), cols=["a", "b"])
>>> pca.fit_transform(df).round(2)
c d pca0 pca1
0 0.0 0.0 -2.52 0.67
1 0.0 0.0 7.50 0.00
2 100.0 0.0 -2.49 -0.33
3 0.0 1000.0 -2.49 -0.33
By default, ApplyToCols and ApplyToFrame rename the transformed columns and
remove the original features from the data. The new names can be customized
by providing a formatting string to the rename_columns
parameter, in which {} is replaced by the column name:
>>> from sklearn.preprocessing import StandardScaler
>>> df = pd.DataFrame(dict(A=[-10., 10.], B=[0., 100.]))
>>> scaler = ApplyToCols(StandardScaler(), rename_columns='{}_scaled')
>>> scaler.fit_transform(df)
A_scaled B_scaled
0 -1.0 -1.0
1 1.0 1.0
By setting keep_original=True, the original columns are kept in the
transformed dataframe. The rename_columns parameter can then be used to avoid
name collisions:
>>> scaler = ApplyToCols(
...     StandardScaler(), keep_original=True, rename_columns="{}_scaled"
... )
>>> scaler.fit_transform(df)
A A_scaled B B_scaled
0 -10.0 -1.0 0.0 -1.0
1 10.0 1.0 100.0 1.0