.. currentmodule:: skrub

.. |ApplyToCols| replace:: :class:`ApplyToCols`

.. |ApplyToFrame| replace:: :class:`ApplyToFrame`

.. |SelectCols| replace:: :class:`SelectCols`

.. |DropCols| replace:: :class:`DropCols`

.. _user_guide_multiple_columns:

Operating over multiple columns at once
=======================================

Very often, and for various reasons, transformers must be applied to multiple
columns at the same time. For example, all numeric columns in a dataframe may
need to be scaled at once. While the heuristics used by the
:class:`TableVectorizer` are usually good enough to apply the proper
transformers to the different datatypes, using it may not be an option in all
cases.

In scikit-learn pipelines, this column selection is done with the
:class:`sklearn.compose.ColumnTransformer`:

>>> import pandas as pd
>>> from sklearn.compose import make_column_selector as selector
>>> from sklearn.compose import make_column_transformer
>>> from sklearn.preprocessing import StandardScaler, OneHotEncoder
>>>
>>> df = pd.DataFrame({"text": ["foo", "bar", "baz"], "number": [1, 2, 3]})
>>>
>>> categorical_columns = selector(dtype_include=object)(df)
>>> numerical_columns = selector(dtype_exclude=object)(df)
>>>
>>> ct = make_column_transformer(
...     (StandardScaler(),
...      numerical_columns),
...     (OneHotEncoder(handle_unknown="ignore"),
...      categorical_columns))
>>> transformed = ct.fit_transform(df)
>>> transformed
array([[-1.22474487,  0.        ,  0.        ,  1.        ],
       [ 0.        ,  1.        ,  0.        ,  0.        ],
       [ 1.22474487,  0.        ,  1.        ,  0.        ]])

Skrub provides alternative transformers that can achieve the same results:

- |ApplyToCols| maps a transformer to columns in a dataframe, so that all
  columns that satisfy a certain condition are transformed, while the others
  are left untouched.
- |ApplyToFrame| applies a transformer to a collection of columns *at once*.
  This is different from |ApplyToCols|, which instead transforms each column
  one at a time.
- |SelectCols| allows specifying which columns should be kept.
- |DropCols| allows specifying which columns should be discarded. A short
  example of both is given at the end of this page.

|SelectCols| and |DropCols| can be combined with the skrub DataOps to perform
complex tasks such as feature selection: refer to
:ref:`user_guide_data_ops_feature_selection` for more details.

All multi-column transformers provided by skrub can take skrub selectors as
parameters, giving finer control over which columns are transformed. Skrub
selectors are discussed at length in :ref:`user_guide_selectors`.

Applying transformations to the columns with |ApplyToCols| and |ApplyToFrame|
------------------------------------------------------------------------------

|ApplyToCols| can be used to transform a subset of columns in a dataframe,
while leaving the remaining columns unchanged.
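For instance, as a minimal sketch (reusing the ``df`` defined above and passing
an explicit list of column names to the ``cols`` parameter), we can scale only
the ``number`` column while leaving ``text`` untouched:

>>> from skrub import ApplyToCols
>>> ApplyToCols(StandardScaler(), cols=["number"]).fit_transform(df)
  text    number
0  foo -1.224745
1  bar  0.000000
2  baz  1.224745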
The ``ColumnTransformer`` example above can be rewritten with |ApplyToCols| as
follows:

>>> import skrub.selectors as s
>>> from sklearn.pipeline import make_pipeline
>>> from skrub import ApplyToCols
>>>
>>> numeric = ApplyToCols(StandardScaler(), cols=s.numeric())
>>> string = ApplyToCols(OneHotEncoder(sparse_output=False), cols=s.string())
>>>
>>> transformed = make_pipeline(numeric, string).fit_transform(df)
>>> transformed
   text_bar  text_baz  text_foo    number
0       0.0       0.0       1.0 -1.224745
1       1.0       0.0       0.0  0.000000
2       0.0       1.0       0.0  1.224745

The transformer applied by |ApplyToCols| can raise a ``RejectColumn`` exception
if it cannot handle a specific column:

>>> from skrub._to_datetime import ToDatetime
>>> df = pd.DataFrame(dict(birthday=["29/01/2024"], city=["London"]))
>>> df
     birthday    city
0  29/01/2024  London
>>> df.dtypes
birthday    object
city        object
dtype: object

>>> ToDatetime().fit_transform(df["birthday"])
0   2024-01-29
Name: birthday, dtype: datetime64[...]

>>> ToDatetime().fit_transform(df["city"])
Traceback (most recent call last):
    ...
skrub._apply_to_cols.RejectColumn: Could not find a datetime format for column 'city'.

It is possible to change how rejected columns are handled through the
``allow_reject`` parameter. By default, no special handling is performed and
rejections are considered to be errors:

>>> to_datetime = ApplyToCols(ToDatetime())
>>> to_datetime.fit_transform(df)
Traceback (most recent call last):
    ...
ValueError: Transformer ToDatetime.fit_transform failed on column 'city'. See above for the full traceback.

However, setting ``allow_reject=True`` gives the transformer itself some
control over which columns it should be applied to. For example, whether a
string column contains dates is only known once we try to parse them. It may
therefore be sensible to try to parse all string columns, but allow the
transformer to reject those that, upon inspection, do not contain dates.

>>> to_datetime = ApplyToCols(ToDatetime(), allow_reject=True)
>>> transformed = to_datetime.fit_transform(df)
>>> transformed
    birthday    city
0 2024-01-29  London

Here, the column 'city' was rejected without being treated as an error: it
was passed through unchanged and only ``birthday`` was converted to a datetime
column.

>>> transformed.dtypes
birthday    datetime64[...]
city                 object
dtype: object

|ApplyToFrame| is instead used in cases where multiple columns should be
transformed at once. This is needed when the transformer expects multiple
columns at the same time, e.g., to perform dimensionality reduction:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.eye(4) * np.logspace(0, 3, 4), columns=list("abcd"))
>>> df
     a     b      c       d
0  1.0   0.0    0.0     0.0
1  0.0  10.0    0.0     0.0
2  0.0   0.0  100.0     0.0
3  0.0   0.0    0.0  1000.0

>>> from sklearn.decomposition import PCA
>>> from skrub import ApplyToFrame

Like with the other transformers described here, it is possible to limit the
transformation to a subset of columns:

>>> pca = ApplyToFrame(PCA(n_components=2), cols=["a", "b"])
>>> pca.fit_transform(df).round(2)
       c       d  pca0  pca1
0    0.0     0.0 -2.52  0.67
1    0.0     0.0  7.50  0.00
2  100.0     0.0 -2.49 -0.33
3    0.0  1000.0 -2.49 -0.33

By default, |ApplyToCols| and |ApplyToFrame| remove the original columns from
the data and replace them with the transformer's output, whose names may
differ from the original ones (as with the PCA example above).
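For instance, with the default settings the scaled values simply take the
place of the original columns (a minimal sketch, using the same small
dataframe as in the examples below):

>>> df = pd.DataFrame(dict(A=[-10., 10.], B=[0., 100.]))
>>> ApplyToCols(StandardScaler()).fit_transform(df)
     A    B
0 -1.0 -1.0
1  1.0  1.0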
It is possible to rename the output columns by providing a formatting string
to the ``rename_columns`` parameter:

>>> from sklearn.preprocessing import StandardScaler
>>> df = pd.DataFrame(dict(A=[-10., 10.], B=[0., 100.]))
>>> scaler = ApplyToCols(StandardScaler(), rename_columns='{}_scaled')
>>> scaler.fit_transform(df)
   A_scaled  B_scaled
0      -1.0      -1.0
1       1.0       1.0

By setting ``keep_original=True``, the original columns are not dropped from
the transformed dataframe. The ``rename_columns`` parameter can then be used
to avoid name collisions:

>>> scaler = ApplyToCols(
...     StandardScaler(), keep_original=True, rename_columns="{}_scaled"
... )
>>> scaler.fit_transform(df)
      A  A_scaled      B  B_scaled
0 -10.0      -1.0    0.0      -1.0
1  10.0       1.0  100.0       1.0
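Keeping or discarding columns with |SelectCols| and |DropCols|
---------------------------------------------------------------

Finally, here is the short example of |SelectCols| and |DropCols| announced at
the top of this page. This is a minimal sketch: both transformers accept a
list of column names (or, like the other transformers above, a skrub
selector); |SelectCols| keeps only the listed columns while |DropCols|
discards them:

>>> from skrub import SelectCols, DropCols
>>> df = pd.DataFrame(dict(A=[1, 2], B=[10., 20.], C=["x", "y"]))
>>> SelectCols(["A", "C"]).fit_transform(df)
   A  C
0  1  x
1  2  y
>>> DropCols(["A"]).fit_transform(df)
      B  C
0  10.0  x
1  20.0  y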