ApplyToCols

class skrub.ApplyToCols(transformer, cols=all(), allow_reject=False, keep_original=False, rename_columns='{}', n_jobs=None)

Map a transformer to columns in a dataframe.

A separate clone of the transformer is fitted and applied to each column. If allow_reject is True and the transformer's fit_transform raises a RejectColumn exception for a particular column, that column is passed through unchanged. If allow_reject is False, RejectColumn exceptions are propagated like any other error raised by the transformer.

Note

The transform and fit_transform methods of transformer must return a column, a list of columns or a dataframe of the same module (polars or pandas) as the input, either by default or by supporting the scikit-learn set_output API.
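The per-column mapping and the rejection rule described above can be sketched in plain Python. This is only an illustration of the semantics, not skrub's implementation: the RejectColumn class, the ParseInts transformer and the apply_to_cols helper below are hypothetical stand-ins, and a dict of lists plays the role of the dataframe.

```python
import copy


class RejectColumn(ValueError):
    """Illustrative stand-in for skrub's RejectColumn exception."""


class ParseInts:
    """Hypothetical column transformer: parse strings as ints, or reject."""

    def fit_transform(self, values):
        try:
            return [int(v) for v in values]
        except ValueError as exc:
            raise RejectColumn("column does not contain integers") from exc


def apply_to_cols(transformer, table, allow_reject=False):
    """Apply an independent copy of `transformer` to each column of `table`."""
    output, fitted = {}, {}
    for name, values in table.items():
        clone = copy.deepcopy(transformer)  # one independent copy per column
        try:
            output[name] = clone.fit_transform(values)
        except RejectColumn:
            if not allow_reject:
                raise  # rejections are errors by default
            output[name] = values  # rejected column passes through unchanged
        else:
            fitted[name] = clone  # only accepted columns keep a transformer
    return output, fitted


table = {"age": ["31", "47"], "city": ["London", "Paris"]}
result, fitted = apply_to_cols(ParseInts(), table, allow_reject=True)
print(result)        # {'age': [31, 47], 'city': ['London', 'Paris']}
print(list(fitted))  # ['age']
```

With allow_reject=False the same call raises on the 'city' column instead of passing it through.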

Parameters:
transformer : scikit-learn Transformer

The transformer to map to the selected columns. For each column in cols, a clone of the transformer is created, then fit_transform is called on a single-column dataframe. If the transformer has a __single_column_transformer__ attribute, fit_transform is passed the column directly (a pandas or polars Series) rather than a DataFrame. fit_transform must return a DataFrame, a Series, or a list of Series. fit_transform can raise RejectColumn to indicate that the transformer does not apply to a column; for example, the ToDatetime transformer raises RejectColumn for numerical columns. In this case, provided allow_reject is True, the column will appear unchanged in the output.

cols : str, sequence of str, or skrub selector, optional

The columns to attempt to transform. Columns outside of this selection will be passed through unchanged, without attempting to call fit_transform on them. The default is to attempt transforming all columns.

allow_reject : bool, default=False

Whether the transformer is allowed to reject a column by raising a RejectColumn exception. If True, rejected columns will be passed through unchanged by ApplyToCols and will not appear in attributes such as transformers_, used_inputs_, etc. If False, column rejections are considered as errors and RejectColumn exceptions are propagated.

keep_original : bool, default=False

If True, the original columns are preserved in the output. If the transformer produces a column with the same name, the transformation result is renamed so that both columns can appear in the output. If False, when the transformer accepts a column, only the transformer’s output is included in the result, not the original column. In all cases rejected columns (or columns not selected by cols) are passed through.

rename_columns : str, default='{}'

Format string applied to all transformation output column names. For example pass 'transformed_{}' to prepend 'transformed_' to all output column names. The default value does not modify the names. Renaming is not applied to columns not selected by cols.

n_jobs : int, default=None

Number of jobs to run in parallel. None means 1 unless in a joblib parallel_backend context. -1 means using all processors.
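As a rough illustration of how rename_columns interacts with cols, the naming rule can be mimicked with Python's str.format. This assumes the format string receives the column name as its single positional argument, which is what the '{}' default and the 'transformed_{}' example suggest; the variable names below are made up for the sketch.

```python
template = "{}_scaled"    # a value one might pass as rename_columns
selected = ["A"]          # columns selected by cols
all_columns = ["A", "B"]  # columns of the input dataframe

# Only selected (and transformed) columns are renamed; others keep their name.
output_names = [
    template.format(name) if name in selected else name
    for name in all_columns
]
print(output_names)  # ['A_scaled', 'B']

# The default '{}' leaves names untouched.
print("{}".format("A"))  # A
```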

Attributes:
all_inputs_ : list of str

All column names in the input dataframe.

used_inputs_ : list of str

The names of columns that were transformed.

all_outputs_ : list of str

All column names in the output dataframe.

created_outputs_ : list of str

The names of columns in the output dataframe that were created by one of the fitted transformers.

input_to_outputs_ : dict

Maps the name of each column that was transformed to the list of the resulting columns’ names in the output.

output_to_input_ : dict

Maps the name of each column in the transformed output to the name of the input column from which it was derived.

transformers_ : dict

Maps the name of each column that was transformed to the corresponding fitted transformer.

Examples

>>> import pandas as pd
>>> from skrub import ApplyToCols
>>> from sklearn.preprocessing import StandardScaler
>>> df = pd.DataFrame(dict(A=[-10., 10.], B=[-10., 0.], C=[0., 10.]))
>>> df
      A     B     C
0 -10.0 -10.0   0.0
1  10.0   0.0  10.0

Fit a StandardScaler to each column in df:

>>> scaler = ApplyToCols(StandardScaler())
>>> scaler.fit_transform(df)
     A    B    C
0 -1.0 -1.0 -1.0
1  1.0  1.0  1.0
>>> scaler.transformers_
{'A': StandardScaler(), 'B': StandardScaler(), 'C': StandardScaler()}

We can restrict the columns on which the transformation is applied:

>>> scaler = ApplyToCols(StandardScaler(), cols=["A", "B"])
>>> scaler.fit_transform(df)
     A    B     C
0 -1.0 -1.0   0.0
1  1.0  1.0  10.0

We see that the scaling has not been applied to “C”, which also does not appear in transformers_:

>>> scaler.transformers_
{'A': StandardScaler(), 'B': StandardScaler()}
>>> scaler.used_inputs_
['A', 'B']

Rejected columns

The transformer can raise RejectColumn to indicate it cannot handle a given column.

>>> from skrub._to_datetime import ToDatetime
>>> df = pd.DataFrame(dict(birthday=["29/01/2024"], city=["London"]))
>>> df
     birthday    city
0  29/01/2024  London
>>> df.dtypes
birthday    object
city        object
dtype: object
>>> ToDatetime().fit_transform(df["birthday"])
0   2024-01-29
Name: birthday, dtype: datetime64[...]
>>> ToDatetime().fit_transform(df["city"])
Traceback (most recent call last):
    ...
skrub._apply_to_cols.RejectColumn: Could not find a datetime format for column 'city'.

How these rejections are handled depends on the allow_reject parameter. By default, no special handling is performed and rejections are considered to be errors:

>>> to_datetime = ApplyToCols(ToDatetime())
>>> to_datetime.fit_transform(df)
Traceback (most recent call last):
    ...
ValueError: Transformer ToDatetime.fit_transform failed on column 'city'. See above for the full traceback.

However, setting allow_reject=True gives the transformer itself some control over which columns it should be applied to. For example, whether a string column contains dates is only known once we try to parse them. Therefore it might be sensible to try to parse all string columns but allow the transformer to reject those that, upon inspection, do not contain dates.

>>> to_datetime = ApplyToCols(ToDatetime(), allow_reject=True)
>>> transformed = to_datetime.fit_transform(df)
>>> transformed
    birthday    city
0 2024-01-29  London

Now the column ‘city’ was rejected but this was not treated as an error; ‘city’ was passed through unchanged and only ‘birthday’ was converted to a datetime column.

>>> transformed.dtypes
birthday    datetime64[...]
city                object
dtype: object
>>> to_datetime.transformers_
{'birthday': ToDatetime()}

Renaming outputs & keeping the original columns

The rename_columns parameter allows renaming output columns.

>>> df = pd.DataFrame(dict(A=[-10., 10.], B=[0., 100.]))
>>> scaler = ApplyToCols(StandardScaler(), rename_columns='{}_scaled')
>>> scaler.fit_transform(df)
   A_scaled  B_scaled
0      -1.0      -1.0
1       1.0       1.0

The renaming is only applied to columns selected by cols (and not rejected by the transformer when allow_reject is True).

>>> scaler = ApplyToCols(StandardScaler(), cols=['A'], rename_columns='{}_scaled')
>>> scaler.fit_transform(df)
   A_scaled      B
0      -1.0    0.0
1       1.0  100.0

rename_columns can be particularly useful when keep_original is True. When a column is transformed, we can tell ApplyToCols to retain the original, untransformed column in the output. If the transformer produces a column with the same name, the transformation result is renamed to avoid a name clash.

>>> scaler = ApplyToCols(StandardScaler(), keep_original=True)
>>> scaler.fit_transform(df)
      A  A__skrub_89725c56__      B  B__skrub_81cc7d00__
0 -10.0                 -1.0    0.0                 -1.0
1  10.0                  1.0  100.0                  1.0

In this case we may want to set a more sensible name for the transformer’s output:

>>> scaler = ApplyToCols(
...     StandardScaler(), keep_original=True, rename_columns="{}_scaled"
... )
>>> scaler.fit_transform(df)
      A  A_scaled      B  B_scaled
0 -10.0      -1.0    0.0      -1.0
1  10.0       1.0  100.0       1.0

Methods

fit(X[, y])

Fit the transformer on each column independently.

fit_transform(X[, y])

Fit the transformer on each column independently and transform X.

get_feature_names_out()

Get output feature names for transformation.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Transform a dataframe.

fit(X, y=None)

Fit the transformer on each column independently.

Parameters:
X : Pandas or Polars DataFrame

The data to transform.

y : Pandas or Polars Series or DataFrame, default=None

The target data.

Returns:
ApplyToCols

The transformer itself.

fit_transform(X, y=None)

Fit the transformer on each column independently and transform X.

Parameters:
X : Pandas or Polars DataFrame

The data to transform.

y : Pandas or Polars Series or DataFrame, default=None

The target data.

Returns:
result : Pandas or Polars DataFrame

The transformed data.

get_feature_names_out()

Get output feature names for transformation.

Returns:
feature_names_out : ndarray of str objects

Transformed feature names.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

set_output(*, transform=None)

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:
transform : {“default”, “pandas”, “polars”}, default=None

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • “polars”: Polars output

  • None: Transform configuration is unchanged

Added in version 1.4: “polars” option was added.

Returns:
self : estimator instance

Estimator instance.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.

transform(X)

Transform a dataframe.

Parameters:
X : Pandas or Polars DataFrame

The data to transform.

Returns:
result : Pandas or Polars DataFrame

The transformed data.