ApplyToFrame

class skrub.ApplyToFrame(transformer, cols=all(), keep_original=False, rename_columns='{}')

Apply a transformer to part of a dataframe.

A subset of the dataframe is selected and passed to the transformer (as a single input). This is different from ApplyToCols, which fits a separate clone of the transformer to each selected column independently.
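The difference between the two strategies can be illustrated without skrub. The sketch below mimics them with plain scikit-learn and pandas; it is not skrub's implementation, and the helper names `fit_jointly` and `fit_per_column` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.decomposition import PCA

df = pd.DataFrame(np.eye(4) * np.logspace(0, 3, 4), columns=list("abcd"))


def fit_jointly(transformer, frame, cols):
    # ApplyToFrame-like strategy (hypothetical helper): a single transformer
    # sees all selected columns at once.
    return clone(transformer).fit(frame[cols])


def fit_per_column(transformer, frame, cols):
    # ApplyToCols-like strategy (hypothetical helper): an independent clone
    # is fitted on each selected column.
    return {col: clone(transformer).fit(frame[[col]]) for col in cols}


joint = fit_jointly(PCA(n_components=1), df, ["a", "b"])
per_column = fit_per_column(PCA(n_components=1), df, ["a", "b"])

# The joint transformer was fitted on 2 features, each clone on only 1.
print(joint.n_features_in_)
print({col: est.n_features_in_ for col, est in per_column.items()})
```

A transformer that combines information across columns, such as PCA, only makes sense with the joint strategy.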

Note

The transform and fit_transform methods of transformer must return dataframes of the same type (polars or pandas) as the input, either by default or by supporting the scikit-learn set_output API.

Parameters:
transformer : scikit-learn Transformer

The transformer to apply to the selected columns. fit_transform and transform must return a DataFrame. The resulting dataframe will appear as the last columns of the output dataframe. Unselected columns will appear unchanged in the output.

cols : str, sequence of str, or skrub selector, optional

The columns to attempt to transform. Columns outside of this selection will be passed through unchanged, without calling fit_transform on them. The default is to transform all columns.

keep_original : bool, default=False

If True, the original columns are preserved in the output. If the transformer produces a column with the same name, the transformation result is renamed so that both columns can appear in the output. If False, only the transformer’s output is included in the result, not the original columns. In all cases columns not selected by cols are passed through.

rename_columns : str, default='{}'

Format string applied to all transformation output column names. For example, pass 'transformed_{}' to prepend 'transformed_' to all output column names. The default value does not modify the names. Renaming is not applied to columns not selected by cols.
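The '{}' placeholder suggests str.format-style substitution. The plain-Python sketch below shows the effect of a few format strings under that assumption; it does not call skrub, and the column names are taken from the PCA examples on this page.

```python
# Assumption: each output column name is substituted for "{}" in the
# format string, as with str.format.
output_names = ["pca0", "pca1"]

for rename_columns in ("{}", "transformed_{}", "my_tag-{}"):
    renamed = [rename_columns.format(name) for name in output_names]
    print(rename_columns, "->", renamed)
```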

Attributes:
all_inputs_ : list of str

All column names in the input dataframe.

used_inputs_ : list of str

The names of columns that were transformed.

all_outputs_ : list of str

All column names in the output dataframe.

created_outputs_ : list of str

The names of columns in the output dataframe that were created by the fitted transformer.

transformer_ : Transformer

The fitted transformer.
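The relationship between these column lists can be sketched in plain Python. The helper `expected_outputs` is hypothetical and only mirrors the column layout shown in the Examples section (passed-through columns first, created columns last); it ignores the renaming that applies when a created column collides with an original one.

```python
def expected_outputs(all_inputs, used_inputs, created_outputs, keep_original=False):
    """Hypothetical helper: predict the output column order."""
    if keep_original:
        # Original columns are kept, selected or not.
        passed_through = list(all_inputs)
    else:
        # Only the columns that were not selected are passed through.
        passed_through = [c for c in all_inputs if c not in used_inputs]
    # The transformer's output is appended as the last columns.
    return passed_through + list(created_outputs)


all_inputs = ["a", "b", "c", "d"]
used_inputs = ["a", "b"]
created_outputs = ["pca0", "pca1"]

print(expected_outputs(all_inputs, used_inputs, created_outputs))
print(expected_outputs(all_inputs, used_inputs, created_outputs, keep_original=True))
```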

Examples

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.eye(4) * np.logspace(0, 3, 4), columns=list("abcd"))
>>> df
     a     b      c       d
0  1.0   0.0    0.0     0.0
1  0.0  10.0    0.0     0.0
2  0.0   0.0  100.0     0.0
3  0.0   0.0    0.0  1000.0
>>> from sklearn.decomposition import PCA
>>> from skrub import ApplyToFrame
>>> ApplyToFrame(PCA(n_components=2)).fit_transform(df).round(2)
     pca0   pca1
0 -249.01 -33.18
1 -249.04 -33.68
2 -252.37  66.64
3  750.42   0.22

We can restrict the transformer to a subset of columns:

>>> pca = ApplyToFrame(PCA(n_components=2), cols=["a", "b"])
>>> pca.fit_transform(df).round(2)
       c       d  pca0  pca1
0    0.0     0.0 -2.52  0.67
1    0.0     0.0  7.50  0.00
2  100.0     0.0 -2.49 -0.33
3    0.0  1000.0 -2.49 -0.33
>>> pca.used_inputs_
['a', 'b']
>>> pca.created_outputs_
['pca0', 'pca1']
>>> pca.transformer_
PCA(n_components=2)

It is possible to rename the output columns:

>>> pca = ApplyToFrame(
...     PCA(n_components=2), cols=["a", "b"], rename_columns='my_tag-{}'
... )
>>> pca.fit_transform(df).round(2)
       c       d  my_tag-pca0  my_tag-pca1
0    0.0     0.0        -2.52         0.67
1    0.0     0.0         7.50         0.00
2  100.0     0.0        -2.49        -0.33
3    0.0  1000.0        -2.49        -0.33

We can also force preserving the original columns in the output:

>>> pca = ApplyToFrame(PCA(n_components=2), cols=["a", "b"], keep_original=True)
>>> pca.fit_transform(df).round(2)
     a     b      c       d  pca0  pca1
0  1.0   0.0    0.0     0.0 -2.52  0.67
1  0.0  10.0    0.0     0.0  7.50  0.00
2  0.0   0.0  100.0     0.0 -2.49 -0.33
3  0.0   0.0    0.0  1000.0 -2.49 -0.33

Methods

fit(X[, y])

Fit the transformer on the selected columns jointly.

fit_transform(X[, y])

Fit the transformer on the selected columns jointly and transform X.

get_feature_names_out()

Get output feature names for transformation.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Transform a dataframe.

fit(X, y=None)

Fit the transformer on the selected columns jointly.

Parameters:
X : Pandas or Polars DataFrame

The data to transform.

y : Pandas or Polars Series or DataFrame, default=None

The target data.

Returns:
ApplyToFrame

The transformer itself.

fit_transform(X, y=None)

Fit the transformer on the selected columns jointly and transform X.

Parameters:
X : Pandas or Polars DataFrame

The data to transform.

y : Pandas or Polars Series or DataFrame, default=None

The target data.

Returns:
result : Pandas or Polars DataFrame

The transformed data.

get_feature_names_out()

Get output feature names for transformation.

Returns:
feature_names_out : ndarray of str objects

Transformed feature names.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

set_output(*, transform=None)

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:
transform : {"default", "pandas", "polars"}, default=None

Configure output of transform and fit_transform.

  • "default": Default output format of a transformer

  • "pandas": DataFrame output

  • "polars": Polars output

  • None: Transform configuration is unchanged

Added in version 1.4: "polars" option was added.

Returns:
self : estimator instance

Estimator instance.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.
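The <component>__<parameter> convention can be sketched with a scikit-learn Pipeline, the nested object mentioned above. The step names "scale" and "pca" are arbitrary choices for this illustration.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()), ("pca", PCA(n_components=2))])

# Nested parameters are addressed as <component>__<parameter>.
pipe.set_params(pca__n_components=1, scale__with_mean=False)

params = pipe.get_params(deep=True)
print(params["pca__n_components"])
print(params["scale__with_mean"])
```

Since transformer is a constructor parameter of ApplyToFrame, the wrapped estimator's parameters should be reachable in the same way, for example as transformer__n_components when wrapping a PCA; this follows from the general scikit-learn convention rather than from anything stated on this page.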

transform(X)

Transform a dataframe.

Parameters:
X : Pandas or Polars DataFrame

The data to transform.

Returns:
result : Pandas or Polars DataFrame

The transformed data.