ApplyToFrame

class skrub.ApplyToFrame(transformer, cols=all(), keep_original=False, rename_columns='{}')

Apply a transformer to part of a dataframe.

A subset of the dataframe is selected and passed to the transformer (as a single input). This is different from ApplyToCols, which fits a separate clone of the transformer to each selected column independently. All columns not listed in cols remain unmodified in the output.

Note

The transform and fit_transform methods of transformer must return dataframes of the same type (polars or pandas) as the input, either by default or by supporting the scikit-learn set_output API.

Parameters:

transformer (scikit-learn Transformer): The transformer to apply to the selected columns. fit_transform and transform must return a DataFrame. The resulting dataframe will appear as the last columns of the output dataframe. Unselected columns will appear unchanged in the output.
cols (str, sequence of str, or skrub selector, optional): The columns to attempt to transform. Only the selected columns will have the transformer applied. Columns outside of this selection are passed through unchanged (fit_transform is not called on them) and remain unmodified in the output. The default is to transform all columns.
keep_original (bool, default=False): If True, the original columns are preserved in the output. If the transformer produces a column with the same name, the transformation result is renamed so that both columns can appear in the output. If False, only the transformer’s output is included in the result, not the original columns. In all cases columns not selected by cols are passed through.
rename_columns (str, default='{}'): Format string applied to all transformation output column names. For example, pass 'transformed_{}' to prepend 'transformed_' to all output column names. The default value does not modify the names. Renaming is not applied to columns not selected by cols.
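The effect of rename_columns can be pictured with plain Python string formatting. This is a minimal sketch that assumes each output name is substituted into the `{}` placeholder in the same way as `str.format`; the list `created` is made-up illustration data.

```python
# Hypothetical output column names produced by a transformer.
created = ["pca0", "pca1"]

# A template with a prefix prepends it to every name.
template = "transformed_{}"
renamed = [template.format(name) for name in created]
print(renamed)

# The default template '{}' leaves the names as they are.
unchanged = ["{}".format(name) for name in created]
print(unchanged)
```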

Attributes:

all_inputs_ (list of str): All column names in the input dataframe.
used_inputs_ (list of str): The names of columns that were transformed.
all_outputs_ (list of str): All column names in the output dataframe.
created_outputs_ (list of str): The names of columns in the output dataframe that were created by the fitted transformer.
transformer_ (Transformer): The fitted transformer.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.eye(4) * np.logspace(0, 3, 4), columns=list("abcd"))
>>> df
     a     b      c       d
0  1.0   0.0    0.0     0.0
1  0.0  10.0    0.0     0.0
2  0.0   0.0  100.0     0.0
3  0.0   0.0    0.0  1000.0
>>> from sklearn.decomposition import PCA
>>> from skrub import ApplyToFrame
>>> ApplyToFrame(PCA(n_components=2)).fit_transform(df).round(2)
     pca0   pca1
0 -249.01 -33.18
1 -249.04 -33.68
2 -252.37  66.64
3  750.42   0.22

We can restrict the transformer to a subset of columns:

>>> pca = ApplyToFrame(PCA(n_components=2), cols=["a", "b"])
>>> pca.fit_transform(df).round(2)
       c       d  pca0  pca1
0    0.0     0.0 -2.52  0.67
1    0.0     0.0  7.50  0.00
2  100.0     0.0 -2.49 -0.33
3    0.0  1000.0 -2.49 -0.33
>>> pca.used_inputs_
['a', 'b']
>>> pca.created_outputs_
['pca0', 'pca1']
>>> pca.transformer_
PCA(n_components=2)

It is possible to rename the output columns:

>>> pca = ApplyToFrame(
...     PCA(n_components=2), cols=["a", "b"], rename_columns='my_tag-{}'
... )
>>> pca.fit_transform(df).round(2)
       c       d  my_tag-pca0  my_tag-pca1
0    0.0     0.0        -2.52         0.67
1    0.0     0.0         7.50         0.00
2  100.0     0.0        -2.49        -0.33
3    0.0  1000.0        -2.49        -0.33

We can also force preserving the original columns in the output:

>>> pca = ApplyToFrame(PCA(n_components=2), cols=["a", "b"], keep_original=True)
>>> pca.fit_transform(df).round(2)
     a     b      c       d  pca0  pca1
0  1.0   0.0    0.0     0.0 -2.52  0.67
1  0.0  10.0    0.0     0.0  7.50  0.00
2  0.0   0.0  100.0     0.0 -2.49 -0.33
3  0.0   0.0    0.0  1000.0 -2.49 -0.33

Methods

`fit`(X[, y])	Fit the transformer on all columns jointly.
`fit_transform`(X[, y])	Fit the transformer on all columns jointly and transform X.
`get_feature_names_out`()	Get output feature names for transformation.
`get_params`([deep])	Get parameters for this estimator.
`set_output`(*[, transform])	Set output container.
`set_params`(**params)	Set the parameters of this estimator.
`transform`(X, **kwargs)	Transform a dataframe.

fit(X, y=None, **kwargs)

Fit the transformer on all columns jointly.

Parameters:

X (Pandas or Polars DataFrame): The data to transform.
y (Pandas or Polars Series or DataFrame, default=None): The target data.
**kwargs: Extra named arguments are passed to the fit_transform() method of self.transformer.

Returns:

ApplyToFrame: The transformer itself.

fit_transform(X, y=None, **kwargs)

Fit the transformer on all columns jointly and transform X.

Parameters:

X (Pandas or Polars DataFrame): The data to transform.
y (Pandas or Polars Series or DataFrame, default=None): The target data.
**kwargs: Extra named arguments are passed to the fit_transform() method of self.transformer.

Returns:

result (Pandas or Polars DataFrame): The transformed data.

get_feature_names_out()

Get output feature names for transformation.

Returns:

feature_names_out (ndarray of str objects): Transformed feature names.

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True): If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params (dict): Parameter names mapped to their values.

set_output(*, transform=None)
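The effect of `deep` can be seen on any scikit-learn estimator that contains another estimator. The sketch below uses a scikit-learn Pipeline as a stand-in; the step name `"pca"` is an arbitrary choice for illustration.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

pipe = Pipeline([("pca", PCA(n_components=2))])

# deep=False lists only the parameters of the outer estimator.
shallow = pipe.get_params(deep=False)

# deep=True also lists the parameters of contained estimators,
# using the <component>__<parameter> naming scheme.
deep = pipe.get_params(deep=True)

nested_only = sorted(set(deep) - set(shallow))
print(nested_only)
```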

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform ({"default", "pandas", "polars"}, default=None): Configure output of transform and fit_transform.

"default": Default output format of a transformer
"pandas": DataFrame output
"polars": Polars output
None: Transform configuration is unchanged

Added in version 1.4: "polars" option was added.

Returns:

self (estimator instance): Estimator instance.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict): Estimator parameters.

Returns:

self (estimator instance): Estimator instance.

transform(X, **kwargs)
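The `<component>__<parameter>` form can be tried on a scikit-learn Pipeline, the nested object mentioned above. This is a minimal sketch; the step name `"pca"` is chosen for illustration only.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

pipe = Pipeline([("pca", PCA(n_components=2))])

# Update a parameter of the nested PCA step through the outer estimator.
pipe.set_params(pca__n_components=1)

updated = pipe.get_params()["pca__n_components"]
print(updated)
```

For an estimator that stores a wrapped transformer under a constructor parameter named `transformer`, the analogous key would presumably start with `transformer__`, though that is an inference from the scikit-learn convention rather than something stated here.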

Transform a dataframe.

Parameters:

X (Pandas or Polars DataFrame): The dataframe to transform.
**kwargs: Extra named arguments are passed to the transform() method of self.transformer_.

Returns:

result (Pandas or Polars DataFrame): The transformed data.

Gallery examples

Hands-On with Column Selection and Transformers
