ApplyToCols
- class skrub.ApplyToCols(transformer, cols=all(), allow_reject=False, keep_original=False, rename_columns='{}', n_jobs=None)
Map a transformer to columns in a dataframe.
A separate clone of the transformer is applied to each column. Moreover, if allow_reject is True and the transformer's fit_transform raises a RejectColumn exception for a particular column, that column is passed through unchanged. If allow_reject is False, RejectColumn exceptions are propagated, like other errors raised by the transformer.
Note
The transform and fit_transform methods of transformer must return a column, a list of columns or a dataframe of the same module (polars or pandas) as the input, either by default or by supporting the scikit-learn set_output API.
- Parameters:
- transformer : scikit-learn Transformer
The transformer to map to the selected columns. For each column in cols, a clone of the transformer is created, then fit_transform is called on a single-column dataframe. If the transformer has a __single_column_transformer__ attribute, fit_transform is passed the column directly (a pandas or polars Series) rather than a DataFrame. fit_transform must return either a DataFrame, a Series, or a list of Series. fit_transform can raise RejectColumn to indicate that this transformer does not apply to this column; for example, the ToDatetime transformer will raise RejectColumn for numerical columns. In this case, if allow_reject is True, the column will appear unchanged in the output.
- cols : str, sequence of str, or skrub selector, optional
The columns to attempt to transform. Columns outside of this selection will be passed through unchanged, without attempting to call fit_transform on them. The default is to attempt transforming all columns.
- allow_reject : bool, default=False
Whether the transformer is allowed to reject a column by raising a RejectColumn exception. If True, rejected columns will be passed through unchanged by ApplyToCols and will not appear in attributes such as transformers_, used_inputs_, etc. If False, column rejections are considered as errors and RejectColumn exceptions are propagated.
- keep_original : bool, default=False
If True, the original columns are preserved in the output. If the transformer produces a column with the same name, the transformation result is renamed so that both columns can appear in the output. If False, when the transformer accepts a column, only the transformer's output is included in the result, not the original column. In all cases rejected columns (or columns not selected by cols) are passed through.
- rename_columns : str, default='{}'
Format string applied to all transformation output column names. For example, pass 'transformed_{}' to prepend 'transformed_' to all output column names. The default value does not modify the names. Renaming is not applied to columns not selected by cols.
- n_jobs : int, default=None
Number of jobs to run in parallel. None means 1 unless in a joblib parallel_backend context. -1 means using all processors.
- Attributes:
- all_inputs_ : list of str
All column names in the input dataframe.
- used_inputs_ : list of str
The names of columns that were transformed.
- all_outputs_ : list of str
All column names in the output dataframe.
- created_outputs_ : list of str
The names of columns in the output dataframe that were created by one of the fitted transformers.
- input_to_outputs_ : dict
Maps the name of each column that was transformed to the list of the resulting columns' names in the output.
- output_to_input_ : dict
Maps the name of each column in the transformed output to the name of the input column from which it was derived.
- transformers_ : dict
Maps the name of each column that was transformed to the corresponding fitted transformer.
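The relationship between the mapping attributes can be sketched in plain Python. The dictionary below is written by hand for illustration (it is not produced by running skrub), and the variable names only mirror the attribute names above. Whether passed-through columns also appear in output_to_input_ is not stated here, so the sketch covers created columns only.

```python
# Hypothetical contents of input_to_outputs_ after fitting with
# cols=["A", "B"] and rename_columns="{}_scaled" (values written by hand).
input_to_outputs = {"A": ["A_scaled"], "B": ["B_scaled"]}

# Names of the transformed input columns, in the role of used_inputs_.
used_inputs = list(input_to_outputs)

# Flattened output names, in the role of created_outputs_.
created_outputs = [out for outs in input_to_outputs.values() for out in outs]

# Reverse lookup, in the role of output_to_input_ (created columns only).
output_to_input = {
    out: inp for inp, outs in input_to_outputs.items() for out in outs
}
```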
Examples
>>> import pandas as pd
>>> from skrub import ApplyToCols
>>> from sklearn.preprocessing import StandardScaler
>>> df = pd.DataFrame(dict(A=[-10., 10.], B=[-10., 0.], C=[0., 10.]))
>>> df
      A     B     C
0 -10.0 -10.0   0.0
1  10.0   0.0  10.0
Fit a StandardScaler to each column in df:
>>> scaler = ApplyToCols(StandardScaler())
>>> scaler.fit_transform(df)
     A    B    C
0 -1.0 -1.0 -1.0
1  1.0  1.0  1.0
>>> scaler.transformers_
{'A': StandardScaler(), 'B': StandardScaler(), 'C': StandardScaler()}
We can restrict the columns on which the transformation is applied:
>>> scaler = ApplyToCols(StandardScaler(), cols=["A", "B"])
>>> scaler.fit_transform(df)
     A    B     C
0 -1.0 -1.0   0.0
1  1.0  1.0  10.0
We see that the scaling has not been applied to “C”, which also does not appear in the transformers_:
>>> scaler.transformers_
{'A': StandardScaler(), 'B': StandardScaler()}
>>> scaler.used_inputs_
['A', 'B']
Rejected columns
The transformer can raise RejectColumn to indicate it cannot handle a given column.
>>> from skrub._to_datetime import ToDatetime
>>> df = pd.DataFrame(dict(birthday=["29/01/2024"], city=["London"]))
>>> df
     birthday    city
0  29/01/2024  London
>>> df.dtypes
birthday    object
city        object
dtype: object
>>> ToDatetime().fit_transform(df["birthday"])
0   2024-01-29
Name: birthday, dtype: datetime64[...]
>>> ToDatetime().fit_transform(df["city"])
Traceback (most recent call last):
    ...
skrub._apply_to_cols.RejectColumn: Could not find a datetime format for column 'city'.
How these rejections are handled depends on the allow_reject parameter. By default, no special handling is performed and rejections are considered to be errors:
>>> to_datetime = ApplyToCols(ToDatetime())
>>> to_datetime.fit_transform(df)
Traceback (most recent call last):
    ...
ValueError: Transformer ToDatetime.fit_transform failed on column 'city'. See above for the full traceback.
However, setting allow_reject=True gives the transformer itself some control over which columns it should be applied to. For example, whether a string column contains dates is only known once we try to parse them. Therefore it might be sensible to try to parse all string columns but allow the transformer to reject those that, upon inspection, do not contain dates.
>>> to_datetime = ApplyToCols(ToDatetime(), allow_reject=True)
>>> transformed = to_datetime.fit_transform(df)
>>> transformed
    birthday    city
0 2024-01-29  London
Now the column ‘city’ was rejected but this was not treated as an error; ‘city’ was passed through unchanged and only ‘birthday’ was converted to a datetime column.
>>> transformed.dtypes
birthday    datetime64[...]
city                 object
dtype: object
>>> to_datetime.transformers_
{'birthday': ToDatetime()}
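The control flow described above can be sketched without skrub, using a dict of lists in place of a dataframe. This is only an illustration of the allow_reject logic, not skrub's implementation: the RejectColumn class, parse_day_first, and apply_to_cols below are hypothetical stand-ins defined here.

```python
from datetime import datetime


class RejectColumn(Exception):
    """Stand-in for skrub's RejectColumn (hypothetical, for illustration)."""


def parse_day_first(values):
    """Toy single-column transformer: accepts only 'dd/mm/yyyy' strings."""
    try:
        return [datetime.strptime(v, "%d/%m/%Y").date() for v in values]
    except (TypeError, ValueError) as exc:
        raise RejectColumn("could not find a datetime format") from exc


def apply_to_cols(table, func, allow_reject=False):
    """Apply func to every column; pass rejected columns through if allowed."""
    result, transformed = {}, []
    for name, values in table.items():
        try:
            result[name] = func(values)
            transformed.append(name)
        except RejectColumn as exc:
            if not allow_reject:
                # Rejections are treated like any other transformer error.
                raise ValueError(f"transformer failed on column {name!r}") from exc
            result[name] = values  # rejected column is passed through unchanged
    return result, transformed


table = {"birthday": ["29/01/2024"], "city": ["London"]}
result, transformed = apply_to_cols(table, parse_day_first, allow_reject=True)
```

Here transformed plays the role of the keys of transformers_: only 'birthday' is listed, while 'city' is returned untouched. Calling apply_to_cols(table, parse_day_first) with the default allow_reject=False raises instead.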
Renaming outputs & keeping the original columns
The rename_columns parameter allows renaming output columns.
>>> df = pd.DataFrame(dict(A=[-10., 10.], B=[0., 100.]))
>>> scaler = ApplyToCols(StandardScaler(), rename_columns='{}_scaled')
>>> scaler.fit_transform(df)
   A_scaled  B_scaled
0      -1.0      -1.0
1       1.0       1.0
The renaming is only applied to columns selected by cols (and not rejected by the transformer when allow_reject is True).
>>> scaler = ApplyToCols(StandardScaler(), cols=['A'], rename_columns='{}_scaled')
>>> scaler.fit_transform(df)
   A_scaled      B
0      -1.0    0.0
1       1.0  100.0
rename_columns can be particularly useful when keep_original is True. When a column is transformed, we can tell ApplyToCols to retain the original, untransformed column in the output. If the transformer produces a column with the same name, the transformation result is renamed to avoid a name clash.
>>> scaler = ApplyToCols(StandardScaler(), keep_original=True)
>>> scaler.fit_transform(df)
      A  A__skrub_89725c56__      B  B__skrub_81cc7d00__
0 -10.0                 -1.0    0.0                 -1.0
1  10.0                  1.0  100.0                  1.0
In this case we may want to set a more sensible name for the transformer’s output:
>>> scaler = ApplyToCols(
...     StandardScaler(), keep_original=True, rename_columns="{}_scaled"
... )
>>> scaler.fit_transform(df)
      A  A_scaled      B  B_scaled
0 -10.0      -1.0    0.0      -1.0
1  10.0       1.0  100.0       1.0
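How cols, rename_columns and keep_original combine to determine the output column names can be sketched in plain Python. The function below is a hypothetical helper, not part of skrub, and it assumes a transformer that returns one column named like its input (as StandardScaler does in the examples above). The clash suffix is a placeholder; skrub generates its own, as in 'A__skrub_89725c56__'.

```python
def output_columns(all_inputs, selected, rename_columns="{}", keep_original=False):
    """Predict output column names for a one-column-in, one-column-out transformer."""
    out = []
    for name in all_inputs:
        if name not in selected:
            out.append(name)  # not selected: passed through, never renamed
            continue
        new_name = rename_columns.format(name)
        if keep_original:
            out.append(name)  # original column stays in the output
            if new_name == name:
                # Placeholder for the clash-avoiding suffix skrub would add.
                new_name = f"{name}__renamed__"
        out.append(new_name)
    return out


kept = output_columns(["A", "B"], ["A", "B"], "{}_scaled", keep_original=True)
partial = output_columns(["A", "B"], ["A"], "{}_scaled")
```

With these inputs, kept reproduces the column order of the last example (original first, then its scaled version), and partial leaves 'B' unrenamed because it is not selected.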
Methods
- fit(X[, y]): Fit the transformer on each column independently.
- fit_transform(X[, y]): Fit the transformer on each column independently and transform X.
- Get output feature names for transformation.
- get_params([deep]): Get parameters for this estimator.
- set_output(*[, transform]): Set output container.
- set_params(**params): Set the parameters of this estimator.
- transform(X): Transform a dataframe.
- fit(X, y=None)
Fit the transformer on each column independently.
- Parameters:
- X : Pandas or Polars DataFrame
The data to transform.
- y : Pandas or Polars Series or DataFrame, default=None
The target data.
- Returns:
- ApplyToCols
The transformer itself.
- fit_transform(X, y=None)
Fit the transformer on each column independently and transform X.
- Parameters:
- X : Pandas or Polars DataFrame
The data to transform.
- y : Pandas or Polars Series or DataFrame, default=None
The target data.
- Returns:
- result : Pandas or Polars DataFrame
The transformed data.
- set_output(*, transform=None)
Set output container.
See Introducing the set_output API for an example on how to use the API.
- Parameters:
- transform : {"default", "pandas", "polars"}, default=None
Configure output of transform and fit_transform.
- "default": Default output format of a transformer
- "pandas": DataFrame output
- "polars": Polars output
- None: Transform configuration is unchanged
Added in version 1.4: "polars" option was added.
- Returns:
- self : estimator instance
Estimator instance.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
- Parameters:
- **params : dict
Estimator parameters.
- Returns:
- self : estimator instance
Estimator instance.
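The <component>__<parameter> form can be illustrated with a scikit-learn Pipeline. This is a minimal sketch assuming scikit-learn is installed; the step name "scaler" is chosen for the example.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scaler", StandardScaler())])

# Update a parameter of the nested "scaler" step through the pipeline.
pipe.set_params(scaler__with_mean=False)

with_mean = pipe.get_params()["scaler__with_mean"]
```

The same double-underscore convention applies to any estimator that holds another estimator as a parameter.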