skrub.DataOp.skb.apply

DataOp.skb.apply(estimator, *, y=None, cols=all(), exclude_cols=None, how='auto', allow_reject=False, unsupervised=False)

Apply a scikit-learn estimator to a dataframe or numpy array.

Parameters:
estimator : scikit-learn estimator

The transformer or predictor to apply.

y : dataframe, column or numpy array, optional

The prediction targets when estimator is a supervised estimator.

cols : str, list of strings or skrub selector, optional

The columns to transform, when estimator is a transformer.

exclude_cols : str, list of strings or skrub selector, optional

When estimator is a transformer, columns to which it should _not_ be applied. The columns that are matched by cols AND not matched by exclude_cols are transformed.

how : “auto”, “cols”, “frame” or “no_wrap”, optional

How the estimator is applied. In most cases the default “auto” is appropriate.

  • “cols” means estimator is wrapped in an ApplyToCols transformer, which fits a separate clone of estimator to each column in cols. estimator must be a transformer (have a fit_transform method).

  • “frame” means estimator is wrapped in an ApplyToFrame transformer, which fits a single clone of estimator to the selected part of the input dataframe. estimator must be a transformer.

  • “no_wrap” means no wrapping: estimator is applied directly to the unmodified input.

  • “auto” chooses the wrapping depending on the input and estimator. If the input is not a dataframe or the estimator is not a transformer, the “no_wrap” strategy is chosen. Otherwise, if the estimator has a __single_column_transformer__ attribute, “cols” is chosen. Otherwise “frame” is chosen.
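
The decision rules for “auto” can be sketched in plain Python. This is only an illustration of the rules as documented above, not skrub's actual implementation: resolve_auto and the three toy classes are hypothetical names, and the check for a fit_transform method is assumed from the description of “cols”.

```python
def resolve_auto(input_is_dataframe, estimator):
    """Return the strategy the documented "auto" rules would pick.

    Hypothetical helper written from the documentation text only.
    """
    # Assumption: "is a transformer" means "has a fit_transform method".
    is_transformer = hasattr(estimator, "fit_transform")
    if not input_is_dataframe or not is_transformer:
        return "no_wrap"
    if hasattr(estimator, "__single_column_transformer__"):
        return "cols"
    return "frame"


class FrameTransformer:
    """Toy transformer that operates on a whole dataframe."""

    def fit_transform(self, X, y=None):
        return X


class PerColumnTransformer:
    """Toy transformer flagged as operating on a single column."""

    __single_column_transformer__ = True

    def fit_transform(self, column, y=None):
        return column


class Predictor:
    """Toy predictor: it has fit and predict but no fit_transform."""

    def fit(self, X, y=None):
        return self

    def predict(self, X):
        return [0 for _ in X]


choices = {
    "frame transformer on dataframe": resolve_auto(True, FrameTransformer()),
    "column transformer on dataframe": resolve_auto(True, PerColumnTransformer()),
    "predictor on dataframe": resolve_auto(True, Predictor()),
    "transformer on numpy array": resolve_auto(False, FrameTransformer()),
}
for case, strategy in choices.items():
    print(f"{case}: {strategy}")
```

Passing an explicit value for how skips this selection and uses the named strategy.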

allow_reject : bool, optional

Whether the transformer can refuse to transform columns to which it does not apply, in which case they are passed through unchanged. This can be useful to avoid specifying exactly which columns should be transformed. For example, if we apply skrub.ToDatetime() to all columns with allow_reject=True, string columns that can be parsed as dates will be converted and all other columns will be passed through. If we use allow_reject=False (the default), an error is raised if the dataframe contains columns to which ToDatetime does not apply (e.g., a column of numbers).

unsupervised : bool, optional

Use this to indicate that y is required for scoring but not fitting, as is the case for clustering algorithms. If y is not required at all (for example when applying an unsupervised transformer, or when we are not interested in scoring with ground-truth labels), simply leave the default y=None and there is no need to pass a value for unsupervised.

Returns:
result

The transformed dataframe when estimator is a transformer, and the fitted estimator’s predictions if it is a supervised predictor.

See also

skrub.DataOp.skb.make_learner

Get a skrub learner for this DataOp.

skrub.ApplyToCols

Transformer that applies a given transformer separately to each selected column.

skrub.ApplyToFrame

Transformer that applies a given transformer to part of a dataframe.

Examples

>>> import skrub
>>> data = skrub.datasets.toy_orders()
>>> x = skrub.X(data.X)
>>> x
<Var 'X'>
Result:
―――――――
   ID product  quantity        date
0   1     pen         2  2020-04-03
1   2     cup         3  2020-04-04
2   3     cup         5  2020-04-04
3   4   spoon         1  2020-04-05
>>> datetime_encoder = skrub.DatetimeEncoder(add_total_seconds=False)
>>> x.skb.apply(skrub.TableVectorizer(datetime=datetime_encoder))
<Apply TableVectorizer>
Result:
―――――――
    ID  product_cup  product_pen  ...  date_year  date_month  date_day
0  1.0          0.0          1.0  ...     2020.0         4.0       3.0
1  2.0          1.0          0.0  ...     2020.0         4.0       4.0
2  3.0          1.0          0.0  ...     2020.0         4.0       4.0
3  4.0          0.0          0.0  ...     2020.0         4.0       5.0

Transform only the 'product' column:

>>> x.skb.apply(skrub.StringEncoder(n_components=2), cols='product')
<Apply StringEncoder>
Result:
―――――――
   ID     product_0     product_1  quantity        date
0   1 -2.560113e-16  1.000000e+00         2  2020-04-03
1   2  1.000000e+00  7.447602e-17         3  2020-04-04
2   3  1.000000e+00  7.447602e-17         5  2020-04-04
3   4 -3.955170e-16 -8.326673e-17         1  2020-04-05

Transform all but the 'ID' and 'quantity' columns:

>>> x.skb.apply(
...     skrub.StringEncoder(n_components=2), exclude_cols=["ID", "quantity"]
... )
<Apply StringEncoder>
Result:
―――――――
   ID     product_0     product_1  quantity    date_0    date_1
0   1  9.775252e-08  7.830415e-01         2  0.766318 -0.406667
1   2  9.999999e-01  0.000000e+00         3  0.943929  0.330148
2   3  9.999998e-01 -1.490116e-08         5  0.943929  0.330149
3   4  9.910963e-08 -6.219692e-01         1  0.766318 -0.406668
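
How cols and exclude_cols combine can be sketched with plain lists of column names. The helper columns_to_transform below is hypothetical and only mirrors the documented rule (matched by cols AND not matched by exclude_cols). It does not handle a single column name given as a string or skrub selectors, which skrub resolves itself.

```python
def columns_to_transform(all_columns, cols=None, exclude_cols=None):
    """Return the columns matched by cols and not matched by exclude_cols.

    Hypothetical helper for illustration; it accepts lists of names only.
    """
    # cols=None stands in for the default selector, which matches all columns.
    if cols is None:
        matched = list(all_columns)
    else:
        matched = [c for c in all_columns if c in cols]
    excluded = set() if exclude_cols is None else set(exclude_cols)
    # Keep the original column order.
    return [c for c in matched if c not in excluded]


all_columns = ["ID", "product", "quantity", "date"]

only_product = columns_to_transform(all_columns, cols=["product"])
without_numbers = columns_to_transform(all_columns, exclude_cols=["ID", "quantity"])
combined = columns_to_transform(
    all_columns, cols=["ID", "product", "date"], exclude_cols=["ID"]
)

print(only_product)     # ['product']
print(without_numbers)  # ['product', 'date']
print(combined)         # ['product', 'date']
```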

Selectors allow more complex choices of the columns to transform, here all numeric columns except 'ID':

>>> from sklearn.preprocessing import StandardScaler
>>> from skrub import selectors as s
>>> x.skb.apply(StandardScaler(), cols=s.numeric() - "ID")
<Apply StandardScaler>
Result:
―――――――
   ID product        date  quantity
0   1     pen  2020-04-03 -0.507093
1   2     cup  2020-04-04  0.169031
2   3     cup  2020-04-04  1.521278
3   4   spoon  2020-04-05 -1.183216

For supervised estimators, pass the targets as the argument for y:

>>> from sklearn.dummy import DummyClassifier
>>> y = skrub.y(data.y)
>>> y
<Var 'y'>
Result:
―――――――
0    False
1    False
2     True
3    False
Name: delayed, dtype: bool
>>> x.skb.apply(skrub.TableVectorizer()).skb.apply(DummyClassifier(), y=y)
<Apply DummyClassifier>
Result:
―――――――
0    False
1    False
2    False
3    False
Name: delayed, dtype: bool

Sometimes we want to pass a value for y because it is required for scoring and cross-validation, even though it is not needed for fitting the estimator. In this case, pass unsupervised=True.

>>> from sklearn.datasets import make_blobs
>>> from sklearn.cluster import KMeans
>>> X, y = make_blobs(n_samples=10, random_state=0)
>>> e = skrub.X(X).skb.apply(
...     KMeans(n_clusters=2, n_init=1, random_state=0),
...     y=skrub.y(y),
...     unsupervised=True,
... )
>>> e.skb.cross_validate()["test_score"]
0   -19.437348
1   -12.463938
2   -11.804288
3   -37.238832
4    -4.857855
Name: test_score, dtype: float64
>>> learner = e.skb.make_learner().fit({"X": X})
>>> learner.predict({"X": X})
array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0], dtype=int32)