skrub.Expr.skb.apply

Expr.skb.apply(estimator, *, y=None, cols=all(), exclude_cols=None, how='auto', allow_reject=False, unsupervised=False)

Apply a scikit-learn estimator to a dataframe or numpy array.

Parameters:
estimator : scikit-learn estimator

The transformer or predictor to apply.

y : dataframe, column or numpy array, optional

The prediction targets when estimator is a supervised estimator.

cols : str, list of strings or skrub selector, optional

The columns to transform, when estimator is a transformer.

exclude_cols : str, list of strings or skrub selector, optional

When estimator is a transformer, columns to which it should _not_ be applied. The columns that are matched by cols AND not matched by exclude_cols are transformed.

how : “auto”, “columnwise”, “subframe” or “full_frame”, optional

The mode in which the estimator is applied. In the vast majority of cases the default “auto” is appropriate. “columnwise” means a separate clone of the transformer is applied to each column. “subframe” means the transformer is applied to a subset of the columns, passed as a single dataframe. “full_frame” means the whole input dataframe is passed directly to the provided estimator.

allow_reject : bool, optional

Whether the transformer can refuse to transform columns to which it does not apply, in which case they are passed through unchanged. This can be useful to avoid specifying exactly which columns should be transformed. For example, if we apply skrub.ToDatetime() to all columns with allow_reject=True, string columns that can be parsed as dates are converted and all other columns are passed through. With allow_reject=False (the default), an error is raised if the dataframe contains columns to which ToDatetime does not apply (e.g. a column of numbers).

unsupervised : bool, optional

Use this to indicate that y is required for scoring but not fitting, as is the case for clustering algorithms. If y is not required at all (for example when applying an unsupervised transformer, or when we are not interested in scoring with ground-truth labels), simply leave the default y=None and there is no need to pass a value for unsupervised.
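
The pass-through behaviour described for allow_reject can be illustrated with a small plain-Python sketch. It does not use skrub and is not how skrub implements the feature: the exception ColumnRejected and the helpers parse_dates and apply_to_columns are hypothetical names introduced only for this illustration, and a dict of lists stands in for a dataframe.

```python
from datetime import datetime


class ColumnRejected(Exception):
    """Hypothetical signal that a transformer does not apply to a column."""


def parse_dates(values):
    """Toy stand-in for a date parser: accepts only 'YYYY-MM-DD' strings."""
    try:
        return [datetime.strptime(v, "%Y-%m-%d") for v in values]
    except (TypeError, ValueError) as exc:
        raise ColumnRejected("column cannot be parsed as dates") from exc


def apply_to_columns(func, table, allow_reject=False):
    """Apply func to every column; optionally pass rejected columns through."""
    result = {}
    for name, values in table.items():
        try:
            result[name] = func(values)
        except ColumnRejected:
            if not allow_reject:
                raise  # analogous to allow_reject=False: the error propagates
            result[name] = values  # analogous to allow_reject=True: unchanged
    return result


# A dict of lists stands in for a dataframe in this sketch.
table = {
    "product": ["pen", "cup"],
    "quantity": [2, 3],
    "date": ["2020-04-03", "2020-04-04"],
}
converted = apply_to_columns(parse_dates, table, allow_reject=True)
```

Here only the "date" column is converted; "product" and "quantity" come back untouched. Calling apply_to_columns(parse_dates, table) with the default allow_reject=False raises ColumnRejected on the first column that cannot be parsed.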

Returns:
result

The transformed dataframe when estimator is a transformer, and the fitted estimator’s predictions if it is a supervised predictor.

See also

skrub.Expr.skb.get_pipeline

Get a skrub pipeline for this expression.

Examples

>>> import skrub
>>> x = skrub.X(skrub.toy_orders().X)
>>> x
<Var 'X'>
Result:
―――――――
   ID product  quantity        date
0   1     pen         2  2020-04-03
1   2     cup         3  2020-04-04
2   3     cup         5  2020-04-04
3   4   spoon         1  2020-04-05
>>> datetime_encoder = skrub.DatetimeEncoder(add_total_seconds=False)
>>> x.skb.apply(skrub.TableVectorizer(datetime=datetime_encoder))
<Apply TableVectorizer>
Result:
―――――――
    ID  product_cup  product_pen  ...  date_year  date_month  date_day
0  1.0          0.0          1.0  ...     2020.0         4.0       3.0
1  2.0          1.0          0.0  ...     2020.0         4.0       4.0
2  3.0          1.0          0.0  ...     2020.0         4.0       4.0
3  4.0          0.0          0.0  ...     2020.0         4.0       5.0

Transform only the 'product' column:

>>> x.skb.apply(skrub.StringEncoder(n_components=2), cols='product')
<Apply StringEncoder>
Result:
―――――――
   ID     product_0     product_1  quantity        date
0   1 -2.560113e-16  1.000000e+00         2  2020-04-03
1   2  1.000000e+00  7.447602e-17         3  2020-04-04
2   3  1.000000e+00  7.447602e-17         5  2020-04-04
3   4 -3.955170e-16 -8.326673e-17         1  2020-04-05

Transform all but the 'ID' and 'quantity' columns:

>>> x.skb.apply(
...     skrub.StringEncoder(n_components=2), exclude_cols=["ID", "quantity"]
... )
<Apply StringEncoder>
Result:
―――――――
   ID     product_0     product_1  quantity    date_0    date_1
0   1  9.775252e-08  7.830415e-01         2  0.766318 -0.406667
1   2  9.999999e-01  0.000000e+00         3  0.943929  0.330148
2   3  9.999998e-01 -1.490116e-08         5  0.943929  0.330149
3   4  9.910963e-08 -6.219692e-01         1  0.766318 -0.406668
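
The rule that a column is transformed when it is matched by cols AND not matched by exclude_cols can be sketched in plain Python. This is only an illustration of the rule for column names given as strings or lists of strings; select_columns is a hypothetical helper, not part of skrub, and it does not handle skrub selectors.

```python
def select_columns(all_columns, cols=None, exclude_cols=None):
    """Return the columns matched by cols and not matched by exclude_cols.

    cols=None stands for "all columns"; a single string is treated as one name.
    """

    def as_list(spec):
        return [spec] if isinstance(spec, str) else list(spec)

    wanted = list(all_columns) if cols is None else as_list(cols)
    excluded = [] if exclude_cols is None else as_list(exclude_cols)
    # Keep the original column order of the table.
    return [c for c in all_columns if c in wanted and c not in excluded]


columns = ["ID", "product", "quantity", "date"]
only_product = select_columns(columns, cols="product")
all_but_two = select_columns(columns, exclude_cols=["ID", "quantity"])
both = select_columns(columns, cols=["ID", "product"], exclude_cols="ID")
```

With the toy table's columns, only_product is ["product"], all_but_two is ["product", "date"], and both is ["product"], since "ID" is matched by cols but removed by exclude_cols.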

More complex selection of the columns to transform, here all numeric columns except the 'ID':

>>> from sklearn.preprocessing import StandardScaler
>>> from skrub import selectors as s
>>> x.skb.apply(StandardScaler(), cols=s.numeric() - "ID")
<Apply StandardScaler>
Result:
―――――――
   ID product        date  quantity
0   1     pen  2020-04-03 -0.507093
1   2     cup  2020-04-04  0.169031
2   3     cup  2020-04-04  1.521278
3   4   spoon  2020-04-05 -1.183216
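
To make the "columnwise" mode of the how parameter more concrete, the following plain-Python sketch fits a separate copy of a toy transformer on each selected column. It is an illustration under simplifying assumptions, not skrub's implementation: MeanCenterer and apply_columnwise are hypothetical names, copy.deepcopy stands in for cloning the estimator, and a dict of lists stands in for a dataframe.

```python
import copy


class MeanCenterer:
    """Toy transformer: subtracts the mean learned during fit."""

    def fit(self, values):
        self.mean_ = sum(values) / len(values)
        return self

    def transform(self, values):
        return [v - self.mean_ for v in values]


def apply_columnwise(transformer, table, cols):
    """Fit an independent copy of the transformer on each selected column."""
    result = dict(table)  # columns that are not selected pass through
    fitted = {}
    for name in cols:
        clone = copy.deepcopy(transformer)  # stand-in for cloning an estimator
        result[name] = clone.fit(table[name]).transform(table[name])
        fitted[name] = clone
    return result, fitted


table = {"ID": [1, 2, 3, 4], "quantity": [2, 3, 5, 1]}
template = MeanCenterer()
result, fitted = apply_columnwise(template, table, ["ID", "quantity"])
```

Each copy keeps its own fitted state (fitted["ID"].mean_ is 2.5, fitted["quantity"].mean_ is 2.75), and the template passed in is never fitted itself. In the "subframe" mode, by contrast, a single transformer would receive all selected columns together.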

For supervised estimators, pass the targets as the argument for y:

>>> from sklearn.dummy import DummyClassifier
>>> y = skrub.y(skrub.toy_orders().y)
>>> y
<Var 'y'>
Result:
―――――――
0    False
1    False
2     True
3    False
Name: delayed, dtype: bool
>>> x.skb.apply(skrub.TableVectorizer()).skb.apply(DummyClassifier(), y=y)
<Apply DummyClassifier>
Result:
―――――――
   delayed
0    False
1    False
2    False
3    False

Sometimes we want to pass a value for y because it is required for scoring and cross-validation, even though it is not needed for fitting the estimator. In this case, pass unsupervised=True:

>>> from sklearn.datasets import make_blobs
>>> from sklearn.cluster import KMeans
>>> X, y = make_blobs(n_samples=10, random_state=0)
>>> e = skrub.X(X).skb.apply(
...     KMeans(n_clusters=2, n_init=1, random_state=0),
...     y=skrub.y(y),
...     unsupervised=True,
... )
>>> e.skb.cross_validate()["test_score"]
array([-19.43734833, -12.46393769, -11.80428789, -37.23883226,
        -4.85785541])
>>> pipeline = e.skb.get_pipeline().fit({"X": X})
>>> pipeline.predict({"X": X})
array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0], dtype=int32)