skrub.Expr.skb.apply
- Expr.skb.apply(estimator, *, y=None, cols=all(), exclude_cols=None, how='auto', allow_reject=False, unsupervised=False)
Apply a scikit-learn estimator to a dataframe or numpy array.
- Parameters:
  - estimator : scikit-learn estimator
    The transformer or predictor to apply.
  - y : dataframe, column or numpy array, optional
    The prediction targets when estimator is a supervised estimator.
  - cols : str, list of strings or skrub selector, optional
    The columns to transform, when estimator is a transformer.
  - exclude_cols : str, list of strings or skrub selector, optional
    When estimator is a transformer, columns to which it should not be applied. The columns that are matched by cols AND not matched by exclude_cols are transformed.
  - how : “auto”, “columnwise”, “subframe” or “full_frame”, optional
    The mode in which it is applied. In the vast majority of cases the default “auto” is appropriate. “columnwise” means a separate clone of the transformer is applied to each column. “subframe” means it is applied to a subset of the columns, passed as a single dataframe. “full_frame” means the whole input dataframe is passed directly to the provided estimator.
  - allow_reject : bool, optional
    Whether the transformer can refuse to transform columns for which it does not apply, in which case they are passed through unchanged. This can be useful to avoid specifying exactly which columns should be transformed. For example, if we apply skrub.ToDatetime() to all columns with allow_reject=True, string columns that can be parsed as dates will be converted and all other columns will be passed through. If we use allow_reject=False (the default), an error is raised if the dataframe contains columns for which ToDatetime does not apply (e.g. a column of numbers).
  - unsupervised : bool, optional
    Use this to indicate that y is required for scoring but not fitting, as is the case for clustering algorithms. If y is not required at all (for example when applying an unsupervised transformer, or when we are not interested in scoring with ground-truth labels), simply leave the default y=None and there is no need to pass a value for unsupervised.
- Returns:
  - result
    The transformed dataframe when estimator is a transformer, and the fitted estimator’s predictions if it is a supervised predictor.
See also
skrub.Expr.skb.get_pipeline
Get a skrub pipeline for this expression.
Examples
>>> import skrub
>>> x = skrub.X(skrub.toy_orders().X)
>>> x
<Var 'X'>
Result:
―――――――
   ID product  quantity        date
0   1     pen         2  2020-04-03
1   2     cup         3  2020-04-04
2   3     cup         5  2020-04-04
3   4   spoon         1  2020-04-05

>>> datetime_encoder = skrub.DatetimeEncoder(add_total_seconds=False)
>>> x.skb.apply(skrub.TableVectorizer(datetime=datetime_encoder))
<Apply TableVectorizer>
Result:
―――――――
    ID  product_cup  product_pen  ...  date_year  date_month  date_day
0  1.0          0.0          1.0  ...     2020.0         4.0       3.0
1  2.0          1.0          0.0  ...     2020.0         4.0       4.0
2  3.0          1.0          0.0  ...     2020.0         4.0       4.0
3  4.0          0.0          0.0  ...     2020.0         4.0       5.0
Transform only the 'product' column:

>>> x.skb.apply(skrub.StringEncoder(n_components=2), cols='product')
<Apply StringEncoder>
Result:
―――――――
   ID     product_0     product_1  quantity        date
0   1 -2.560113e-16  1.000000e+00         2  2020-04-03
1   2  1.000000e+00  7.447602e-17         3  2020-04-04
2   3  1.000000e+00  7.447602e-17         5  2020-04-04
3   4 -3.955170e-16 -8.326673e-17         1  2020-04-05
Transform all but the 'ID' and 'quantity' columns:

>>> x.skb.apply(
...     skrub.StringEncoder(n_components=2), exclude_cols=["ID", "quantity"]
... )
<Apply StringEncoder>
Result:
―――――――
   ID     product_0     product_1  quantity    date_0    date_1
0   1  9.775252e-08  7.830415e-01         2  0.766318 -0.406667
1   2  9.999999e-01  0.000000e+00         3  0.943929  0.330148
2   3  9.999998e-01 -1.490116e-08         5  0.943929  0.330149
3   4  9.910963e-08 -6.219692e-01         1  0.766318 -0.406668
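The rule behind the two examples above (a column is transformed when it is matched by cols and not matched by exclude_cols) can be pictured with plain Python lists. This is only an illustrative model and not how skrub resolves columns internally; select_columns is a hypothetical helper, and it handles only a column name or a list of names, not skrub selectors.

```python
# Illustrative model only: select_columns is a hypothetical helper,
# not part of skrub, and it does not support skrub selectors.
def select_columns(all_columns, cols=None, exclude_cols=None):
    """Return the columns matched by `cols` AND not matched by `exclude_cols`."""

    def as_set(spec, default):
        if spec is None:
            return set(default)
        if isinstance(spec, str):
            return {spec}
        return set(spec)

    matched = as_set(cols, all_columns)   # default: every column
    excluded = as_set(exclude_cols, [])   # default: exclude nothing
    # Keep the original column order of the table.
    return [c for c in all_columns if c in matched and c not in excluded]


columns = ["ID", "product", "quantity", "date"]
print(select_columns(columns, cols="product"))
# ['product']
print(select_columns(columns, exclude_cols=["ID", "quantity"]))
# ['product', 'date']
print(select_columns(columns, cols=["ID", "quantity"], exclude_cols="ID"))
# ['quantity']
```

The last call shows that exclude_cols narrows whatever cols matched, which is why both arguments can be combined.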
More complex selection of the columns to transform, here all numeric columns except the 'ID':

>>> from sklearn.preprocessing import StandardScaler
>>> from skrub import selectors as s

>>> x.skb.apply(StandardScaler(), cols=s.numeric() - "ID")
<Apply StandardScaler>
Result:
―――――――
   ID product        date  quantity
0   1     pen  2020-04-03 -0.507093
1   2     cup  2020-04-04  0.169031
2   3     cup  2020-04-04  1.521278
3   4   spoon  2020-04-05 -1.183216
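As an alternative to selecting columns explicitly, allow_reject=True lets the transformer decline columns it does not apply to. The pass-through behaviour described for that parameter can be modelled in a few lines of standard-library Python. Everything here is an assumption made for illustration: the RejectColumn exception, to_date and apply_columnwise are hypothetical names defined in the snippet, and the table is a plain dict of lists rather than a dataframe.

```python
from datetime import date


class RejectColumn(Exception):
    """Hypothetical signal that a transformer does not apply to a column."""


def to_date(values):
    """Parse ISO date strings; reject the column if any value is not one."""
    try:
        return [date.fromisoformat(v) for v in values]
    except (TypeError, ValueError) as exc:
        raise RejectColumn(str(exc)) from exc


def apply_columnwise(table, transform, allow_reject=False):
    """Apply `transform` to every column of `table` (a dict of lists)."""
    result = {}
    for name, values in table.items():
        try:
            result[name] = transform(values)
        except RejectColumn:
            if not allow_reject:
                raise  # models allow_reject=False: the error propagates
            result[name] = values  # rejected column is passed through unchanged
    return result


table = {
    "product": ["pen", "cup"],
    "quantity": [2, 3],
    "date": ["2020-04-03", "2020-04-04"],
}
converted = apply_columnwise(table, to_date, allow_reject=True)
print(converted["date"])      # parsed into datetime.date objects
print(converted["quantity"])  # passed through unchanged
```

Calling apply_columnwise(table, to_date) with the default allow_reject=False raises on the first column that cannot be parsed, which mirrors the error described for the default setting.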
For supervised estimators, pass the targets as the argument for y:

>>> from sklearn.dummy import DummyClassifier
>>> y = skrub.y(skrub.toy_orders().y)
>>> y
<Var 'y'>
Result:
―――――――
0    False
1    False
2     True
3    False
Name: delayed, dtype: bool

>>> x.skb.apply(skrub.TableVectorizer()).skb.apply(DummyClassifier(), y=y)
<Apply DummyClassifier>
Result:
―――――――
   delayed
0    False
1    False
2    False
3    False
Sometimes we want to pass a value for y because it is required for scoring and cross-validation, but it is not needed for fitting the estimator. In this case pass unsupervised=True.

>>> from sklearn.datasets import make_blobs
>>> from sklearn.cluster import KMeans

>>> X, y = make_blobs(n_samples=10, random_state=0)
>>> e = skrub.X(X).skb.apply(
...     KMeans(n_clusters=2, n_init=1, random_state=0),
...     y=skrub.y(y),
...     unsupervised=True,
... )
>>> e.skb.cross_validate()["test_score"]
array([-19.43734833, -12.46393769, -11.80428789, -37.23883226,
        -4.85785541])
>>> pipeline = e.skb.get_pipeline().fit({"X": X})
>>> pipeline.predict({"X": X})
array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0], dtype=int32)
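The examples above all rely on the default how='auto'. To make the difference between the three explicit modes concrete, here is a toy model written with the standard library only. It is a sketch under stated assumptions, not skrub's implementation: RecordingTransformer and apply_mode are hypothetical names, copy.deepcopy stands in for cloning an estimator, and the table is a dict of lists. The only thing it tracks is which columns each fitted transformer gets to see.

```python
import copy


class RecordingTransformer:
    """Toy transformer that remembers which columns it was fitted on."""

    def fit_transform(self, table):
        self.seen_ = list(table)  # column names, in order
        return table


def apply_mode(table, transformer, cols, how):
    """Toy model of the three explicit modes; returns the fitted transformers."""
    if how == "columnwise":
        fitted = []
        for name in cols:
            clone = copy.deepcopy(transformer)  # a separate clone per column
            clone.fit_transform({name: table[name]})
            fitted.append(clone)
        return fitted
    if how == "subframe":
        clone = copy.deepcopy(transformer)
        # The selected columns are passed together as a single table.
        clone.fit_transform({name: table[name] for name in cols})
        return [clone]
    if how == "full_frame":
        clone = copy.deepcopy(transformer)
        clone.fit_transform(table)  # the whole input table is passed directly
        return [clone]
    raise ValueError(f"unknown mode: {how!r}")


table = {"ID": [1, 2], "product": ["pen", "cup"], "quantity": [2, 3]}
for how in ("columnwise", "subframe", "full_frame"):
    fitted = apply_mode(table, RecordingTransformer(), ["product", "quantity"], how)
    print(how, [t.seen_ for t in fitted])
# columnwise [['product'], ['quantity']]
# subframe [['product', 'quantity']]
# full_frame [['ID', 'product', 'quantity']]
```

In the columnwise case two independent transformers are fitted, one per selected column, whereas the other two modes fit a single transformer on a wider table.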